voxel51 / fiftyone

The open-source tool for building high-quality datasets and computer vision models
https://fiftyone.ai
Apache License 2.0

[FR] Improvement FiftyOne App performance on large datasets #1740

Open alen-smajic opened 2 years ago

alen-smajic commented 2 years ago

Commands to reproduce

import fiftyone as fo
from fiftyone import ViewField as F

# dataset_name, prediction_fields and conf_thresholds come from the configuration (not shown)
dataset = fo.load_dataset(dataset_name)  # Loads the FiftyOne dataset instance

# Filter out low-confidence detections made by the models, based on the
# model-specific confidence thresholds specified in the configuration
for model in prediction_fields:
    dataset = dataset.filter_labels(
        prediction_fields[model],
        F("confidence") >= conf_thresholds[dataset_name][model],
        only_matches=False,
    )

print(dataset)  # Logs the dataset (at this point a view with one FilterLabels stage per model)
Dataset:     KIA-Pedestrian-Detection
Media type:  image
Num samples: 145518
Tags:        ['KIA_test_split']
Sample fields:
    id:                                          fiftyone.core.fields.ObjectIdField
    filepath:                                    fiftyone.core.fields.StringField
    tags:                                        fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:                                    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth:                                fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    company_name:                                fiftyone.core.fields.StringField
    tranche:                                     fiftyone.core.fields.StringField
    sequence:                                    fiftyone.core.fields.IntField
    file_name:                                   fiftyone.core.fields.StringField
    fog_intensity:                               fiftyone.core.fields.IntField
    vignette_intensity:                          fiftyone.core.fields.IntField
    wetness_type:                                fiftyone.core.fields.StringField
    wetness_intensity:                           fiftyone.core.fields.IntField
    puddles_intensity:                           fiftyone.core.fields.IntField
    lens_flare_intensity:                        fiftyone.core.fields.IntField
    brightness:                                  fiftyone.core.fields.FloatField
    contrast:                                    fiftyone.core.fields.FloatField
    edge_strength:                               fiftyone.core.fields.FloatField
    pedestrian_amount:                           fiftyone.core.fields.IntField
    safety_relevant_pedestrian_amount:           fiftyone.core.fields.IntField
    daytime_type:                                fiftyone.core.fields.StringField
    sky_type:                                    fiftyone.core.fields.StringField
    sun_visible:                                 fiftyone.core.fields.BooleanField
    prediction_FasterRCNN-ResNet50-OD:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    prediction_FCOS-ResNet50-OD:                 fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    prediction_RetinaNet-ResNet50-OD:            fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    prediction_SSD300-ResNet50-OD:               fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    prediction_KeypointRCNN-ResNet50-KD:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    eval_prediction_FasterRCNN_ResNet50_OD_tp:   fiftyone.core.fields.IntField
    eval_prediction_FasterRCNN_ResNet50_OD_fp:   fiftyone.core.fields.IntField
    eval_prediction_FasterRCNN_ResNet50_OD_fn:   fiftyone.core.fields.IntField
    eval_prediction_FCOS_ResNet50_OD_tp:         fiftyone.core.fields.IntField
    eval_prediction_FCOS_ResNet50_OD_fp:         fiftyone.core.fields.IntField
    eval_prediction_FCOS_ResNet50_OD_fn:         fiftyone.core.fields.IntField
    eval_prediction_RetinaNet_ResNet50_OD_tp:    fiftyone.core.fields.IntField
    eval_prediction_RetinaNet_ResNet50_OD_fp:    fiftyone.core.fields.IntField
    eval_prediction_RetinaNet_ResNet50_OD_fn:    fiftyone.core.fields.IntField
    eval_prediction_SSD300_ResNet50_OD_tp:       fiftyone.core.fields.IntField
    eval_prediction_SSD300_ResNet50_OD_fp:       fiftyone.core.fields.IntField
    eval_prediction_SSD300_ResNet50_OD_fn:       fiftyone.core.fields.IntField
    eval_prediction_KeypointRCNN_ResNet50_KD_tp: fiftyone.core.fields.IntField
    eval_prediction_KeypointRCNN_ResNet50_KD_fp: fiftyone.core.fields.IntField
    eval_prediction_KeypointRCNN_ResNet50_KD_fn: fiftyone.core.fields.IntField
    prediction_MaskRCNN-ResNet50-IS:             fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    eval_prediction_MaskRCNN_ResNet50_IS_tp:     fiftyone.core.fields.IntField
    eval_prediction_MaskRCNN_ResNet50_IS_fp:     fiftyone.core.fields.IntField
    eval_prediction_MaskRCNN_ResNet50_IS_fn:     fiftyone.core.fields.IntField
View stages:
    1. FilterLabels(field='prediction_F...N-ResNet50-OD', filter={'$gte': ['$$this.confidence', 0.9]}, only_matches=False, trajectories=False)
    2. FilterLabels(field='prediction_M...N-ResNet50-IS', filter={'$gte': ['$$this.confidence', 0.9]}, only_matches=False, trajectories=False)
    3. FilterLabels(field='prediction_K...N-ResNet50-KD', filter={'$gte': ['$$this.confidence', 0.9]}, only_matches=False, trajectories=False)
    4. FilterLabels(field='prediction_FCOS-ResNet50-OD', filter={'$gte': ['$$this.confidence', 0.55]}, only_matches=False, trajectories=False)
    5. FilterLabels(field='prediction_R...t-ResNet50-OD', filter={'$gte': ['$$this.confidence', 0.45]}, only_matches=False, trajectories=False)
    6. FilterLabels(field='prediction_S...0-ResNet50-OD', filter={'$gte': ['$$this.confidence', 0.3]}, only_matches=False, trajectories=False)
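The loop above passes `only_matches=False`, which keeps every sample in the view even when none of its detections pass the threshold. The following is a minimal plain-Python sketch of that behavior, not FiftyOne's implementation: the dict-based samples, the `pred` field name, and the `filter_detections` helper are all hypothetical stand-ins.

```python
from typing import Dict, List

# Hypothetical stand-in for a sample: a dict mapping a label field name
# to a list of detections, each carrying a "confidence" score.
Sample = Dict[str, List[dict]]


def filter_detections(
    samples: List[Sample], field: str, threshold: float, only_matches: bool
) -> List[Sample]:
    """Keep only the detections in `field` with confidence >= threshold."""
    result = []
    for sample in samples:
        kept = [d for d in sample.get(field, []) if d["confidence"] >= threshold]
        if only_matches and not kept:
            # With only_matches=True, samples without any match are dropped.
            continue
        filtered = dict(sample)  # shallow copy, the input is left untouched
        filtered[field] = kept
        result.append(filtered)
    return result


samples = [
    {"pred": [{"label": "person", "confidence": 0.95},
              {"label": "person", "confidence": 0.40}]},
    {"pred": [{"label": "person", "confidence": 0.10}]},
]

kept_all = filter_detections(samples, "pred", 0.9, only_matches=False)
kept_matching = filter_detections(samples, "pred", 0.9, only_matches=True)
print(len(kept_all), len(kept_matching))
```

With `only_matches=False` the second sample stays in the result with an empty detection list, which is why the filtered view still covers the full sample count.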

Describe the problem

I am currently working on a large pedestrian detection dataset (~145k test samples with over 5 million pedestrian bounding boxes, each annotated with several attributes), for which I am using the FiftyOne toolkit. I have already run inference on all of the data with 6 different detection models and run the evaluations, all within FiftyOne. Everything is stored on a remote server.

Now that the upload is finished, I am having problems loading the FiftyOne App or working with the dataset instance at all. It simply takes too long: the App is still loading after 10 minutes of waiting, and creating views into the dataset takes several minutes. I am also working on a smaller dataset (5000 samples) that works perfectly fine, so the connection to the remote server is stable enough.

I understand that a bigger dataset takes more time to load, but FiftyOne also supports other large datasets such as COCO and Open Images, and I have not seen complaints about their loading speed. Does anyone have suggestions as to what the cause could be, or how to speed things up? I could also delete some of the sample fields or detection attributes if that would help.

brimoor commented 2 years ago

Optimizing the App for large datasets is definitely something that we're working on, but it'll take some time.

A workaround for now is to use limit() and select_fields() to reduce the amount of data that you're trying to view in the App at any one time.

import fiftyone as fo

# If this is too slow
session = fo.launch_app(dataset)

# This will be faster
session.view = dataset.limit(1000).select_fields(["fields", "I", "care", "about"])
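The snippet above shows a single 1000-sample window. To browse the rest of a large dataset the same way, the window can be moved page by page. The sketch below only computes the page offsets in plain Python. `page_windows` is a hypothetical helper, and combining the offsets with a `skip()` view stage next to `limit()` is an assumption shown in comments only.

```python
from typing import Iterator, Tuple


def page_windows(num_samples: int, page_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (skip, limit) pairs that cover num_samples in fixed-size pages."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for skip in range(0, num_samples, page_size):
        # The last page may be shorter than page_size.
        yield skip, min(page_size, num_samples - skip)


# 145518 is the sample count reported for the dataset in this issue.
windows = list(page_windows(145518, 1000))
print(len(windows), windows[0], windows[-1])

# Assumed usage with FiftyOne (not executed here):
#   skip, limit = windows[3]
#   session.view = dataset.skip(skip).limit(limit).select_fields(["ground_truth"])
```

Each page keeps the amount of data sent to the App bounded, at the cost of having to switch views manually.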

alexk-ede commented 2 years ago

Hi everyone, yes I agree, quite a few improvements are needed here.

I'm not sure it's worth opening a new issue, because this is closely related.

I'm currently working with the oi6 (Open Images V6) dataset, which I downloaded a while ago on one machine. I now wanted to use it on another machine and realized that, although the data was moved via an external drive, FiftyOne still "loads" the dataset. (No download seems to take place, but it loads every sample into its own database.) Throughput is around 20-30 samples/sec on Windows and around 30-40 samples/sec on Linux, so loading takes several hours to finish, despite a fast multicore CPU and an NVMe SSD. FiftyOne apparently uses only one thread, which explains the speed. It should be possible to at least speed up the loading by using threads (and maybe the download process, too).
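As a rough illustration of the thread-based loading suggested above, here is a generic sketch using Python's standard `concurrent.futures`. It says nothing about FiftyOne's internals: `build_record` is a hypothetical placeholder for the per-sample work, the file paths are made up, and the premise that this work is I/O-bound (so that threads help despite the GIL) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Iterable, List


def build_record(filepath: str) -> Dict[str, str]:
    """Hypothetical per-file work; a real loader might read image metadata here."""
    path = Path(filepath)
    return {"filepath": filepath, "name": path.name, "suffix": path.suffix}


def build_records_parallel(
    filepaths: Iterable[str], max_workers: int = 8
) -> List[Dict[str, str]]:
    """Build one record per file using a pool of worker threads.

    executor.map preserves input order, so results line up with filepaths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(build_record, filepaths))


# Made-up paths, only used to exercise the sketch.
paths = [f"/data/images/{i:06d}.jpg" for i in range(1000)]
records = build_records_parallel(paths)
print(len(records), records[0]["name"])
```

The records could then be written to the database in batches from a single thread, which keeps the parallel part free of shared state.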

alexk-ede commented 2 years ago

I had another quick look at this, as I'm now testing it inside Docker. The coco-2017 import is more or less OK, so I can live with it, because a 10x speedup would only turn 5 min into 30 s. That would of course be desirable, but it's not my main focus right now.
73861/73861 [4.7m elapsed, 0s remaining, 294.1 samples/s]

But there is indeed a big issue with the Open Images V6 dataset:
425339/425339 [5.6h elapsed, 0s remaining, 26.1 samples/s]
That's extremely slow, and it's really just a fraction of the dataset.

Is there anything that can be done to speed that up? Especially regarding multicore ...
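For reference, the arithmetic behind these numbers can be worked out from the sample counts and elapsed times in the two progress lines. This is only a back-of-the-envelope sketch: the helper names are made up, and the 10x factor is the hypothetical speedup mentioned above, not a measured value.

```python
def average_rate(num_samples: int, elapsed_seconds: float) -> float:
    """Average throughput in samples per second over the whole import."""
    return num_samples / elapsed_seconds


def time_after_speedup(elapsed_seconds: float, speedup: float) -> float:
    """Elapsed time if the whole import ran `speedup` times faster."""
    return elapsed_seconds / speedup


coco_elapsed = 4.7 * 60           # 4.7 minutes, from the coco-2017 progress line
open_images_elapsed = 5.6 * 3600  # 5.6 hours, from the Open Images V6 progress line

coco_rate = average_rate(73861, coco_elapsed)
open_images_rate = average_rate(425339, open_images_elapsed)

coco_fast = time_after_speedup(coco_elapsed, 10)                     # seconds
open_images_fast = time_after_speedup(open_images_elapsed, 10) / 60  # minutes

print(f"coco-2017: {coco_rate:.1f} samples/s, {coco_fast:.1f} s at 10x")
print(f"open-images-v6: {open_images_rate:.1f} samples/s, {open_images_fast:.1f} min at 10x")
```

The averages come out a little lower than the rates shown in the progress lines, presumably because the progress bar reports a recent rate rather than the mean over the whole run.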

wematan commented 1 year ago

Same here: a large dataset (500K samples) takes hours to import into FiftyOne. I tried using the Beam integration (the docs are not up to date, so I just followed the instructions in the code comments), but the import failed at the end of the process. What is more annoying is that you can't track the progress of the sample-loading phase.

rusmux commented 1 year ago

Is there any way to make loading datasets faster? Loading BDD100K takes about 30 minutes.

stwerner97 commented 5 months ago

It would be great to see more attention on this issue. The loading times make FiftyOne tough to use for medium- to large-scale datasets.

benjaminpkane commented 5 months ago

Hi all. Scalability is a focus for Teams. Please reach out if you have a team!

dinigo commented 1 month ago

@benjaminpkane, we're experiencing the same issue. I have a potential team, and we're weighing this against other options (sharding Mongo). How does the "Teams" version solve the issue?

benjaminpkane commented 1 month ago

Hi @dinigo. If you join our Slack community and reach out to a Voxel51 team member we can dig into your use case more.