voxel51 / fiftyone

The open-source tool for building high-quality datasets and computer vision models
https://fiftyone.ai
Apache License 2.0

[BUG] Unable to sort large datasets #448

Open ehofesmann opened 4 years ago

ehofesmann commented 4 years ago

Describe the problem

When trying to sort a large dataset (>100MB), MongoDB throws an error: OperationFailure: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting

This came up when running the evaluate detections tutorial on 5000 samples in coco-2017.
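
For reference, here is a minimal sketch of the kind of sort that triggers the error (the dataset name is a placeholder, and it assumes the tp_iou_0_75 field has been populated on each sample as in the tutorial):

import fiftyone as fo

# Placeholder name; assumes an existing dataset with the evaluation
# field populated as in the tutorial
dataset = fo.load_dataset("coco-2017-eval")

# sort_by() runs a MongoDB $sort stage under the hood; on a large
# dataset, sorting on an unindexed field hits the 100MB in-memory limit
view = dataset.sort_by("tp_iou_0_75", reverse=True)
print(view.first())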

One way around this is to set allowDiskUse: True for the sort pipeline stage, but this will likely make sorting slow since it spills to disk. https://docs.mongodb.com/manual/reference/operator/aggregation/sort/#sort-and-memory-restrictions
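
As an illustration, a sketch of that workaround via pymongo (assumes `dataset` is an existing fo.Dataset; the collection handle is obtained the same way as in the snippet in the next comment):

import fiftyone.core.odm as foo

# Get a pymongo handle on the dataset's underlying sample collection
conn = foo.get_db_conn()
coll = conn[dataset._doc.sample_collection_name]

pipeline = [
    # Sort on the unindexed evaluation field from the tutorial
    {"$sort": {"yolov4.ground_truth_eval.true_positives.0_75": -1}},
]

# allowDiskUse=True opts the $sort stage in to external sorting: it
# spills to temporary files instead of erroring at the 100MB limit
cursor = coll.aggregate(pipeline, allowDiskUse=True)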

Another, possibly faster, approach is to first create a MongoDB index on the field being sorted and then sort on that field, which avoids the in-memory sort limit entirely. In the evaluate detections tutorial, the error arose when sorting by the newly created field tp_iou_0_75, but I could still sort by filepath, most likely because an index already exists for it.
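
One way to check that is to list the indexes that already exist on the sample collection with pymongo's index_information() (reusing the coll handle from the sketch above):

# Maps index names to their specs; _id is always indexed, and filepath
# appears here if FiftyOne indexed it by default
print(coll.index_information())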

Code to reproduce issue

https://voxel51.com/docs/fiftyone/tutorials/evaluate_detections.html

ehofesmann commented 4 years ago

I was able to create an index on 5000 samples in 0.1 seconds with this:

import fiftyone as fo
import fiftyone.core.odm as foo

# `dataset` is an existing fo.Dataset, e.g. loaded via fo.load_dataset()
# Get a pymongo handle on the dataset's underlying sample collection
conn = foo.get_db_conn()
collection = conn[dataset._doc.sample_collection_name]

# Index the evaluation field so sorts on it can use the index rather
# than an in-memory sort
collection.create_index("yolov4.ground_truth_eval.true_positives.0_75")

This is probably the way to go for fixing this issue.
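
For completeness, a sketch of the follow-up sort (assumes the setup above; a leading $sort stage can be satisfied from the index, so this should no longer hit the memory limit):

# Sort on the newly indexed field
view = dataset.sort_by("yolov4.ground_truth_eval.true_positives.0_75")
print(view.first())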

benjaminpkane commented 4 years ago

Yes, yes, and yes to indexes!!!