alen-smajic opened this issue 2 years ago
Optimizing the App for large datasets is definitely something that we're working on, but it'll take some time.
A workaround for now is to use limit() and select_fields() to reduce the amount of data that you're trying to view in the App at any one time.
import fiftyone as fo

# If this is too slow
session = fo.launch_app(dataset)

# This will be faster
session.view = dataset.limit(1000).select_fields(["fields", "I", "care", "about"])
Hi everyone, yes I agree, quite a few improvements are needed here.
Not sure if it's worth opening a new issue because it is somewhat related.
I'm currently working with the Open Images v6 dataset, which I downloaded a while ago on one machine. Now I wanted to use it on another machine and realized that, although the data was moved over via an external drive, FiftyOne still has to "load" the dataset (no download seems to take place, but it loads every sample into its own database). File access on Windows is around 20-30 samples/sec and on Linux around 30-40 samples/sec, so the process takes several hours to finish, despite a fast multi-core CPU and an NVMe SSD. FiftyOne apparently uses only one thread, which explains the speed, though it should be possible to at least speed up the loading with threads (and maybe the download process, too).
I had a quick look at it again, as I'm now testing it inside Docker. The coco-2017 import is more or less OK, so I can live with that; a 10x speedup would mean 30s instead of 5min, which would of course be desirable but is not my main focus right now. 73861/73861 [4.7m elapsed, 0s remaining, 294.1 samples/s]
But there is indeed a big issue with the Open Images v6 dataset: 425339/425339 [5.6h elapsed, 0s remaining, 26.1 samples/s]. That is extremely slow, and it's really just a fraction of the dataset.
Is there anything that can be done to speed that up, especially regarding multi-core usage?
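One partial workaround (it does not fix the raw per-sample speed) is to load only the slice of Open Images you actually need via the zoo loader, which keeps the sample-creation phase short. A minimal sketch, assuming you only need detections for a few classes:

import fiftyone.zoo as foz

# Only load detections for the listed classes, capped at 1000 samples,
# instead of importing the entire Open Images v6 split
dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="validation",
    label_types=["detections"],
    classes=["Person", "Car"],
    max_samples=1000,
)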
Same here: a large dataset (500K samples) takes hours to import into FiftyOne. I tried the Beam integration (the docs are not updated, so I just followed the instructions in the code comments), but it failed at the end of the import. What is even more annoying is that you can't track the progress of the sample-loading phase.
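One way to at least track progress during a long import is to add the samples in batches, since add_samples() prints a progress bar per call. A minimal sketch, where image_paths, make_sample(), and the dataset name are placeholders for your own pipeline:

import fiftyone as fo

# Placeholder: your own list of image paths / metadata records
image_paths = [...]

def make_sample(image_path):
    # Placeholder: build a fo.Sample (plus labels) from your own metadata
    return fo.Sample(filepath=image_path)

dataset = fo.Dataset("my-large-dataset")  # placeholder name

# Add samples in batches; each add_samples() call prints a progress bar,
# so you can at least see how far the import has gotten
batch_size = 10_000
for i in range(0, len(image_paths), batch_size):
    batch = [make_sample(p) for p in image_paths[i : i + batch_size]]
    dataset.add_samples(batch)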
Is there any way to make loading datasets faster? Loading BDD100K takes about 30 minutes.
It would be great to see some more awareness of this issue. The loading times make FiftyOne tough to use for medium- to large-scale datasets.
Hi all. Scalability is a focus for FiftyOne Teams. Please reach out if you have a team!
@benjaminpkane, we're experiencing the same issue. I have a potential team; we're weighing this against other options (sharding Mongo). How does the Teams version solve the issue?
Hi @dinigo. If you join our Slack community and reach out to a Voxel51 team member we can dig into your use case more.
You could try reducing the size of the images displayed:
# ===== Dataset speed-up using thumbnails ===== #
import fiftyone as fo
import fiftyone.utils.image as foui

dataset = fo.load_dataset("Tomg_9_11_2024")

# Generate lower-resolution copies of the images (320px tall, preserving
# aspect ratio) and store their paths in a new "thumbnail_path" field
foui.transform_images(
    dataset,
    size=(-1, 320),
    output_field="thumbnail_path",
    output_dir="/tmp/thumbnails",
)

# Tell the App to render the thumbnails in the grid view
dataset.app_config.media_fields = ["filepath", "thumbnail_path"]
dataset.app_config.grid_media_field = "thumbnail_path"
dataset.save()  # must save after editing the dataset's App config
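After relaunching the App, the grid should feel noticeably lighter, since it now loads the small thumbnails rather than the full-resolution images (which should still be used when you expand a sample):

# Relaunch the App; the grid now renders the thumbnail images
session = fo.launch_app(dataset)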
System information
FiftyOne version (fiftyone --version): v0.15.0.1
Commands to reproduce
Describe the problem
I am currently working on a large pedestrian detection dataset (~145k test samples with over 5 million pedestrian bboxes, each annotated with several attributes), for which I am using the FiftyOne toolkit. I have already run inference on all the data with 6 different detection models and run the evaluations (all of this within FiftyOne). All of this is stored on a remote server.
Now that I have finished all the uploading, I am having problems loading the FiftyOne App or working with the dataset instance at all. It simply takes too long to load (even after 10 minutes of waiting the FiftyOne App is still loading), and creating views into the dataset also takes several minutes. I am also working with a smaller dataset (5000 samples), which works perfectly fine, so the connection to the remote server is stable enough.
While I understand that a bigger dataset of course takes longer to load, FiftyOne also supports other large datasets like COCO or Open Images and no one has complained about their loading speed, so does anyone have suggestions as to what the reason could be or how to speed it up? I could also delete some of the sample fields or detection attributes if that would help.
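For example, something along these lines, where the dataset and field names are just placeholders for my actual fields:

import fiftyone as fo

dataset = fo.load_dataset("pedestrian-detection")  # placeholder name

# Permanently drop prediction fields that are no longer needed
dataset.delete_sample_fields(["model_1_predictions", "model_2_predictions"])

# Or keep everything but only load a lightweight view in the App
session = fo.launch_app(dataset.limit(1000).select_fields("ground_truth"))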
Code to reproduce issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Please do not use screenshots for sharing text. Code snippets should be used instead when providing tracebacks, logs, etc.
What areas of FiftyOne does this bug affect?
App: FiftyOne application issue
Core: Core fiftyone Python library issue
Server: FiftyOne server issue
Willingness to contribute
The FiftyOne Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the FiftyOne codebase?