Closed: SiftingSands closed this issue 3 years ago
Why do you want to export the dataset in the first place? Are you aware that you can make datasets persistent?
import fiftyone as fo
import fiftyone.zoo as foz

# load a zoo dataset, give it a name, and mark it persistent so it
# survives beyond this Python session
dataset = foz.load_zoo_dataset("quickstart-video")
dataset.name = "my-dataset"
dataset.persistent = True
Then, in another Python session:
import fiftyone as fo
dataset = fo.load_dataset("my-dataset")
As for the error you're seeing, it is happening in fo.Dataset() (the rest of the traceback is not relevant) because you are using parallel processes to create the datasets.
When datasets are created, the current timestamp is used to generate some collection names for them. If multiple datasets are created within the same second, milliseconds are appended to disambiguate. However, when you use multiprocessing, the creation code runs so nearly simultaneously that multiple processes decide the same collection names are available, because each of them makes that decision before any of them actually inserts its collections into the DB. The relevant code is here:
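To illustrate the race, here is a toy sketch of second-resolution, timestamp-based naming under multiprocessing; make_collection_name is a hypothetical stand-in, not fiftyone's actual implementation:

from datetime import datetime
from multiprocessing import Pool

def make_collection_name(_):
    # hypothetical stand-in for the real name-generation logic:
    # second-resolution timestamps collide when workers start together
    return "samples." + datetime.now().strftime("%Y.%m.%d.%H.%M.%S")

if __name__ == "__main__":
    with Pool(4) as pool:
        names = pool.map(make_collection_name, range(4))
    # all four workers typically compute the same name, and none of them
    # has inserted a collection yet, so every one thinks the name is free
    print(names)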
If this parallelized dataset creation is really essential, we can consider making the collection name generation thread-safe, but I tend to think there's a better way to achieve your desired workflow. Let's see.
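One possible pattern, sketched under the assumption that your workers only need to populate already-created datasets (the chunk names below are illustrative, and whether forked workers fit your environment is something you'd need to verify): create the datasets serially in the main process so the name generation never races, then let each worker load its dataset by name and add samples to it.

import multiprocessing

import fiftyone as fo

def populate(dataset_name):
    # each worker loads an already-created dataset by name and adds samples
    dataset = fo.load_dataset(dataset_name)
    # ... build and add samples here, e.g. dataset.add_samples(...) ...
    return dataset_name

if __name__ == "__main__":
    # create the datasets one at a time in the parent process,
    # so no two creations race on the same timestamp
    names = ["my-dataset-chunk-%d" % i for i in range(4)]
    for name in names:
        dataset = fo.Dataset(name)
        dataset.persistent = True

    with multiprocessing.Pool(4) as pool:
        pool.map(populate, names)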
btw you might find this Apache Beam parallelization work interesting: https://github.com/voxel51/fiftyone/pull/1370
I appreciate the detailed responses. For some reason, my thought process was locked into exporting to disk instead of just using the persistence feature you described and reloading things.
Closing as I can bypass the duplicate name error.
Since fo.Sample currently can't be pickled (it throws a recursion limit error), I tried to export my dataset in chunks (to merge later if needed). However, I came across the following error. Here are the relevant parts of my export code. It appears that the MongoDB key is set to a timestamp by default; is there a way to override/bypass this?
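A minimal sketch of this kind of chunked export (the dataset names, chunking, and paths below are placeholders, not the exact code):

import multiprocessing

import fiftyone as fo

def export_chunk(args):
    chunk_id, filepaths = args
    # creating a new dataset inside each worker is what hits the
    # duplicate collection name error
    dataset = fo.Dataset("export-chunk-%d" % chunk_id)
    dataset.add_samples([fo.Sample(filepath=fp) for fp in filepaths])
    dataset.export(
        export_dir="/tmp/exports/chunk-%d" % chunk_id,  # placeholder path
        dataset_type=fo.types.FiftyOneDataset,
    )

if __name__ == "__main__":
    # placeholder chunks of media paths
    chunks = [
        (0, ["/data/videos/a.mp4", "/data/videos/b.mp4"]),
        (1, ["/data/videos/c.mp4", "/data/videos/d.mp4"]),
    ]
    with multiprocessing.Pool(2) as pool:
        pool.map(export_chunk, chunks)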
Thanks!