voxel51 / fiftyone

Refine high-quality datasets and visual AI models
https://fiftyone.ai
Apache License 2.0

[BUG] Duplicate collection name error when trying to create datasets in parallel #1375

Closed: SiftingSands closed this issue 3 years ago

SiftingSands commented 3 years ago

Since fo.Sample currently can't be pickled (it throws a recursion limit error), I tried to export my dataset in chunks (to be merged later if needed). However, I came across the following error.

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/fiftyone/core/odm/document.py", line 406, in save
    object_id = collection.insert_one(doc).inserted_id
  File "/home/user/.local/lib/python3.8/site-packages/pymongo/collection.py", line 705, in insert_one
    self._insert(document,
  File "/home/user/.local/lib/python3.8/site-packages/pymongo/collection.py", line 620, in _insert
    return self._insert_one(
  File "/home/user/.local/lib/python3.8/site-packages/pymongo/collection.py", line 609, in _insert_one
    self.__database.client._retryable_write(
  File "/home/user/.local/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1552, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/home/user/.local/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1438, in _retry_with_session
    return self._retry_internal(retryable, func, session, bulk)
  File "/home/user/.local/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1470, in _retry_internal
    return func(session, sock_info, retryable)
  File "/home/user/.local/lib/python3.8/site-packages/pymongo/collection.py", line 607, in _insert_command
    _check_write_command_response(result) 
  File "/home/user/.local/lib/python3.8/site-packages/pymongo/helpers.py", line 229, in _check_write_command_response
    _raise_last_write_error(write_errors) 
  File "/home/user/.local/lib/python3.8/site-packages/pymongo/helpers.py", line 210, in _raise_last_write_error
    raise DuplicateKeyError(error.get("errmsg"), 11000, error)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: fiftyone.datasets index: sample_collection_name_1 dup key: { sample_collection_name: "samples.2021.10.25.12.08.00" }, full error: {'index': 0, 'code': 11000, 'keyPattern': {'sample_collection_name': 1}, 'keyValue': {'sample_collection_name': 'samples.2021.10.25.12.08.00'}, 'errmsg': 'E11000 duplicate key error collection: fiftyone.datasets index: sample_collection_name_1 dup key: { sample_collection_name: "samples.2021.10.25.12.08.00" }'}

Here are the relevant parts of my export code. It appears that the MongoDB collection name is generated from a timestamp by default; is there a way to override/bypass this?

# distributed across N processes; `rank` identifies this process
# ... build `samples` by looping over the video filepaths ...
dataset = fo.Dataset(dataset_name + f'_{rank}')
dataset.add_samples(samples)

save_path = os.path.join(os.getcwd(), 'video_dataset', f'chunk_{rank}.zip')
dataset.export(
    export_dir=save_path,
    export_media=False,
    overwrite=True,
    dataset_type=fo.types.FiftyOneDataset,
)
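
(For reference, the pickling failure I mentioned shows up with something as simple as the snippet below; the filepath is just a placeholder.)

import pickle

import fiftyone as fo

sample = fo.Sample(filepath="/path/to/video.mp4")

# in my environment this raises RecursionError ("maximum recursion depth exceeded")
pickle.dumps(sample)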

Thanks!

brimoor commented 3 years ago

Why do you want to export the dataset in the first place? Are you aware that you can make datasets persistent?

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart-video")
dataset.name = "my-dataset"

# persistent datasets are not deleted when all FiftyOne sessions exit
dataset.persistent = True

Then, in another Python session:

import fiftyone as fo

dataset = fo.load_dataset("my-dataset")
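
You can confirm it's there by listing the datasets in the database:

import fiftyone as fo

print(fo.list_datasets())  # should include "my-dataset"
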
brimoor commented 3 years ago

As for the error you're seeing, it is happening in fo.Dataset() (the rest of the traceback is not relevant), because you are creating the datasets from parallel processes.

When a dataset is created, the current timestamp is used to generate the names of its backing collections. If multiple datasets are created within the same second, milliseconds are appended to disambiguate. However, with multiprocessing the creation code runs so nearly simultaneously that multiple processes each conclude that the same collection names are free, because each of them makes that decision before any of them manages to actually insert its collections into the DB. The relevant code is here:

https://github.com/voxel51/fiftyone/blob/a8b8eb90a50f3903913e2fe7cd45134bba83b37e/fiftyone/core/dataset.py#L4490-L4509

If this parallelized dataset creation is really essential, we can consider making the collection name generation thread-safe, but I tend to think that there's a better way to achieve your desired workflow. Let's see.
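
For example, here's a rough sketch (untested, just to illustrate the idea; the names and worker count are placeholders) of one way to sidestep the race: create each per-chunk dataset serially in the main process before the workers start, then have each worker load its dataset by name:

import multiprocessing as mp

import fiftyone as fo

NUM_WORKERS = 4             # illustrative
DATASET_NAME = "my_videos"  # illustrative base name

def worker(rank):
    # the dataset (and its collections) already exists, so no collection
    # names are generated inside the worker processes
    dataset = fo.load_dataset(f"{DATASET_NAME}_{rank}")
    # ... add samples / export as before ...

if __name__ == "__main__":
    # serial creation: each call generates its collection names one at a time
    for rank in range(NUM_WORKERS):
        fo.Dataset(f"{DATASET_NAME}_{rank}", persistent=True)

    with mp.Pool(NUM_WORKERS) as pool:
        pool.map(worker, range(NUM_WORKERS))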

brimoor commented 3 years ago

btw you might find this Apache Beam parallelization work interesting: https://github.com/voxel51/fiftyone/pull/1370

SiftingSands commented 3 years ago

I appreciate the detailed responses. For some reason, my thought process was locked into exporting to disk instead of just using the persistence feature you described and reloading things.

Closing as I can bypass the duplicate name error.