Closed jotelha closed 3 years ago
Hi @jotelha,
First of all I'd like to understand the background a bit better, i.e. what type of information are you storing in the README, and why is it so large?
Secondly, this raises questions about consistency and what the behaviour should be in this situation. Looking at the register_dataset function:
I think it would partly register the dataset in this scenario. I.e. the "admin metadata" would get registered, but the "descriptive metadata" would not. I'm not sure if this is good or bad...
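The partial-registration failure mode described above can be sketched with two sequential inserts, where the second one fails after the first has already committed. All names below (register_dataset's signature, the collection names, the stand-in classes) are illustrative assumptions, not the actual dtool-lookup-server code:

```python
class RecordingCollection:
    """Stand-in for a pymongo collection that records inserts."""
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class FailingCollection:
    """Stand-in for a pymongo collection whose insert always fails."""
    def insert_one(self, doc):
        raise RuntimeError("document too large")


def register_dataset(db, admin_metadata, descriptive_metadata):
    # First insert: small, fixed-size admin metadata -- usually succeeds.
    db["datasets"].insert_one(admin_metadata)
    # Second insert: README/manifest-derived metadata -- this is the part
    # that can exceed MongoDB's 16 MB BSON limit and raise, leaving the
    # dataset half-registered.
    db["dataset_info"].insert_one(descriptive_metadata)
```

With a failing second collection, the admin metadata ends up registered even though the call as a whole raises, which is exactly the inconsistent "partly registered" state.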
As far as I understand we think it's not the README but the manifest that is too large.
@antoinesimtek Can you comment on this? Did the dataset in question have lots of files?
We think we identified the dataset. It is a dataset for which I increased the "max allowed number of items" environment variable (I couldn't find the exact name again) in order to be allowed to freeze it.
That's why I think the file causing the problem is something like the manifest file.
Here is the metadata of the dataset. The manifest file is 32 MB. 800f1b37-4bff-4013-805d-1a1ff61de1f2 _metada.zip
Yes, I was wrong about my initial README.yml suspicion, it's the 32 MB manifest.json.
```
$ ls -lhs .dtool
4,0K drwx------ 2 jotelha jotelha 4,0K Okt 15 15:58 annotations
4,0K -rwx------ 1 jotelha jotelha  234 Okt 17 02:05 dtool
 32M -rwx------ 1 jotelha jotelha  32M Okt 17 00:55 manifest.json
4,0K drwx------ 2 jotelha jotelha 4,0K Okt 15 15:58 overlays
4,0K -rwx------ 1 jotelha jotelha  923 Okt 15 15:58 README.txt
4,0K -rwx------ 1 jotelha jotelha  685 Okt 15 15:58 structure.json
4,0K drwx------ 2 jotelha jotelha 4,0K Okt 15 15:58 tags
```
There are > 120k files in that dataset:

```
$ cat manifest.json | grep relpath | wc -l
120440
```
In my opinion, in such a case the server should throw a warning but continue to register other datasets. This issue does not arise from any invalid dataset, but from limitations of the database configuration. Either we would have to catch any exception coming from the PyMongo interface at
and turn them into warnings, or we would look at what subset of the exceptions at https://pymongo.readthedocs.io/en/stable/api/pymongo/errors.html we might want to filter.
A third option to make the server resilient against large documents would be the way the same issue is handled in the Fireworks code, which I am quite familiar with. There, they use a simple object storage mechanism integrated within MongoDB (called GridFS, which I don't like very much for other reasons) to split oversized documents into chunks, i.e. here for inserting a document
and here for retrieving one:
https://github.com/materialsproject/fireworks/blob/07bace776fedefd09907272334a2c5925ffce51d/fireworks/core/launchpad.py#L2182-L2204. That would be a little more effort, but something worth keeping in mind for the future.
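The chunking approach can be sketched roughly as follows. Everything here is illustrative: the function names, the stub layout, and the 15 MB threshold (chosen as an assumption to stay under MongoDB's 16 MB BSON cap) are mine, not Fireworks or dtool-lookup-server code:

```python
import json

# MongoDB caps a single BSON document at 16 MB; keep some headroom.
# The exact threshold here is an assumption, not a value from any project.
OVERSIZE_THRESHOLD = 15 * 1024 * 1024


def is_oversized(doc, threshold=OVERSIZE_THRESHOLD):
    """Rough size check via the document's JSON serialization."""
    return len(json.dumps(doc).encode("utf-8")) >= threshold


def insert_metadata(db, name, doc):
    """Insert doc directly, or spill it into GridFS when oversized."""
    import gridfs  # ships with pymongo; imported lazily here
    if not is_oversized(doc):
        return db[name].insert_one(doc).inserted_id
    fs = gridfs.GridFS(db, collection=name + "_fs")
    gfs_id = fs.put(json.dumps(doc).encode("utf-8"))
    # Keep a small stub in the main collection that points at the blob.
    stub = {"uuid": doc.get("uuid"), "gridfs_id": gfs_id}
    return db[name].insert_one(stub).inserted_id


def retrieve_metadata(db, name, query):
    """Find a document, transparently resolving a GridFS stub if present."""
    import gridfs
    doc = db[name].find_one(query)
    if doc is not None and "gridfs_id" in doc:
        fs = gridfs.GridFS(db, collection=name + "_fs")
        doc = json.loads(fs.get(doc["gridfs_id"]).read().decode("utf-8"))
    return doc
```

The stub keeps queries on the main collection cheap, while the full document round-trips through GridFS only when it would otherwise exceed the BSON limit.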
BTW, from the former of the two snippets above we can readily see that we would have to catch pymongo.errors.DocumentTooLarge
in our particular case here.
Coming back to the README's format: here, too, we believe a resilient behavior would be desirable, with the server first trying to parse the README as YAML or JSON and then falling back to plain text in the case of failure, inserting the content one way or the other (i.e. as a nested structure or as a plain text field) into the MongoDB index. If you are planning to relax the constraints on the README.yml file format and allow plain text content in the future, that would leave the decision to each user whether to stick to the currently enforced structured content or not.
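Such a fallback could look like the sketch below. The "readme_text" key and the convention of always returning a dict are my assumptions, and PyYAML is assumed to be available:

```python
import json

import yaml  # PyYAML, assumed available


def parse_readme(text):
    """Parse README content leniently: try YAML first, then JSON, then
    fall back to wrapping the raw text in a plain-text field."""
    try:
        parsed = yaml.safe_load(text)
        if isinstance(parsed, dict):
            return parsed
    except yaml.YAMLError:
        pass
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    # Fallback: store the raw content under an assumed plain-text key.
    return {"readme_text": text}
```

Structured READMEs are indexed as nested documents as today, while free-form text still lands in the index as a searchable field instead of aborting registration.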
Side note: another way to get an overview of a dataset is to run dtool summary, which would have told you that there were 120,440 items in the dataset, e.g.:
```
$ dtool summary .\sample_datasets\Escherichia-coli-reads-ERR022075\
name: Escherichia-coli-reads-ERR022075
uuid: faa44606-cb86-4877-b9ea-643a3777e021
creator_username: olssont
number_of_items: 2
size: 3.6GiB
frozen_at: 2020-02-26
```
Thank you !
Yes, but I didn't have the dataset at hand, just the extracted hidden .dtool folder packaged by Antoine. That works all the same, though, when naming it .dtool and packing it into some wrapping directory:
```
$ dtool summary wrapping_directory
name: 1909_adhesion_distribution
uuid: 800f1b37-4bff-4013-805d-1a1ff61de1f2
creator_username: fr_as1412
number_of_items: 120440
size: 324.5GiB
frozen_at: 2020-10-17
```
I will spend a couple of hours looking at this now.
Hi Tjelvar, don't invest too much time for now. I think simply catching pymongo.errors.DocumentTooLarge (and maybe some other pymongo exceptions) at
or at some higher level, and turning them into warnings, would be sufficient at the current point.
I have implemented a fix in a branch called robust-registration (917e8a5).
@jotelha Could you check if it fixes the issue you reported?
Thanks, that looks good, I will put that into our testing instance on the weekend and let you know!
Works fine on our testing instance, thanks a lot! Please feel encouraged to release that :)
In the long run, what do you think of the idea of always registering datasets at least "partially" as long as they provide a unique URI / UUID tuple, even if parts of the descriptive metadata are corrupt (i.e. invalid YAML) or fail to register (as in this particular case of excessive manifest size), and in such a case storing the particular exception, turned into a warning, along with the record?
Fix included in 0.16.0 release: https://pypi.org/project/dtool-lookup-server/0.16.0/
Closing this issue.
We encounter the following issue, apparently related to the underlying MongoDB's limitations, when indexing datasets with
and I believe that might be due to some README.yml of excessive size (without having looked too carefully). The main issue, however, is that the indexing stops with a failure and would ignore all subsequent datasets on this URI forever. What about making that line
https://github.com/jic-dtool/dtool-lookup-server/blob/888b775a74b8666d12b0bccb47d511981e378053/dtool_lookup_server/cli.py#L212
exception-resilient as well?
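The exception-resilient behavior suggested here could be sketched as a per-dataset try/except around the registration call, so one oversized dataset no longer blocks the rest. All names below are illustrative, not the actual cli.py code:

```python
import logging

logger = logging.getLogger(__name__)


def index_base_uri(datasets, register_dataset):
    """Register every dataset, continuing past per-dataset failures.

    datasets: iterable of dataset descriptors; register_dataset: callable
    that raises on failure. Returns the descriptors that failed, so they
    can be reported at the end of the indexing run.
    """
    failed = []
    for dataset in datasets:
        try:
            register_dataset(dataset)
        except Exception as exc:
            logger.warning("failed to register %s: %s", dataset, exc)
            failed.append(dataset)
    return failed
```

Catching only a curated subset (e.g. pymongo.errors.DocumentTooLarge, as discussed above) instead of bare Exception would be the more conservative variant.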