Closed jotelha closed 3 years ago
Hi @jotelha,
First of all I'd like to understand the background a bit better, i.e. what type of information are you storing in the README, and why is it so large?
Secondly, this raises questions about consistency and what the behaviour should be in this situation. Looking at the register_dataset function:
I think it would partly register the dataset in this scenario. I.e. the "admin metadata" would get registered, but the "descriptive metadata" would not. I'm not sure if this is good or bad...
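The partial-registration failure mode described above can be sketched with two sequential inserts, where the second one fails after the first has already committed. All names below (register_dataset's signature, the collection names, the stand-in classes) are illustrative assumptions, not the actual dtool-lookup-server code:

```python
class RecordingCollection:
    """Stand-in for a pymongo collection that records inserts."""
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class FailingCollection:
    """Stand-in for a pymongo collection whose insert always fails."""
    def insert_one(self, doc):
        raise RuntimeError("document too large")


def register_dataset(db, admin_metadata, descriptive_metadata):
    # First insert: small, fixed-size admin metadata -- usually succeeds.
    db["datasets"].insert_one(admin_metadata)
    # Second insert: README/manifest-derived metadata -- this is the part
    # that can exceed MongoDB's 16 MB BSON limit and raise, leaving the
    # dataset half-registered.
    db["dataset_info"].insert_one(descriptive_metadata)
```

With a failing second collection, the admin metadata ends up registered even though the call as a whole raises, which is exactly the inconsistent "partly registered" state.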
As far as I understand we think it's not the README but the manifest that is too large.
@antoinesimtek Can you comment on this? Did the dataset in question have lots of files?
We think we identified the dataset. It is a dataset for which I increased the "max allowed number of items" environment variable (I couldn't find the exact name again) in order to be allowed to freeze it.
That's why I think the file causing the problem is something like the manifest file.
Here is the metadata of the dataset. The manifest file is 32 MB. 800f1b37-4bff-4013-805d-1a1ff61de1f2 _metada.zip
Yes, I was wrong about my initial README.yml suspicion, it's the 32 MB manifest.json.
```
$ ls -lhs .dtool
4,0K drwx------ 2 jotelha jotelha 4,0K Okt 15 15:58 annotations
4,0K -rwx------ 1 jotelha jotelha  234 Okt 17 02:05 dtool
 32M -rwx------ 1 jotelha jotelha  32M Okt 17 00:55 manifest.json
4,0K drwx------ 2 jotelha jotelha 4,0K Okt 15 15:58 overlays
4,0K -rwx------ 1 jotelha jotelha  923 Okt 15 15:58 README.txt
4,0K -rwx------ 1 jotelha jotelha  685 Okt 15 15:58 structure.json
4,0K drwx------ 2 jotelha jotelha 4,0K Okt 15 15:58 tags
```
There are > 120k files in that dataset:

```
$ cat manifest.json | grep relpath | wc -l
120440
```
In my opinion, in such a case the server should throw a warning but continue to register other datasets. This issue does not arise from any invalid dataset, but from limitations of the database configuration. Either we would have to catch any exception coming from the PyMongo interface at
and turn them into warnings, or we would look at what subset of the exceptions at https://pymongo.readthedocs.io/en/stable/api/pymongo/errors.html we might want to filter.
A third option to make the server resilient against large documents would be the way the same issue is handled in the Fireworks code, which I am quite familiar with. There, they use a simple object storage mechanism integrated within MongoDB (called GridFS, which I don't like very much for other reasons) to split oversized documents into chunks, i.e. here for inserting a document
and here for retrieving one:
https://github.com/materialsproject/fireworks/blob/07bace776fedefd09907272334a2c5925ffce51d/fireworks/core/launchpad.py#L2182-L2204. That would be a little more effort, but something worth keeping in mind for the future.
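The chunking approach can be sketched roughly as follows. Everything here is illustrative: the function names, the stub layout, and the 15 MB threshold (chosen as an assumption to stay under MongoDB's 16 MB BSON cap) are mine, not Fireworks or dtool-lookup-server code:

```python
import json

# MongoDB caps a single BSON document at 16 MB; keep some headroom.
# The exact threshold here is an assumption, not a value from any project.
OVERSIZE_THRESHOLD = 15 * 1024 * 1024


def is_oversized(doc, threshold=OVERSIZE_THRESHOLD):
    """Rough size check via the document's JSON serialization."""
    return len(json.dumps(doc).encode("utf-8")) >= threshold


def insert_metadata(db, name, doc):
    """Insert doc directly, or spill it into GridFS when oversized."""
    import gridfs  # ships with pymongo; imported lazily here
    if not is_oversized(doc):
        return db[name].insert_one(doc).inserted_id
    fs = gridfs.GridFS(db, collection=name + "_fs")
    gfs_id = fs.put(json.dumps(doc).encode("utf-8"))
    # Keep a small stub in the main collection that points at the blob.
    stub = {"uuid": doc.get("uuid"), "gridfs_id": gfs_id}
    return db[name].insert_one(stub).inserted_id


def retrieve_metadata(db, name, query):
    """Find a document, transparently resolving a GridFS stub if present."""
    import gridfs
    doc = db[name].find_one(query)
    if doc is not None and "gridfs_id" in doc:
        fs = gridfs.GridFS(db, collection=name + "_fs")
        doc = json.loads(fs.get(doc["gridfs_id"]).read().decode("utf-8"))
    return doc
```

The stub keeps queries on the main collection cheap, while the full document round-trips through GridFS only when it would otherwise exceed the BSON limit.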
BTW, from the former of the two snippets above we can readily see that we would have to catch pymongo.errors.DocumentTooLarge
in our particular case here.
Coming back to the README's format: here, too, we believe a resilient behavior would be desirable, with the server first trying to parse the README as YAML or JSON and then falling back to plain text in the case of failure, inserting the content one way or the other (i.e. as a nested structure or as a plain text field) into the MongoDB index. If you are planning to relax the constraints on the README.yml file format and allow plain text content in the future, that would leave the decision to each user whether to stick to the currently enforced structured content or not.
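Such a fallback could look like the sketch below. The "readme_text" key and the convention of always returning a dict are my assumptions, and PyYAML is assumed to be available:

```python
import json

import yaml  # PyYAML, assumed available


def parse_readme(text):
    """Parse README content leniently: try YAML first, then JSON, then
    fall back to wrapping the raw text in a plain-text field."""
    try:
        parsed = yaml.safe_load(text)
        if isinstance(parsed, dict):
            return parsed
    except yaml.YAMLError:
        pass
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    # Fallback: store the raw content under an assumed plain-text key.
    return {"readme_text": text}
```

Structured READMEs are indexed as nested documents as today, while free-form text still lands in the index as a searchable field instead of aborting registration.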
Side note: another way to get an overview of a dataset is to run dtool summary, which would have told you that there were 120,440 items in the dataset, e.g.:
```
$ dtool summary .\sample_datasets\Escherichia-coli-reads-ERR022075\
name: Escherichia-coli-reads-ERR022075
uuid: faa44606-cb86-4877-b9ea-643a3777e021
creator_username: olssont
number_of_items: 2
size: 3.6GiB
frozen_at: 2020-02-26
```
Thank you !
Yes, but I didn't have the dataset at hand, just the extracted hidden .dtool folder packaged by Antoine. That works all the same, though, when naming it .dtool and packing it into some wrapping directory:
```
$ dtool summary wrapping_directory
name: 1909_adhesion_distribution
uuid: 800f1b37-4bff-4013-805d-1a1ff61de1f2
creator_username: fr_as1412
number_of_items: 120440
size: 324.5GiB
frozen_at: 2020-10-17
```
I will spend a couple of hours looking at this now.
Hi Tjelvar, don't invest too much time for now. I think simply catching pymongo.errors.DocumentTooLarge (and maybe some other pymongo exceptions) at
or at some higher level, and turning them into warnings, would be sufficient at the current point.
I have implemented a fix in a branch called robust-registration (917e8a5).
@jotelha Could you check if it fixes the issue you reported?
Thanks, that looks good, I will put that into our testing instance on the weekend and let you know!
Works fine on our testing instance, thanks a lot! Please feel encouraged to release that :)
In the long run, what do you think of the idea of always registering datasets at least "partially" as long as they provide a unique URI / UUID tuple, even if parts of the descriptive metadata are corrupt (i.e. invalid YAML) or fail to register (as in this particular case of excessive manifest size), and in such a case storing the particular exception, turned into a warning, along with the record?
Fix included in 0.16.0 release: https://pypi.org/project/dtool-lookup-server/0.16.0/
Closing this issue.
We encounter the following issue, apparently related to the underlying MongoDB's limitations, when indexing datasets with
and I believe that might be due to some README.yml of excessive size (without having looked too carefully). The main issue, however, is that the indexing stops with a failure and would ignore all subsequent datasets on this URI forever. What about making that line
https://github.com/jic-dtool/dtool-lookup-server/blob/888b775a74b8666d12b0bccb47d511981e378053/dtool_lookup_server/cli.py#L212
exception-resilient as well?
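The exception-resilient behavior suggested here could be sketched as a per-dataset try/except around the registration call, so one oversized dataset no longer blocks the rest. All names below are illustrative, not the actual cli.py code:

```python
import logging

logger = logging.getLogger(__name__)


def index_base_uri(datasets, register_dataset):
    """Register every dataset, continuing past per-dataset failures.

    datasets: iterable of dataset descriptors; register_dataset: callable
    that raises on failure. Returns the descriptors that failed, so they
    can be reported at the end of the indexing run.
    """
    failed = []
    for dataset in datasets:
        try:
            register_dataset(dataset)
        except Exception as exc:
            logger.warning("failed to register %s: %s", dataset, exc)
            failed.append(dataset)
    return failed
```

Catching only a curated subset (e.g. pymongo.errors.DocumentTooLarge, as discussed above) instead of bare Exception would be the more conservative variant.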