microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/

`pure-export` fallback data source may not match primary data source #1621

Open eecavanna opened 7 months ago

eecavanna commented 7 months ago

The make target local/mongo_as_unvalidated_nmdc_database.yaml runs a command called pure-export, passing it several CLI options, including:

A user could have localhost port 27777 mapped to a different Mongo database than the one behind the web server at https://api.microbiomedata.org. In that case, whenever the code behind the pure-export command uses a Mongo client, it reads from a different Mongo database than the one it reads from when it uses HTTP.

Here's a message the command showed when I ran it:

Attempting to get collection stats from functional_annotation_agg
estimated_document_count = 11816893
Attempting to get 200000 documents from nmdcschema/functional_annotation_agg in pages of 200000.
warning: 500 Server Error: Internal Server Error for url: https://api.microbiomedata.org/nmdcschema/functional_annotation_agg?max_page_size=200000
warning: FastAPI request to nmdcschema/functional_annotation_agg appears to have failed. Trying as a PyMongo query.

I want to emphasize the final line: warning: FastAPI request to nmdcschema/functional_annotation_agg appears to have failed. Trying as a PyMongo query.

This implies to me that the underlying code first tried to fetch data via HTTP (which involved one Mongo database) and then was going to try via a Mongo client (which would involve a different Mongo database).
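To make the concern concrete, here is a minimal sketch of the try-HTTP-then-Mongo pattern the log suggests. It is not the actual pure-export code: `fetch_with_fallback`, `failing_api`, and `local_mongo` are hypothetical names, and the two fetchers are stand-ins so the example runs without a network or a database.

```python
from typing import Callable, Iterable, List, Tuple

Fetcher = Callable[[str], Iterable[dict]]


def fetch_with_fallback(
    collection: str,
    fetch_via_api: Fetcher,
    fetch_via_mongo: Fetcher,
) -> Tuple[str, List[dict]]:
    """Try the HTTP fetcher first; on any failure, fall back to the Mongo fetcher.

    Returns (source_label, documents) so the caller can see which backend answered.
    """
    try:
        return "api", list(fetch_via_api(collection))
    except Exception as err:
        print(f"warning: API request to {collection} appears to have failed ({err}). "
              "Trying as a Mongo query.")
        return "mongo", list(fetch_via_mongo(collection))


# Hypothetical stand-ins: the "API" fails, and the "local Mongo" holds different data.
def failing_api(collection: str) -> Iterable[dict]:
    raise RuntimeError("500 Server Error")


def local_mongo(collection: str) -> Iterable[dict]:
    return [{"id": "local-only-doc"}]


source, docs = fetch_with_fallback("functional_annotation_agg", failing_api, local_mongo)
```

Unless the caller inspects `source`, nothing tells them that `docs` came from whatever database the local port happens to point at rather than from the one behind the API.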

turbomam commented 7 months ago

Yeah, pure-export needs a lot of work! I already have a few related issues that I will add soon. @brynnz22 and I have had some conversations about the local/mongo_as_unvalidated_nmdc_database.yaml target.

Maybe there should be subcommands for

Or do you think that doing away with the hybrid/fallback solution is a bad idea, @eecavanna?
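As a rough sketch of what doing away with the fallback could look like, assuming the caller names exactly one source and a failure is reported rather than silently retried elsewhere. The function `fetch_from_one_source`, the source labels, and the stand-in fetchers are hypothetical and not part of pure-export today.

```python
from typing import Callable, Dict, Iterable, List

Fetcher = Callable[[str], Iterable[dict]]


def fetch_from_one_source(
    collection: str,
    source: str,
    fetchers: Dict[str, Fetcher],
) -> List[dict]:
    """Fetch from the single named source; errors propagate instead of triggering a fallback."""
    if source not in fetchers:
        raise ValueError(f"unknown source {source!r}; expected one of {sorted(fetchers)}")
    return list(fetchers[source](collection))


calls: List[str] = []  # records which backends were actually contacted


def failing_api(collection: str) -> Iterable[dict]:
    calls.append("api")
    raise RuntimeError("500 Server Error")


def local_mongo(collection: str) -> Iterable[dict]:
    calls.append("mongo")
    return [{"id": "local-only-doc"}]


fetchers = {"api": failing_api, "mongo": local_mongo}

try:
    fetch_from_one_source("functional_annotation_agg", "api", fetchers)
    outcome = "succeeded"
except RuntimeError:
    outcome = "failed without touching Mongo"
```

With this shape, an API failure surfaces as an error and the Mongo fetcher is never called, so the output can only come from the source the user asked for.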

Do you have suggestions for splitting the improvements to pure-export into small commits and then prioritizing them? I would love to work on this with you.

We could also include @mbthornton-lbl because he wrote an API-only dumper with fewer configuration options that only returns records (from multiple collections) if they have some path back to a named Study. We worked on that together to

I would like to either integrate get-study-related-records with pure-export or at least to make the launching and configuration even more similar.