turbomam closed this issue 1 year ago
Trace this requirement back to the following targets in project.Makefile:
dump-validate-report-mongodb: mongodb-cleanup accepting_legacy_ids_all \
local/mongodb-collection-report.txt \
local/selected_mongodb_contents.json \
local/selected_mongodb_contents_jsonschema_check.txt \
linkml-validate-mongodb \
local/selected_mongodb_contents.json.gz
dump-validate-report-convert-mongodb: mongodb-cleanup \
local/selected_mongodb_contents_fully_repaired.yaml \
local/selected_mongodb_contents_fully_repaired.yaml.gz \
local/selected_mongodb_contents_fully_repaired.ttl \
local/selected_mongodb_contents_fully_repaired.ttl.gz
Both targets start with the mongodb_exporter CLI, which pyproject.toml defines as follows:
mongodb_exporter = "nmdc_schema.mongodb_direct_to_nmdc_Database_file:export_to_yaml"
We will be using methods from https://api.microbiomedata.org/docs
There doesn't seem to be a "get collection names" method. We may still need to get the names from a direct MongoDB connection for now, which generally requires a NERSC ssh key, a NERSC tunnel, and MongoDB credentials.
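A minimal sketch of the direct-connection route, for reference. It assumes the NERSC tunnel is already open on a local port and that PyMongo is installed. The environment variable names, the default port, and the default database name are hypothetical placeholders, not values taken from this repo.

```python
import os


def connect_through_tunnel():
    """Return a PyMongo database handle reached through an already-open tunnel."""
    # pymongo is third-party, so it is imported lazily and only needed here.
    from pymongo import MongoClient

    client = MongoClient(
        host="localhost",
        # ASSUMPTION: hypothetical variable names and defaults, for illustration only.
        port=int(os.environ.get("MONGO_TUNNEL_PORT", "27777")),
        username=os.environ["MONGO_USERNAME"],
        password=os.environ["MONGO_PASSWORD"],
        directConnection=True,
    )
    return client[os.environ.get("MONGO_DBNAME", "nmdc")]


def collection_sizes(db) -> dict:
    """Map each collection name to its estimated document count, sorted by name."""
    return {
        name: db[name].estimated_document_count()
        for name in sorted(db.list_collection_names())
    }
```

With the tunnel up, `collection_sizes(connect_through_tunnel())` would give the names and estimated sizes in one call. `collection_sizes` only relies on `list_collection_names()` and `estimated_document_count()`, so it can be exercised with a stand-in object when no tunnel is available.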
We could also use the mongodump or mongoexport commands. That would still require assembling the JSON files into LinkML-style JSON, even if it isn't validated "yet".
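The assembly step could look roughly like the sketch below. It assumes one mongoexport file per collection, named `<collection>.json`, in a single directory, and that the LinkML-style document is a mapping from collection names (for example `study_set`) to lists of documents. The function names are hypothetical, and dropping MongoDB's `_id` field is an assumption about what the schema accepts.

```python
import json
from pathlib import Path


def load_export(path: Path) -> list:
    """Read one mongoexport file: a JSON array, or one document per line."""
    text = path.read_text(encoding="utf-8").strip()
    if not text:
        return []
    if text.startswith("["):
        docs = json.loads(text)
    else:
        docs = [json.loads(line) for line in text.splitlines() if line.strip()]
    # ASSUMPTION: MongoDB's internal _id is not a schema slot, so it is dropped.
    return [{k: v for k, v in doc.items() if k != "_id"} for doc in docs]


def assemble_database(export_dir: Path) -> dict:
    """Map each collection name (the file stem) to its list of documents."""
    return {p.stem: load_export(p) for p in sorted(export_dir.glob("*.json"))}
```

The result can be written out with `json.dump` and handed to whatever validation step comes next. Other extended-JSON values that mongoexport may emit (dates, for instance) are left untouched here and might need their own handling.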
object orientation:
Implemented in nmdc_schema/mongo_dump_api_emph.py, from branch issue-1070-content-from-mongo
functional_annotation_agg
via the API and defaults back to PyMongo.
- That script still uses PyMongo to get collection names and estimated sizes. @dwinston recently enabled an API solution for this, and I should switch.
Here's a link to the PR in the nmdc-runtime repo, in which that API solution was introduced: https://github.com/microbiomedata/nmdc-runtime/pull/287
Here's a link to the API endpoint on Swagger UI (in production): https://api.microbiomedata.org/docs#/metadata/get_nmdc_database_collection_stats_nmdcschema_collection_stats_get
Issue cleanup note:
Anything left to do here?
I want to update mongo_dump_api_emph.py so that it can get per-collection document counts from https://api.microbiomedata.org/nmdcschema/collection_stats
I think there is an issue for that already, but I haven't found it yet. When I do, I will link it here and close this issue.
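For whoever picks this up, here is a rough sketch of what that update might look like, using only the standard library. The response shape is an assumption that should be checked against the Swagger UI page linked above: each entry is taken to carry an `ns` value of the form `<database>.<collection>` and a count nested under `storageStats`. Both function names are hypothetical.

```python
import json
from urllib.request import urlopen

STATS_URL = "https://api.microbiomedata.org/nmdcschema/collection_stats"


def fetch_collection_stats(url: str = STATS_URL, timeout: float = 30.0) -> list:
    """Download the collection stats and parse the JSON response."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def counts_from_stats(stats: list) -> dict:
    """Reduce the stats entries to {collection_name: document_count}."""
    counts = {}
    for entry in stats:
        # ASSUMPTION: "ns" looks like "<database>.<collection>" and the
        # document count lives at entry["storageStats"]["count"].
        name = entry["ns"].split(".", 1)[-1]
        counts[name] = entry["storageStats"]["count"]
    return counts
```

`counts_from_stats(fetch_collection_stats())` would then stand in for the PyMongo calls. The keys of that dict would also give the collection names, provided the endpoint lists every collection of interest.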