API access to MongoDB collections

turbomam commented 1 year ago

import pprint

import requests

class FastAPIClient:
    def __init__(self, base_url):
        self.base_url = base_url

    def _make_request(self, method, endpoint, params=None, data=None):
        url = f"{self.base_url}/{endpoint}"
        response = requests.request(method, url, params=params, json=data)
        response.raise_for_status()
        return response.json()

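    def get_paginated_data(self, endpoint, params=None, results_key='resources', continuation_key='next_page_token',
                           continuation_parameter='page_token'):
        """Follow continuation tokens and return the results from every page as one list."""
        # Copy so the caller's dict is not mutated when the page token is added.
        params = dict(params or {})
        data = []

        while True:
            response = self._make_request('GET', endpoint, params=params)
            data.extend(response[results_key])

            # Stop when the token is absent or empty; otherwise request the next page.
            next_token = response.get(continuation_key)
            if not next_token:
                break
            params[continuation_parameter] = next_token

        return data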

if __name__ == "__main__":
    client_base_url = "https://api.microbiomedata.org"
    endpoint_name = "nmdcschema/study_set"
    # Query parameters are passed as a dict, not a string.
    query_params = {
        "max_page_size": 20
    }

    client = FastAPIClient(client_base_url)
    paginated_data = client.get_paginated_data(endpoint=endpoint_name, params=query_params)
    pprint.pprint(paginated_data)

turbomam commented 1 year ago

This requirement traces back to the following targets in project.Makefile:

dump-validate-report-mongodb: mongodb-cleanup accepting_legacy_ids_all \
local/mongodb-collection-report.txt \
local/selected_mongodb_contents.json \
local/selected_mongodb_contents_jsonschema_check.txt \
linkml-validate-mongodb \
local/selected_mongodb_contents.json.gz

dump-validate-report-convert-mongodb: mongodb-cleanup \
local/selected_mongodb_contents_fully_repaired.yaml \
local/selected_mongodb_contents_fully_repaired.yaml.gz \
local/selected_mongodb_contents_fully_repaired.ttl \
local/selected_mongodb_contents_fully_repaired.ttl.gz

turbomam commented 1 year ago

These targets start with the mongodb_exporter CLI, which pyproject.toml defines as follows:

mongodb_exporter = "nmdc_schema.mongodb_direct_to_nmdc_Database_file:export_to_yaml"
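
A script entry of this form names a module and a callable separated by a colon. As a rough illustration of how such a string maps to a function, here is a minimal sketch; `resolve_entry_point` is a hypothetical helper written for this note, not part of nmdc-schema or of any packaging tool, and it is demonstrated on a standard-library target so that it runs anywhere.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'package.module:attribute' string to the object it names.

    Hypothetical helper for illustration only.
    """
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # Walk dotted attribute paths such as "Class.method" one step at a time.
    for part in attr_path.split("."):
        if part:
            obj = getattr(obj, part)
    return obj


if __name__ == "__main__":
    # Standard-library stand-in for a project-specific entry point string.
    dumps = resolve_entry_point("json:dumps")
    print(dumps({"ok": True}))
```

With nmdc-schema installed, the same call on the string above would presumably return the export_to_yaml function that the CLI runs.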

turbomam commented 1 year ago

We will be using methods from https://api.microbiomedata.org/docs

turbomam commented 1 year ago

There doesn't seem to be a method for getting collection names. For now, we may still need to get them from a direct MongoDB connection, which generally requires a NERSC ssh key, a NERSC tunnel, and MongoDB credentials.
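
For reference, the direct-connection route could look roughly like the sketch below. It assumes an ssh tunnel is already forwarding MongoDB to a local port; the port number 27777 and the database name "nmdc" are placeholders chosen for illustration, not values taken from this project. PyMongo is imported lazily so the listing helper can be tried against any object that exposes the same two methods.

```python
def list_collections_with_estimates(db):
    """Map each collection name to its estimated document count.

    `db` is expected to behave like a pymongo Database: it needs
    list_collection_names() and item access that returns collections
    exposing estimated_document_count().
    """
    return {
        name: db[name].estimated_document_count()
        for name in sorted(db.list_collection_names())
    }


def connect_via_tunnel(username, password, port=27777, db_name="nmdc"):
    """Open a database handle through a local tunnel.

    The port and database name defaults are placeholders (assumptions).
    """
    from pymongo import MongoClient  # imported here so the helper above has no hard dependency

    client = MongoClient(
        "localhost",
        port,
        username=username,
        password=password,
        directConnection=True,
    )
    return client[db_name]
```

A typical call would be `list_collections_with_estimates(connect_via_tunnel(user, password))` once the tunnel is up.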

turbomam commented 1 year ago

We could also use the mongodump or mongoexport commands. That would still require assembling the per-collection JSON files into LinkML-style JSON, even if the result isn't validated yet.
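
The assembly step might be sketched as follows. The assumptions here are mine: each collection was exported with `mongoexport --jsonArray` into a file named `<collection>.json`, the target shape is a single object whose keys are collection names and whose values are lists of documents, and MongoDB's `_id` field should be dropped because it is not part of the schema. `assemble_database` is a hypothetical name.

```python
import json
from pathlib import Path


def assemble_database(export_dir, drop_keys=("_id",)):
    """Combine per-collection JSON array files into one dict keyed by collection name.

    Assumes one `<collection>.json` file per collection, each holding a JSON array.
    """
    database = {}
    for path in sorted(Path(export_dir).glob("*.json")):
        with path.open() as handle:
            documents = json.load(handle)
        # Strip MongoDB-only keys such as _id before storing the documents.
        database[path.stem] = [
            {key: value for key, value in doc.items() if key not in drop_keys}
            for doc in documents
        ]
    return database


if __name__ == "__main__":
    # "local/exports" is a placeholder directory name.
    combined = assemble_database("local/exports")
    print(json.dumps(combined, indent=2))
```

Without `--jsonArray`, mongoexport writes one document per line, so the loader would need to parse each line separately instead of calling `json.load` on the whole file.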

turbomam commented 1 year ago

object orientation:

Python dataclass?
Pydantic?
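
To make the dataclass option concrete, here is a small sketch of one page of API results modeled with the standard library. The class name `Page` and its helper methods are hypothetical; only the `resources` and `next_page_token` keys come from the client code earlier in this thread.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Page:
    """One page of results from a paginated endpoint (illustrative only)."""

    resources: List[Dict[str, Any]] = field(default_factory=list)
    next_page_token: Optional[str] = None

    @classmethod
    def from_response(cls, payload):
        """Build a Page from a decoded JSON response body."""
        return cls(
            resources=list(payload.get("resources", [])),
            next_page_token=payload.get("next_page_token"),
        )

    @property
    def is_last(self):
        """True when there is no token pointing at a further page."""
        return not self.next_page_token


if __name__ == "__main__":
    page = Page.from_response({"resources": [{"id": "example"}], "next_page_token": "t1"})
    print(page.is_last)
```

A Pydantic model would look similar and would additionally validate field types when the object is created, at the cost of an extra dependency.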

turbomam commented 1 year ago

Implemented in nmdc_schema/mongo_dump_api_emph.py on branch issue-1070-content-from-mongo.

That script can't currently get functional_annotation_agg via the API and falls back to PyMongo.
That script still uses PyMongo to get collection names and estimated sizes. @dwinston recently enabled an API solution for this and I should switch.

eecavanna commented 1 year ago

That script still uses PyMongo to get collection names and estimated sizes. @dwinston recently enabled an API solution for this and I should switch.

Here's a link to the PR in the nmdc-runtime repo, in which that API solution was introduced: https://github.com/microbiomedata/nmdc-runtime/pull/287

Here's a link to the API endpoint on Swagger UI (in production): https://api.microbiomedata.org/docs#/metadata/get_nmdc_database_collection_stats_nmdcschema_collection_stats_get

eecavanna commented 1 year ago

Issue cleanup note:

Update Issue title to be more actionable; e.g. "Use API to get MongoDB collection names".

aclum commented 1 year ago

Anything left to do here?

turbomam commented 1 year ago

I want to update mongo_dump_api_emph.py so that it can get per-collection document counts from https://api.microbiomedata.org/nmdcschema/collection_stats
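
A possible shape for that update is sketched below. The response format is an assumption on my part: the code expects a JSON list in which each entry has an `ns` field of the form `<database>.<collection>` and a `storageStats` object with a `count`. That should be checked against the Swagger documentation before relying on it. The function names are hypothetical, and the standard library is used for the HTTP call only to keep the sketch self-contained.

```python
import json
from urllib.request import urlopen


def fetch_collection_stats(base_url="https://api.microbiomedata.org"):
    """Download the collection_stats payload and decode it from JSON."""
    with urlopen(f"{base_url}/nmdcschema/collection_stats", timeout=30) as response:
        return json.load(response)


def document_counts(stats):
    """Map collection name to document count.

    Assumes entries shaped like
    {"ns": "<database>.<collection>", "storageStats": {"count": <int>}};
    this shape is an assumption, not confirmed by this thread.
    """
    counts = {}
    for entry in stats:
        # Drop the database prefix from the namespace, if there is one.
        collection = entry["ns"].split(".", 1)[-1]
        counts[collection] = entry["storageStats"]["count"]
    return counts


if __name__ == "__main__":
    for name, count in sorted(document_counts(fetch_collection_stats()).items()):
        print(f"{name}\t{count}")
```

Keeping the parsing separate from the download means the counting logic can be checked against a saved payload without network access.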

I think there is an issue for that already, but I haven't found it yet. When I do, I will link it here and close this issue.

turbomam commented 1 year ago

microbiomedata / nmdc-schema

API access to MongoDB collections #1070