Test study gold:Gs0114663 and all of its Biosample parts for Napa id compliance

turbomam commented 1 year ago

@aclum and @brynnz22 : I am working on a code branch that will test per-study fragments of an NMDC MongoDB dump against Napa identifier requirements.

Unfortunately, the workflow I am building requires all non-identifier schema requirements to be met as well. At this time, the Studies in https://api-napa.microbiomedata.org are missing study_category values. I don't think that's the case in the production MongoDB, accessible through https://api.microbiomedata.org

Compare

I think the Napa MongoDB just got "forked" from the production database before the study_category values were added to the production database.

TL;DR: study_category values must be added to the Studies in https://api-napa.microbiomedata.org 's back-end before I can continue with this issue. Can you take care of that for me? Maybe with changesheets against https://api-napa.microbiomedata.org? I don't think I'm the best person to do that but I am glad to discuss it more.

aclum commented 1 year ago

I would strongly prefer a solution where the two systems don't have to stay in sync, otherwise we'll just be chasing our tails. I was hoping to use linkml-validate or something that prints errors instead of having them be resolved.
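A minimal sketch of the "print errors instead of resolving them" idea, in plain Python rather than linkml-validate. The `NAPA_ID` pattern and the `report_id_problems` helper are hypothetical: the pattern is only inferred from ids that appear in this thread (such as nmdc:sty-11-aygzgv51), and the real Napa rules in the schema may be stricter.

```python
import re

# Hypothetical pattern, inferred from ids quoted in this thread;
# the authoritative rules live in the schema, not here.
NAPA_ID = re.compile(r"^nmdc:[a-z]+-[0-9]+-[a-z0-9]+$")


def report_id_problems(records):
    """Return a list of error strings; never raises and never edits the records."""
    problems = []
    for i, rec in enumerate(records):
        rec_id = rec.get("id")
        if rec_id is None:
            problems.append(f"record {i}: missing id")
        elif not NAPA_ID.match(rec_id):
            problems.append(f"record {i}: non-compliant id {rec_id!r}")
    return problems


# Made-up sample records, for illustration only.
sample = [{"id": "nmdc:sty-11-aygzgv51"}, {"id": "gold:Gs0114663"}, {"name": "no id"}]
for line in report_id_problems(sample):
    print(line)
```

Because the function only reports on ids, a missing study_category (or any other non-identifier slot) would not block the check.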

turbomam commented 1 year ago

I can appreciate that.

I have a plan for converting the JSON in MongoDB to RDF, from which I can easily extract just those instances that tell the story of one or more enumerated Studies.

I don't think I can do the JSON to RDF conversion on JSON data that doesn't pass validation, but I'll give it another try.

turbomam commented 1 year ago

I put the converter in --no-validate mode, but it still complained about the missing study_category.

@aclum your preference makes sense to me but I haven't figured a way to deliver it without significant refactoring of code or new development.

- Maybe some other NMDC developer could think of a clever way to query MongoDB to extract subsetted JSON dumps that match your criteria, or write some Python to do that task.
- We could continue the distasteful practice of making just-in-time modified schemas that include the constraints that you want but eliminate the ones you want to ignore.
- I could move forward with some yq queries that would construct the data subset you want. I'm pretty sure that would work at least through the OmicsProcessing step, but it might be less suited for the activities that come after that.
- Or I could just assert study_category: research_study as part of the migration phase.
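The last option above could look roughly like the following. This is only a sketch under stated assumptions: `with_default_study_category` is a hypothetical helper, and it assumes each Study is a plain dict loaded from the JSON dump.

```python
import copy


def with_default_study_category(studies, default="research_study"):
    """Return copies of the studies, filling in study_category only where it is absent."""
    patched = []
    for study in studies:
        study = copy.deepcopy(study)  # leave the caller's data untouched
        study.setdefault("study_category", default)
        patched.append(study)
    return patched


# Made-up records, for illustration only.
studies = [
    {"id": "nmdc:sty-11-aygzgv51"},
    {"id": "nmdc:sty-12-85j6kq06", "study_category": "consortium"},
]
print(with_default_study_category(studies))
```

Existing values are kept, so the default only papers over the gap in the Napa database and would need revisiting once real values are loaded.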

aclum commented 1 year ago

With the API you can pass a filter, i.e., this filters for just biosamples that are part_of nmdc:sty-12-85j6kq06, which is the study I've updated from legacy to napa:

    curl -X 'GET' \
      'https://api-napa.microbiomedata.org/biosamples?filter=part_of%3Anmdc%3Asty-12-85j6kq06&per_page=25' \
      -H 'accept: application/json'
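For anyone scripting this instead of using curl, here is a rough Python equivalent using only the standard library. The helper names `biosamples_url` and `fetch_json` are made up for illustration; the endpoint, filter syntax, and header are taken from the curl call in this comment.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "https://api-napa.microbiomedata.org"


def biosamples_url(study_id, per_page=25):
    """Build the /biosamples URL that filters on part_of, percent-encoding the colons."""
    query = urlencode({"filter": f"part_of:{study_id}", "per_page": per_page})
    return f"{BASE}/biosamples?{query}"


def fetch_json(url):
    """GET the URL and decode the JSON body (performs a network request)."""
    req = Request(url, headers={"accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)


print(biosamples_url("nmdc:sty-12-85j6kq06"))
```

Calling `fetch_json(biosamples_url("nmdc:sty-12-85j6kq06"))` would issue the same request as the curl command; the shape of the response is not assumed here.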

aclum commented 1 year ago

Can you pass an older version of the schema as an argument?

turbomam commented 1 year ago

Thanks. Integrating filter into my workflow in a generalizable way isn't straightforward.

The workflow expects the passed schema to be a file on the local filesystem.

I'm making progress with both

- just-in-time modified schemas
- asserting study_category: research_study as part of the migration phase

I don't think we would need to use both.

turbomam commented 1 year ago

My SPARQL filtering strategy isn't working as well as I had expected. I can just start some new Python based on API requests like https://api-napa.microbiomedata.org/biosamples?filter=part_of%3Anmdc%3Asty-12-85j6kq06 if you're really sure that every collection you need to access will have a really simple filter like that.

turbomam commented 1 year ago

@aclum I am really stuck. Can we please have a working meeting this week?

turbomam commented 1 year ago

Are you sure that there are https://api-napa.microbiomedata.org/docs#/find endpoints for all of the entities that we will need to fetch?

turbomam commented 1 year ago

I'm pretty uneasy with this approach, but here's a prototype of a potential MongoDB dumping step: https://github.com/microbiomedata/nmdc-schema/blob/1205-test-study-goldgs0114663-and-all-of-its-biosmaple-parts-for-napa-id-compliance/nmdc_schema/build_datafile_from_api_requests.py

PS I'm not getting any results from https://api-napa.microbiomedata.org/biosamples?filter=part_of%3Anmdc%3Asty-12-85j6kq06

aclum commented 1 year ago

Sorry, I updated that from a dev id to a prod id, so the new study id is nmdc:sty-11-aygzgv51. What should work across collections is https://api-napa.microbiomedata.org/docs#/metadata/list_from_collection_nmdcschema__collection_name__get
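A sketch of how a generic, per-collection request might be built. Several things here are assumptions and should be checked against the linked docs page: the route `/nmdcschema/{collection_name}` is guessed from the operation name in that link, the `filter` parameter is assumed to take a JSON-encoded MongoDB-style query, `max_page_size` is an assumed parameter name, and `biosample_set` is an assumed collection name. `collection_url` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode

BASE = "https://api-napa.microbiomedata.org"


def collection_url(collection_name, mongo_filter, max_page_size=25):
    """Build a listing URL for one collection (assumed route and parameter names)."""
    query = urlencode(
        {"filter": json.dumps(mongo_filter), "max_page_size": max_page_size}
    )
    return f"{BASE}/nmdcschema/{collection_name}?{query}"


# Same study-membership filter as before, with the updated study id.
print(collection_url("biosample_set", {"part_of": "nmdc:sty-11-aygzgv51"}))
```

If the route behaves as assumed, the same helper would cover every collection by swapping the collection name and filter, which is the generalizable access pattern asked about earlier in the thread.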

I scheduled us some time tomorrow.

microbiomedata / nmdc-schema

Test study gold:Gs0114663 and all of its Biosample parts for Napa id compliance #1205