microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Test study gold:Gs0114663 and all of its Biosample parts for Napa id compliance #1205

Closed turbomam closed 1 year ago

turbomam commented 1 year ago

@aclum and @brynnz22 : I am working on a code branch that will test per-study fragments of an NMDC MongoDB dump against Napa identifier requirements.

Unfortunately, the workflow I am building also requires the data to meet all non-identifier schema requirements. At this time, the Studies in https://api-napa.microbiomedata.org are missing study_category values. I don't think that's the case in the production MongoDB, which is accessible through https://api.microbiomedata.org

Compare

I think the Napa MongoDB just got "forked" from the production database before the study_category values were added to the production database.

TL;DR: study_category values must be added to the Studies in https://api-napa.microbiomedata.org 's back-end before I can continue with this issue. Can you take care of that for me? Maybe with changesheets against https://api-napa.microbiomedata.org? I don't think I'm the best person to do that but I am glad to discuss it more.

aclum commented 1 year ago

I would strongly prefer a solution where the two systems don't have to stay in sync, otherwise we'll just be chasing our tails. I was hoping to use linkml-validate or something that prints errors instead of having them be resolved.
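A report-only check along these lines could be sketched in Python as follows. This is a stand-in for illustration, not linkml-validate itself: the identifier pattern is an assumption inferred from the two nmdc:sty-… ids quoted in this thread, and REQUIRED_SLOTS and collect_problems are hypothetical names, not part of nmdc-schema.

```python
import re

# Assumption: Napa-style ids look like "nmdc:<typecode>-<shoulder>-<blade>",
# inferred only from the ids quoted in this thread.
NAPA_ID_PATTERN = re.compile(r"^nmdc:[a-z]+-[0-9]+-[a-z0-9]+$")

# Hypothetical subset of required slots, for illustration only.
REQUIRED_SLOTS = ("id", "study_category")


def collect_problems(documents):
    """Return a list of problem messages instead of raising on the first one."""
    problems = []
    for index, doc in enumerate(documents):
        label = doc.get("id", f"<document {index}>")
        # Report every missing slot, but keep checking the rest of the document.
        for slot in REQUIRED_SLOTS:
            if slot not in doc:
                problems.append(f"{label}: missing required slot '{slot}'")
        doc_id = doc.get("id")
        if doc_id is not None and not NAPA_ID_PATTERN.match(doc_id):
            problems.append(f"{label}: id does not look like a Napa identifier")
    return problems
```

Printing each entry of collect_problems(documents) lists identifier problems next to unrelated ones such as a missing study_category, so nothing has to be fixed in the database before the id check can run.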

turbomam commented 1 year ago

I can appreciate that.

I have a plan for converting the JSON in MongoDB to RDF, from which I can easily extract just those instances that tell the story of one or more enumerated Studies.

I don't think I can do the JSON to RDF conversion on JSON data that doesn't pass validation, but I'll give it another try.

turbomam commented 1 year ago

I put the converter in --no-validate mode, but it still complained about the missing study_category.

@aclum your preference makes sense to me but I haven't figured a way to deliver it without significant refactoring of code or new development.

aclum commented 1 year ago

With the API you can pass a filter, i.e. this filters for just biosamples that are part_of nmdc:sty-12-85j6kq06, which is the study I've updated from legacy to napa:

curl -X 'GET' \
  'https://api-napa.microbiomedata.org/biosamples?filter=part_of%3Anmdc%3Asty-12-85j6kq06&per_page=25' \
  -H 'accept: application/json'
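For a scripted workflow, the same request could be built in Python roughly like this. It is a minimal sketch using only the standard library; the function names are hypothetical, and the only endpoint details used are the ones visible in the curl command (the /biosamples path and the filter and per_page parameters).

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api-napa.microbiomedata.org"


def biosamples_url(study_id, per_page=25):
    """Build the /biosamples URL filtered to the biosamples that are part_of one study."""
    query = urlencode({"filter": f"part_of:{study_id}", "per_page": per_page})
    return f"{API_BASE}/biosamples?{query}"


def fetch_biosamples(study_id, per_page=25):
    """Perform the GET request and return the decoded JSON body (needs network access)."""
    request = Request(biosamples_url(study_id, per_page),
                      headers={"accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)
```

Calling biosamples_url("nmdc:sty-12-85j6kq06") reproduces the URL used in the curl command, including the percent-encoded colons.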

aclum commented 1 year ago

Can you pass an older version of the schema as an argument?

turbomam commented 1 year ago

Thanks. Integrating filter into my workflow in a generalizable way isn't straightforward.

The workflow expects the passed schema to be a file on the local filesystem.

I'm making progress with both suggestions (filtering through the API and passing an older schema).

I don't think we would need to use both.

turbomam commented 1 year ago

My SPARQL filtering strategy isn't working as well as I had expected. I can just start some new Python based on API requests like https://api-napa.microbiomedata.org/biosamples?filter=part_of%3Anmdc%3Asty-12-85j6kq06 if you're really sure that every collection you need to access will have a really simple filter like that.

turbomam commented 1 year ago

@aclum I am really stuck. Can we please have a working meeting this week?

turbomam commented 1 year ago

Are you sure that there are https://api-napa.microbiomedata.org/docs#/find endpoints for all of the entities that we will need to fetch?

turbomam commented 1 year ago

I'm pretty uneasy with this approach, but here's a prototype of a potential MongoDB dumping step: https://github.com/microbiomedata/nmdc-schema/blob/1205-test-study-goldgs0114663-and-all-of-its-biosmaple-parts-for-napa-id-compliance/nmdc_schema/build_datafile_from_api_requests.py

PS I'm not getting any results from https://api-napa.microbiomedata.org/biosamples?filter=part_of%3Anmdc%3Asty-12-85j6kq06

aclum commented 1 year ago

Sorry, I updated that to a prod id from a dev id, so the new study id is nmdc:sty-11-aygzgv51.

What should work across collections is https://api-napa.microbiomedata.org/docs#/metadata/list_from_collection_nmdcschema__collection_name__get
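A collection-agnostic fetch against that endpoint might look like the sketch below. Several details are assumptions rather than facts from this thread: the /nmdcschema/{collection_name} path is inferred from the operation name in the docs link, and the filter, max_page_size and page_token parameters, the resources and next_page_token response keys, and the biosample_set collection name in the usage note should all be checked against the API docs before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api-napa.microbiomedata.org"


def collection_url(collection_name, mongo_filter, max_page_size=25, page_token=None):
    """Build a list-from-collection URL (path and parameter names are assumptions)."""
    params = {
        "filter": json.dumps(mongo_filter, separators=(",", ":")),
        "max_page_size": max_page_size,
    }
    if page_token:
        params["page_token"] = page_token
    return f"{API_BASE}/nmdcschema/{collection_name}?{urlencode(params)}"


def fetch_json(url):
    """GET a URL and decode the JSON body (needs network access)."""
    request = Request(url, headers={"accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


def iter_documents(collection_name, mongo_filter, fetch=fetch_json):
    """Yield documents page by page until no next-page token is returned."""
    token = None
    while True:
        page = fetch(collection_url(collection_name, mongo_filter, page_token=token))
        # Assumed response shape: {"resources": [...], "next_page_token": "..."}
        yield from page.get("resources", [])
        token = page.get("next_page_token")
        if not token:
            break
```

For example, iter_documents("biosample_set", {"part_of": "nmdc:sty-11-aygzgv51"}) would walk every page of biosamples for the updated study. The fetch argument is injectable so the paging logic can be exercised without touching the live API.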

I scheduled us some time tomorrow.