I would strongly prefer a solution where the two systems don't have to stay in sync; otherwise we'll just be chasing our tails. I was hoping to use linkml-validate, or something similar that prints errors instead of resolving them.
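Something like this is the kind of check I have in mind (a minimal sketch assuming the `linkml.validator` Python API from recent LinkML releases; the data file name and target class are hypothetical):

```python
# Minimal sketch: print validation errors instead of resolving them.
# Assumes the linkml.validator API from recent LinkML releases; the data
# file name and target class are hypothetical.
import json

from linkml.validator import validate

with open("biosample_set.json") as f:
    instances = json.load(f)

for instance in instances:
    report = validate(instance, "nmdc_schema/nmdc.yaml", "Biosample")
    for result in report.results:
        print(result.severity, result.message)
```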
I can appreciate that.
I have a plan for converting the JSON in MongoDB to RDF, from which I can easily extract just those instances that tell the story of one or more enumerated Studies.
I don't think I can do the JSON to RDF conversion on JSON data that doesn't pass validation, but I'll give it another try.
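For the extraction step, I'm imagining something like this rdflib sketch (the dump file name, the `nmdc:part_of` predicate IRI, and the study CURIE are all assumptions for illustration; note this only grabs the direct triples, so following the chain through OmicsProcessing and downstream activities would need more graph patterns):

```python
# Sketch: extract the subgraph for one Study from an RDF dump with rdflib.
# The dump file name, the nmdc:part_of predicate IRI, and the study CURIE
# are assumptions for illustration.
from rdflib import Graph

g = Graph()
g.parse("mongodb_dump.ttl", format="turtle")

query = """
PREFIX nmdc: <https://w3id.org/nmdc/>
CONSTRUCT { ?s ?p ?o }
WHERE {
  ?s nmdc:part_of nmdc:sty-12-85j6kq06 .
  ?s ?p ?o .
}
"""
result = g.query(query)
result.graph.serialize(destination="study_subset.ttl", format="turtle")
```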
I put the converter in `--no-validate` mode, but it still complained about the missing `study_category` values.
@aclum your preference makes sense to me, but I haven't figured out a way to deliver it without significant refactoring of code or new development.
I could write `yq` queries that would construct the data subset you want. I'm pretty sure that would work at least through the OmicsProcessing step, but it might be less suited for the activities that come after that.

I added `study_category: research_study` as part of the migration phase. With the API you can pass a filter; for example, this filters for just the biosamples that are part_of nmdc:sty-12-85j6kq06, which is the study I've updated from legacy to napa:

```shell
curl -X 'GET' \
  'https://api-napa.microbiomedata.org/biosamples?filter=part_of%3Anmdc%3Asty-12-85j6kq06&per_page=25' \
  -H 'accept: application/json'
```
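The same call from Python, as a rough sketch (I'm assuming the find endpoints return matching records under a `results` key):

```python
# Sketch of the same biosample filter query from Python.
# Assumes the find endpoints return matching records under a "results" key.
import requests

resp = requests.get(
    "https://api-napa.microbiomedata.org/biosamples",
    params={"filter": "part_of:nmdc:sty-12-85j6kq06", "per_page": 25},
)
resp.raise_for_status()
for biosample in resp.json().get("results", []):
    print(biosample["id"])
```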
Can you pass an older version of the schema as an argument?
Thanks. Integrating `filter` into my workflow in a generalizable way isn't straightforward.
The workflow expects the passed schema to be a file on the local filesystem.
I'm making progress with both.
I don't think we would need to use both.
My SPARQL filtering strategy isn't working as well as I had expected. I can just start some new Python based on API requests like https://api-napa.microbiomedata.org/biosamples?filter=part_of%3Anmdc%3Asty-12-85j6kq06 if you're really sure that every collection we need to access will have a really simple filter like that.
@aclum I am really stuck. Can we please have a working meeting this week?
Are you sure that there are https://api-napa.microbiomedata.org/docs#/find endpoints for all of the entities that we will need to fetch?
I'm pretty uneasy with this approach, but here's a prototype of a potential MongoDB dumping step: https://github.com/microbiomedata/nmdc-schema/blob/1205-test-study-goldgs0114663-and-all-of-its-biosmaple-parts-for-napa-id-compliance/nmdc_schema/build_datafile_from_api_requests.py
PS I'm not getting any results from https://api-napa.microbiomedata.org/biosamples?filter=part_of%3Anmdc%3Asty-12-85j6kq06
Sorry, I updated that to a prod ID from a dev ID, so the new study ID is nmdc:sty-11-aygzgv51. What should work across collections is https://api-napa.microbiomedata.org/docs#/metadata/list_from_collection_nmdcschema__collection_name__get
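A sketch of paging through an arbitrary collection with that endpoint (the `filter`, `max_page_size`, and `page_token` parameter names and the `resources`/`next_page_token` response keys are my recollection, so double-check them against the docs):

```python
# Sketch: page through any collection via the generic nmdcschema endpoint.
# Parameter names (filter, max_page_size, page_token) and response keys
# (resources, next_page_token) are assumptions; verify against the API docs.
import json

import requests


def fetch_collection(collection: str, mongo_filter: dict) -> list[dict]:
    """Fetch all records in a collection matching a MongoDB-style filter."""
    url = f"https://api-napa.microbiomedata.org/nmdcschema/{collection}"
    params = {"filter": json.dumps(mongo_filter), "max_page_size": 100}
    records = []
    while True:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        body = resp.json()
        records.extend(body.get("resources", []))
        token = body.get("next_page_token")
        if not token:
            return records
        params["page_token"] = token


biosamples = fetch_collection(
    "biosample_set", {"part_of": "nmdc:sty-11-aygzgv51"}
)
print(len(biosamples))
```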
I scheduled us some time tomorrow.
@aclum and @brynnz22 : I am working on a code branch that will test per-study fragments of an NMDC MongoDB dump against Napa identifier requirements.
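The heart of that check is just a pattern match, roughly like the sketch below (the regex is illustrative; the authoritative patterns come from the schema's `id` slot definitions):

```python
# Sketch: flag ids that don't match the Napa typecode-shoulder-blade shape.
# The regex is illustrative; the real patterns live in the schema's id slots.
import re

NAPA_ID = re.compile(r"^nmdc:[a-z]{2,6}-[0-9a-z]{1,3}-[0-9a-z]+(\.[0-9]+)?$")


def non_napa_ids(records: list[dict]) -> list[str]:
    """Return the ids in records that fail the Napa pattern."""
    return [r["id"] for r in records if not NAPA_ID.match(r.get("id", ""))]


# A Napa-compliant Study id passes; a legacy GOLD id is flagged.
print(non_napa_ids([{"id": "nmdc:sty-11-aygzgv51"}, {"id": "gold:Gs0114663"}]))
```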
Unfortunately, the workflow I am building will also require that all non-identifier schema requirements are met. At this time, the Studies in https://api-napa.microbiomedata.org are missing `study_category` values. I don't think that's the case in the production MongoDB, accessible through https://api.microbiomedata.org. I think the Napa MongoDB just got "forked" from the production database before the `study_category` values were added to production.

TL;DR: `study_category` values must be added to the Studies in the https://api-napa.microbiomedata.org back-end before I can continue with this issue. Can you take care of that for me? Maybe with changesheets against https://api-napa.microbiomedata.org? I don't think I'm the best person to do that, but I am glad to discuss it more.
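If changesheets are the route, I imagine something like the sketch below (the `id`/`action`/`attribute`/`value` column layout, the `update` action, and the `/metadata/changesheets:validate` endpoint are my understanding of the Runtime API, so treat all of them as assumptions):

```python
# Sketch: validate a changesheet that sets study_category on one Study.
# The TSV layout, the "update" action, and the changesheets:validate
# endpoint are assumptions; check the Runtime API docs before using.
import requests

changesheet = (
    "id\taction\tattribute\tvalue\n"
    "nmdc:sty-11-aygzgv51\tupdate\tstudy_category\tresearch_study\n"
)

resp = requests.post(
    "https://api-napa.microbiomedata.org/metadata/changesheets:validate",
    files={
        "uploaded_file": (
            "changesheet.tsv",
            changesheet,
            "text/tab-separated-values",
        )
    },
)
print(resp.status_code, resp.text)
```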