microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

need to link SPRUCE data objects to gold projects #399

Closed wdduncan closed 2 years ago

wdduncan commented 2 years ago

@scanon @dwinston @dehays We need to be able to link GOLD projects to the data objects you have for the SPRUCE project. I thought you were going to put the GOLD ID of the project in the "was generated by" slot, but I may be misremembering.

Here is a list of the gold project ids (123 of them).

Gp0208361
Gp0208364
Gp0213354
Gp0213365
Gp0138734
Gp0138735
Gp0138741
Gp0138753
Gp0213340
Gp0213342
Gp0213348
Gp0213349
Gp0213367
Gp0213375
Gp0208363
Gp0213334
Gp0213360
Gp0213361
Gp0138756
Gp0138760
Gp0138727
Gp0138742
Gp0138749
Gp0213336
Gp0213346
Gp0213359
Gp0213369
Gp0213371
Gp0208349
Gp0208350
Gp0208353
Gp0208377
Gp0208378
Gp0208380
Gp0208381
Gp0208362
Gp0213366
Gp0138759
Gp0138736
Gp0138752
Gp0138762
Gp0213357
Gp0208366
Gp0208370
Gp0208374
Gp0208375
Gp0213345
Gp0213355
Gp0213372
Gp0138740
Gp0138743
Gp0138758
Gp0138733
Gp0138744
Gp0213363
Gp0213373
Gp0213374
Gp0208358
Gp0208367
Gp0208371
Gp0208346
Gp0213332
Gp0213333
Gp0213335
Gp0213343
Gp0138746
Gp0138739
Gp0213347
Gp0213352
Gp0208343
Gp0208355
Gp0208356
Gp0208379
Gp0208344
Gp0208351
Gp0208360
Gp0213331
Gp0213362
Gp0138731
Gp0138738
Gp0138745
Gp0138750
Gp0138754
Gp0138761
Gp0138764
Gp0138732
Gp0138755
Gp0213353
Gp0213368
Gp0213370
Gp0208365
Gp0208376
Gp0208347
Gp0213337
Gp0213350
Gp0213364
Gp0138748
Gp0138747
Gp0213339
Gp0213356
Gp0213344
Gp0208352
Gp0208359
Gp0208372
Gp0208382
Gp0208348
Gp0208369
Gp0213338
Gp0213341
Gp0138728
Gp0138729
Gp0138730
Gp0138737
Gp0138757
Gp0138763
Gp0138751
Gp0213358
Gp0213351
Gp0208354
Gp0208357
Gp0208373
Gp0208345
Gp0208368

dehays commented 2 years ago

@wdduncan The slot you are thinking of on metaG activities is 'was_informed_by'

@wdduncan @dwinston Linking the sequencing projects that Bill has included could be done by iterating through the list of SP IDs to get instances of documents in the read_QC_analysis_activity_set. For example, query for the nmdc:ReadQCAnalysisActivity where was_informed_by is "gold:Gp0208361", etc.

`{
    "has_input": ["nmdc:dac54b23fce5a5c56c11311c77b74294"],
    "git_url": "https://github.com/microbiomedata/mg_annotation/releases/tag/0.1",
    "has_output": ["nmdc:457cded9b27ef66bb7a306dd61639774", "nmdc:2d6aaadb2e2d175ab3c39df88cabfa09"],
    "was_informed_by": "gold:Gp0208361",
    "input_read_count": 86662544,
    "output_read_bases": {
        "$numberLong": "12902064623"
    },
    "id": "nmdc:f1d1a3d044767ded5828bf67415a41be",
    "execution_resource": "NERSC-Cori",
    "input_read_bases": {
        "$numberLong": "13086044144"
    },
    "name": "Read QC Activity for nmdc:mga07w21",
    "output_read_count": 86035480,
    "started_at_time": "2021-08-11T00:35:01+00:00",
    "type": "nmdc:ReadQCAnalysisActivity",
    "ended_at_time": "2021-09-15T00:02:20+00:00"
}`

Then use the value you see in the has_input list to set a value in the has_output of the nmdc:OmicsProcessing document with ID "gold:Gp0208361".

Repeat for the next project ID, etc.
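The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the documents are plain Python dicts already loaded from omics_processing_set and read_QC_analysis_activity_set, and `link_outputs` is a hypothetical helper name, not something that exists in the NMDC code. Against a live database the same logic would be a query on was_informed_by followed by an update of has_output.

```python
from typing import Dict, List


def link_outputs(omics_docs: List[dict], read_qc_docs: List[dict]) -> Dict[str, List[str]]:
    """Copy the has_input IDs of each read QC activity into the has_output
    of the omics processing document named by its was_informed_by.

    Hypothetical helper: documents are assumed to be in-memory dicts.
    Returns the updated has_output list per linked project ID.
    """
    # Group read QC inputs by the project that informed the activity.
    inputs_by_project: Dict[str, List[str]] = {}
    for activity in read_qc_docs:
        project_id = activity.get("was_informed_by")
        if project_id is None:
            continue
        inputs_by_project.setdefault(project_id, []).extend(activity.get("has_input", []))

    linked: Dict[str, List[str]] = {}
    for doc in omics_docs:
        new_ids = inputs_by_project.get(doc["id"], [])
        if not new_ids:
            continue  # no read QC activity was informed by this project
        outputs = doc.setdefault("has_output", [])
        for data_object_id in new_ids:
            if data_object_id not in outputs:  # stay idempotent on re-runs
                outputs.append(data_object_id)
        linked[doc["id"]] = list(outputs)
    return linked


# Example using the IDs from the read QC document shown above.
omics = [{"id": "gold:Gp0208361"}, {"id": "gold:Gp0208380"}]
read_qc = [
    {
        "id": "nmdc:f1d1a3d044767ded5828bf67415a41be",
        "was_informed_by": "gold:Gp0208361",
        "has_input": ["nmdc:dac54b23fce5a5c56c11311c77b74294"],
    }
]
result = link_outputs(omics, read_qc)
```

Here `result` maps "gold:Gp0208361" to its single raw data object, and the second project is left untouched because no activity refers to it.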

wdduncan commented 2 years ago

oops ... I used the wrong predicate, it should have been was_informed_by.

So, in the example above, the has_output property for gold:Gp0208361 should be set to ["nmdc:dac54b23fce5a5c56c11311c77b74294"].

Right?

scanon commented 2 years ago

I'm using the same ID type that I used for the original FICUS data sets. How was the linkage happening before? I had assumed that there was an omic_process_activity record that had the "gold:GpXXXX" id and some other linkages.

wdduncan commented 2 years ago

@scanon in the previous iteration, @dehays provided me with a file that contained the outputs of the gold projects.

dwinston commented 2 years ago

I was able to add data_object_set IDs to has_output for 106 of the 123 omics_processing_set documents that are part_of the SPRUCE study (gold:Gs0110138). I found these data_object_set IDs as the has_input of the read_QC_analysis_activity_set documents whose was_informed_by is one of those 123 omics_processing_set IDs (in this case, GOLD project IDs).
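For anyone repeating this, the split between projects that inform a read QC activity and projects that do not can be sketched as a simple membership test. This is a minimal sketch, assuming the read QC documents are in-memory dicts; `split_by_analysis` is a hypothetical name used only for illustration, and the activity ID below is a placeholder.

```python
from typing import Iterable, List, Tuple


def split_by_analysis(
    project_ids: Iterable[str], read_qc_docs: List[dict]
) -> Tuple[List[str], List[str]]:
    """Partition project IDs by whether any read QC activity names them
    in was_informed_by. Input order is preserved in both result lists."""
    informed = {
        activity["was_informed_by"]
        for activity in read_qc_docs
        if "was_informed_by" in activity
    }
    with_analysis: List[str] = []
    without_analysis: List[str] = []
    for project_id in project_ids:
        if project_id in informed:
            with_analysis.append(project_id)
        else:
            without_analysis.append(project_id)
    return with_analysis, without_analysis


# Tiny example: one project has a read QC activity, one does not.
projects = ["gold:Gp0208361", "gold:Gp0208358"]
activities = [{"id": "nmdc:example-activity", "was_informed_by": "gold:Gp0208361"}]
found, missing = split_by_analysis(projects, activities)
```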

The following 17 omics_processing_set IDs were not found to inform (via was_informed_by) any read_QC_analysis_activity_set documents:

gold:Gp0208358
gold:Gp0208345
gold:Gp0208347
gold:Gp0208357
gold:Gp0208346
gold:Gp0208344
gold:Gp0208348
gold:Gp0208356
gold:Gp0208355
gold:Gp0208351
gold:Gp0208343
gold:Gp0208380
gold:Gp0208350
gold:Gp0208349
gold:Gp0208352
gold:Gp0208353
gold:Gp0208354

dehays commented 2 years ago

@dwinston Speaking with Shane earlier - he said analysis was done for 107 metagenomes but that there may have been one that had issues. The remaining 16 are metatranscriptomes. So I think you have all 106 projects for which there is analysis.

Thinking beyond the current metadata release, Shane suggested that analysis metadata automation might be better suited to take on responsibility for creating the omics_processing (sequencing project) documents, because it is at that point that there is enough information to know whether a sequencing project should be included. The GOLD ETL, by contrast, does not know what is going to happen later with analysis. The GOLD project ID is also available from the initial query to JAMO to get the raw fastq. It is possible that the GOLD biosample and GOLD study are NOT always available from JAMO metadata records, but the GOLD API could provide a source for these.

dwinston commented 2 years ago

Thanks. It seems that gold:Gp0208380 is the one with an issue.

Also, the following is true, as you requested:

(mdb.omics_processing_set.count_documents({"has_output.0": {"$exists": True}})
 ==
 mdb.omics_processing_set.count_documents({})
)

wdduncan commented 2 years ago

Can we close this issue?