Closed: wdduncan closed this issue 2 years ago
@wdduncan The slot you are thinking of on metaG activities is 'was_informed_by'
@wdduncan @dwinston Linking the sequencing projects that Bill has included could be done by iterating through the list of SP IDs to get instances of documents in the read_QC_analysis_activity_set. For example, query for the nmdc:ReadQCAnalysisActivity where was_informed_by is "gold:Gp0208361", etc.
`{
"has_input": ["nmdc:dac54b23fce5a5c56c11311c77b74294"],
"git_url": "https://github.com/microbiomedata/mg_annotation/releases/tag/0.1",
"has_output": ["nmdc:457cded9b27ef66bb7a306dd61639774", "nmdc:2d6aaadb2e2d175ab3c39df88cabfa09"],
"was_informed_by": "gold:Gp0208361",
"input_read_count": 86662544,
"output_read_bases": {
"$numberLong": "12902064623"
},
"id": "nmdc:f1d1a3d044767ded5828bf67415a41be",
"execution_resource": "NERSC-Cori",
"input_read_bases": {
"$numberLong": "13086044144"
},
"name": "Read QC Activity for nmdc:mga07w21",
"output_read_count": 86035480,
"started_at_time": "2021-08-11T00:35:01+00:00",
"type": "nmdc:ReadQCAnalysisActivity",
"ended_at_time": "2021-09-15T00:02:20+00:00"
}`
Then use the value you see in the has_input list to set a value in the has_output of the nmdc:OmicsProcessing document with ID "gold:Gp0208361".
Repeat for the next project ID, etc.
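The steps above could be sketched roughly as follows. This is only an in-memory illustration working on plain dicts shaped like the example document; the function names `collect_inputs_by_project` and `set_has_output` are made up for this sketch and are not part of any NMDC tooling. It also assumes was_informed_by may be either a single string (as in the example) or a list.

```python
from collections import defaultdict


def collect_inputs_by_project(read_qc_docs):
    """Map each was_informed_by ID to the has_input IDs of its Read QC activities."""
    inputs = defaultdict(list)
    for doc in read_qc_docs:
        informed_by = doc.get("was_informed_by", [])
        # The example document stores a single string; tolerate a list as well.
        if isinstance(informed_by, str):
            informed_by = [informed_by]
        for project_id in informed_by:
            for data_object_id in doc.get("has_input", []):
                if data_object_id not in inputs[project_id]:
                    inputs[project_id].append(data_object_id)
    return dict(inputs)


def set_has_output(omics_docs, inputs_by_project):
    """Return copies of the omics docs with has_output set where inputs are known."""
    updated = []
    for doc in omics_docs:
        new_doc = dict(doc)
        if doc["id"] in inputs_by_project:
            new_doc["has_output"] = list(inputs_by_project[doc["id"]])
        updated.append(new_doc)
    return updated


# Trimmed-down version of the Read QC document shown above.
read_qc = [
    {
        "id": "nmdc:f1d1a3d044767ded5828bf67415a41be",
        "type": "nmdc:ReadQCAnalysisActivity",
        "was_informed_by": "gold:Gp0208361",
        "has_input": ["nmdc:dac54b23fce5a5c56c11311c77b74294"],
    }
]
omics = [{"id": "gold:Gp0208361", "type": "nmdc:OmicsProcessing"}]

linked = set_has_output(omics, collect_inputs_by_project(read_qc))
print(linked[0]["has_output"])
```

Against the real database, the same lookup would presumably be a query on read_QC_analysis_activity_set filtered by was_informed_by, followed by an update on omics_processing_set.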
Oops ... I used the wrong predicate; it should have been was_informed_by.
So, in the example above, the has_output property for gold:Gp0208361 should be set to ["nmdc:dac54b23fce5a5c56c11311c77b74294"]. Right?
I'm using the same ID type that I used for the original FICUS data sets. How was the linkage happening before? I had assumed that there was an omic_process_activity record that had the "gold:GpXXXX" id and some other linkages.
@scanon in the previous iteration, @dehays provided me with a file that contained the outputs of the gold projects.
I was able to add data_object_set IDs to has_output for 106 of the 123 omics_processing_set documents that are part_of the SPRUCE study (gold:Gs0110138), by finding these data_object_set IDs as the has_input of read_QC_analysis_activity_set documents that are was_informed_by any of those 123 omics_processing_set IDs (in this case, GOLD project IDs).
The following 17 omics_processing_set IDs were not found to inform (via was_informed_by) any read_QC_analysis_activity_set documents:
gold:Gp0208358
gold:Gp0208345
gold:Gp0208347
gold:Gp0208357
gold:Gp0208346
gold:Gp0208344
gold:Gp0208348
gold:Gp0208356
gold:Gp0208355
gold:Gp0208351
gold:Gp0208343
gold:Gp0208380
gold:Gp0208350
gold:Gp0208349
gold:Gp0208352
gold:Gp0208353
gold:Gp0208354
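For reference, a list like the one above can be derived with a simple set difference. The sketch below is an in-memory illustration only; `find_uninformed` is a hypothetical helper name, and the sample data is a two-project excerpt rather than the full set of 123.

```python
def find_uninformed(project_ids, read_qc_docs):
    """Return the project IDs that no Read QC activity lists in was_informed_by."""
    informing = set()
    for doc in read_qc_docs:
        informed_by = doc.get("was_informed_by", [])
        # Accept either a single ID string or a list of IDs.
        if isinstance(informed_by, str):
            informed_by = [informed_by]
        informing.update(informed_by)
    # Preserve the order of the incoming project list.
    return [pid for pid in project_ids if pid not in informing]


project_ids = ["gold:Gp0208361", "gold:Gp0208358"]
read_qc = [
    {
        "id": "nmdc:f1d1a3d044767ded5828bf67415a41be",
        "was_informed_by": "gold:Gp0208361",
        "has_input": ["nmdc:dac54b23fce5a5c56c11311c77b74294"],
    }
]

print(find_uninformed(project_ids, read_qc))
```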
@dwinston Speaking with Shane earlier - he said analysis was done for 107 metagenomes but that there may have been one that had issues. The remaining 16 are metatranscriptomes. So I think you have all 106 projects for which there is analysis.
Thinking beyond the current metadata release, Shane suggested that analysis metadata automation might be better suited to take on responsibility for creating the omics_processing (sequencing project) records, because it is at that point that there is enough information to know whether a sequencing project should be included. The GOLD ETL, by contrast, does not know what is going to happen later with analysis. The GOLD project ID is also available from the initial query to JAMO to get the raw fastq. It is possible that the GOLD biosample and GOLD study are NOT always available from JAMO metadata records, but the GOLD API could provide a source for these.
Thanks. It seems that gold:Gp0208380 is the one with an issue.
Also, the following is true, as you requested:
(mdb.omics_processing_set.count_documents({"has_output.0": {"$exists": True}})
==
mdb.omics_processing_set.count_documents({})
)
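For anyone reading along: the "has_output.0" filter with $exists matches documents whose has_output array has at least one element, so the comparison says every omics_processing_set document has a non-empty has_output. A rough in-memory equivalent, with hypothetical helper names and made-up sample lists, might look like this:

```python
def has_nonempty_output(doc):
    """True if the document has a has_output list with at least one element."""
    outputs = doc.get("has_output")
    return isinstance(outputs, list) and len(outputs) > 0


def all_have_output(docs):
    """Mirror the count comparison: documents with output == all documents."""
    docs = list(docs)
    with_output = sum(1 for doc in docs if has_nonempty_output(doc))
    return with_output == len(docs)


complete = [
    {"id": "gold:Gp0208361", "has_output": ["nmdc:dac54b23fce5a5c56c11311c77b74294"]},
]
# Same list plus one document with no has_output at all.
incomplete = complete + [{"id": "gold:Gp0208380"}]

print(all_have_output(complete), all_have_output(incomplete))
```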
Can we close this issue?
@scanon @dwinston @dehays We need to be able to link GOLD projects to the data objects you have for the SPRUCE project. I thought you were going to put the GOLD ID of the project in the was generated by slot, but I may be misremembering. Here is a list of the GOLD project IDs (123 of them).