microbiomedata / nmdc_automation

Prototype automation

ID blade mismatch for metagenome_assembly_set records for (EMP500) #201

Closed aclum closed 1 month ago

aclum commented 3 months ago

This ticket derives from the URL-checking work done in https://github.com/microbiomedata/nmdc_automation/issues/186

All 79 metagenome_assembly_set records from nmdc:sty-11-547rwq94 have a different blade in mongo than the files on the file system. This means that the URLs for the data files (data_object_set.url) don't resolve.

Options for fixing this:

  1. Re-ID using the IDs in mongo prod: use nmdc_sty-11-547rwq94_updated_record_identifiers.tsv to map from the new ID in mongo to the legacy ID, get the old file system path, and make new assembly files and DataObject records.

  2. Find a commit that has metagenome_assembly_set and data_object_set records corresponding to the IDs on the file system, delete the existing metagenome_assembly_set records and the data_object_set records that are has_output of them, and replace them with metagenome_assembly_set and data_object_set records that match what is on the file system. There is not enough information in Michael's commit to use this option.

Example: for workflow records in the database that are part_of nmdc:omprc-11-0nftn704:

reads QC nmdc:wfrqc-11-hhbdnx62.1: has a corresponding data dir

read-based taxonomy nmdc:wfrbt-11-0es1zz02.1: has a corresponding data dir

assembly nmdc:wfmgas-11-fk987013.1: no data dir

Assembly directories on the file system: nmdc:wfmgas-11-hbhxv517.1, nmdc:wfmgas-11-371gjw05.1

URLs that don't resolve because of blade issues, not ID version issues: asm_blade_not_in_mongo.txt
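
The mismatch in the example can be sketched in Python as a comparison of blades between the workflow IDs in mongo and the directory names on the file system. This is only an illustration, not code from nmdc_automation: the ID pattern (typecode, shoulder, blade, optional version) is an assumption inferred from the IDs quoted in this ticket, and `blade` and `blade_mismatches` are hypothetical helper names.

```python
import re

# Assumed ID shape, inferred from the examples in this ticket:
#   nmdc:<typecode>-<shoulder>-<blade>[.<version>]
WORKFLOW_ID = re.compile(
    r"^nmdc:(?P<typecode>[a-z]+)-(?P<shoulder>\d+)-(?P<blade>[a-z0-9]+)"
    r"(?:\.(?P<version>\d+))?$"
)


def blade(workflow_id: str) -> str:
    """Return the blade portion of an ID such as nmdc:wfmgas-11-fk987013.1."""
    match = WORKFLOW_ID.match(workflow_id)
    if match is None:
        raise ValueError(f"unrecognized ID: {workflow_id!r}")
    return match.group("blade")


def blade_mismatches(mongo_ids, filesystem_dirs):
    """Report blades that appear only in mongo or only on the file system."""
    mongo = {blade(i) for i in mongo_ids}
    disk = {blade(d) for d in filesystem_dirs}
    return {
        "only_in_mongo": sorted(mongo - disk),
        "only_on_disk": sorted(disk - mongo),
    }


# IDs taken from the example above.
report = blade_mismatches(
    ["nmdc:wfmgas-11-fk987013.1"],
    ["nmdc:wfmgas-11-hbhxv517.1", "nmdc:wfmgas-11-371gjw05.1"],
)
print(report)
```

For the example record, the blade in mongo (fk987013) has no directory, and both directories on disk carry blades that mongo does not know about.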

aclum commented 3 months ago

Assembly IDs that are on the file system (nmdc:wfmgas-11-hbhxv517.1) are first referenced in https://github.com/microbiomedata/nmdc_automation/commit/44978ae267ec33a863f0f43c5cd91094986feadc by Michael on May 13, 2024. However, the only file checked in is nmdc:sty-11-547rwq94_process_records.log, which isn't enough to use these records, since there is no corresponding metadata JSON file that we could submit.

aclum commented 3 months ago

Then, on May 16, the file that references nmdc:wfmgas-11-hbhxv517.1 was moved by @mbthornton-lbl:

https://github.com/microbiomedata/nmdc_automation/commits/be736e79ff58a344d93150838e25e7c066f9eeb3/nmdc_automation/re_iding/scripts/data/nmdc%3Asty-11-547rwq94/test_run/nmdc%3Asty-11-547rwq94_process_records.log

aclum commented 3 months ago

This commit on May 8th by Michael is the first reference to the IDs that exist in mongo (nmdc:wfmgas-11-fk987013.1). Note that the commit message says 'data and log files from process-records no filesystem updates':

https://github.com/microbiomedata/nmdc_automation/commit/e302e1bd62f276cd655225b86bc9e195ecd7a224

aclum commented 3 months ago

This commit on May 17th by Michael deleted the references to the workflow IDs that exist in mongo (nmdc:wfmgas-11-fk987013.1): https://github.com/microbiomedata/nmdc_automation/commit/0ba0d4f66f3a52e9fdff63c339d1e1df08771e43

A local copy, or a copy in some other branch, of nmdc:sty-11-547rwq94_updated_record_identifiers.tsv must have been used for production re-iding, since that happened on May 20, 2024 for this study.

This commit on June 25th (today) re-adds those references: https://github.com/microbiomedata/nmdc_automation/commit/8103d78f9b0a8f7f00c5b49fe146b955ad9bec7d

mbthornton-lbl commented 2 months ago

asm_blade_not_in_mongo.txt

aclum commented 2 months ago

Options for resolution as discussed during 1x1 on July 23, 2024

  1. Find the files that correspond to the records in mongo. Places to look: your local system, $SCRATCH on Perlmutter, or /global/cfs/cdirs/m3408 on Perlmutter. We both strongly suspected that these files must exist somewhere, because it is unclear how else the DataObject records would have been created. Use the Unix commands grep or find to search for, for example, nmdc_wfmgas-11-7xb99s46, which is the file name prefix for data_object_set id nmdc:dobj-11-mbnnyt32. Spend at most a few hours on this before moving to the next option.

  2. Use what is in mongo prod plus nmdc_automation/re_iding/scripts/data/nmdc:sty-11-547rwq94/nmdc:sty-11-547rwq94_updated_record_identifiers.tsv to make records that match the identifiers already used. Start with the records in mongo prod, because some of the records in nmdc:sty-11-547rwq94_updated_record_identifiers.tsv have subsequently been deleted in https://github.com/microbiomedata/nmdc_automation/issues/162. Example, for metagenome_assembly_set id nmdc:wfmgas-11-7xb99s46.1: parse nmdc:sty-11-547rwq94_associated_record_dump.json, or a backup of prod that predates re-iding, together with nmdc:sty-11-547rwq94_updated_record_identifiers.tsv to find the legacy assembly record. The value of has_input for nmdc:wfmgas-11-7xb99s46.1 is nmdc:dobj-11-ksmwnd71, which maps to a legacy identifier of nmdc:4d9348c294cc6f2924616383a37bd132. That record has a url of https://data.microbiomedata.org/data/nmdc:mga0szsj83/assembly/nmdc_mga0szsj83_contigs.fna. Alternatively, you should be able to reconstruct the path from the value of part_of in the legacy MetagenomeAssembly record (i.e. https://data.microbiomedata.org/data/$PART_OF/assembly). Use the existing re-iding code plus nmdc:sty-11-547rwq94_updated_record_identifiers.tsv to make the needed file system updates, including changing headers, so that /global/cfs/cdirs/m3408/results/nmdc:mga0szsj83/assembly/nmdc_mga0szsj83_contigs.fna is re-ided to /global/cfs/cdirs/m3408/results/nmdc:omprc-11-xxv2qg83/nmdc:wfmgas-11-7xb99s46.1/nmdc_wfmgas-11-7xb99s46.1_contigs.fna with a DataObject id of nmdc:dobj-11-mbnnyt32. Sanity check that the file_size_bytes and md5_checksum values for the records in mongo match the newly re-ided records. If it is not merged with main yet, use this branch for the log files. If it takes more than 1-2 days to write the code, proceed to option 3.

  3. Write code to make a JSON body for the queries:run endpoint that deletes the metagenome_assembly_set records that come from nmdc:sty-11-547rwq94 (aggregate on omics_processing_set part_of to get the list) and the downstream data_object_set records that are has_output of those metagenome_assembly_set records. Execute the queries:run commands, double-check prod, then provide Shane with a list of omics processing records to trigger.
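
The path change in option 2 can be sketched as follows. This is a hypothetical helper, not the existing re-iding code: the layout results/<omics processing ID>/<workflow ID>/ and the file name prefix swap are inferred from the single example in option 2, and the function does not touch file headers.

```python
from pathlib import PurePosixPath

# Results root as it appears in the paths quoted in option 2.
RESULTS_ROOT = PurePosixPath("/global/cfs/cdirs/m3408/results")


def reided_path(legacy_path: str, legacy_id: str, omics_id: str,
                workflow_id: str) -> PurePosixPath:
    """Map a legacy assembly file path to its re-ided location.

    Assumption: file names start with the activity ID with ':' replaced
    by '_', and only that prefix changes during re-iding.
    """
    legacy = PurePosixPath(legacy_path)
    old_prefix = legacy_id.replace(":", "_")    # nmdc:mga0szsj83 -> nmdc_mga0szsj83
    new_prefix = workflow_id.replace(":", "_")  # nmdc:wfmgas-... -> nmdc_wfmgas-...
    if not legacy.name.startswith(old_prefix):
        raise ValueError(f"{legacy.name!r} does not start with {old_prefix!r}")
    new_name = new_prefix + legacy.name[len(old_prefix):]
    return RESULTS_ROOT / omics_id / workflow_id / new_name


new_path = reided_path(
    "/global/cfs/cdirs/m3408/results/nmdc:mga0szsj83/assembly/nmdc_mga0szsj83_contigs.fna",
    legacy_id="nmdc:mga0szsj83",
    omics_id="nmdc:omprc-11-xxv2qg83",
    workflow_id="nmdc:wfmgas-11-7xb99s46.1",
)
print(new_path)
```

With the values from option 2, this reproduces the target path given there.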

mbthornton-lbl commented 2 months ago

Option 1 does not yield results.

Option 2 is working at least for the example case:

Assembly: nmdc:wfmgas-11-fk987013.1

    "has_output" : [
        "nmdc:dobj-11-sv15zh53",
        "nmdc:dobj-11-yj00s852",
        "nmdc:dobj-11-bxk8ap92",
        "nmdc:dobj-11-qsc5dw38",
        "nmdc:dobj-11-2a2mc690"
    ],

Searching nmdc:sty-11-547rwq94_updated_record_identifiers.tsv:

data_object_set nmdc:ca87a5b68734d57af76ae5b39cfb3368   nmdc:dobj-11-sv15zh53
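
A lookup like the one above could be scripted along these lines. The column layout (collection, legacy ID, new ID, tab-separated) is an assumption based on the single row shown, and `load_id_map` is a hypothetical name.

```python
import csv
import io


def load_id_map(tsv_file):
    """Return {new_id: (collection, legacy_id)} from an open TSV file.

    Assumption: each row has three tab-separated fields in the order
    collection, legacy ID, new ID. Rows with another shape are skipped.
    """
    id_map = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) != 3:
            continue
        collection, legacy_id, new_id = (field.strip() for field in row)
        id_map[new_id] = (collection, legacy_id)
    return id_map


# In-memory stand-in for the TSV, using the row quoted above.
sample = io.StringIO(
    "data_object_set\tnmdc:ca87a5b68734d57af76ae5b39cfb3368\tnmdc:dobj-11-sv15zh53\n"
)
id_map = load_id_map(sample)
print(id_map["nmdc:dobj-11-sv15zh53"])
```

Keying the map by the new ID matches the direction of the search here: start from the has_output values in mongo and work back to the legacy identifier.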

Search for the legacy data object ID in nmdc:sty-11-547rwq94_associated_record_dump.json:

{
    "description": "Assembled contigs fasta for gold:Gp0452693",
    "url": "https://data.microbiomedata.org/data/nmdc:mga0v1h344/assembly/nmdc_mga0v1h344_contigs.fna",
    "md5_checksum": "ca87a5b68734d57af76ae5b39cfb3368",
    "file_size_bytes": 12347117,
    "id": "nmdc:ca87a5b68734d57af76ae5b39cfb3368",
    "name": "gold:Gp0452693_Assembled contigs fasta",
    "data_object_type": "Assembly Contigs"
}
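
Going from the legacy url to a directory to inspect can be sketched as below. The mapping from https://data.microbiomedata.org/data/ to /global/cfs/cdirs/m3408/results/ is inferred from the url and path pairs quoted in this ticket; `url_to_results_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

DATA_HOST = "data.microbiomedata.org"
DATA_URL_PREFIX = "/data/"
# Assumed file system location behind the /data/ URL prefix.
RESULTS_ROOT = PurePosixPath("/global/cfs/cdirs/m3408/results")


def url_to_results_path(url: str) -> PurePosixPath:
    """Translate a data_object_set url into the assumed path on disk."""
    parsed = urlparse(url)
    if parsed.netloc != DATA_HOST or not parsed.path.startswith(DATA_URL_PREFIX):
        raise ValueError(f"not a {DATA_HOST} data URL: {url!r}")
    relative = unquote(parsed.path[len(DATA_URL_PREFIX):])
    return RESULTS_ROOT / relative


legacy_path = url_to_results_path(
    "https://data.microbiomedata.org/data/nmdc:mga0v1h344/assembly/nmdc_mga0v1h344_contigs.fna"
)
print(legacy_path.parent)  # directory to list on the file system
```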

Search the file system for nmdc:mga0v1h344/assembly:

[nmdcda@dtn01 results]$ ll nmdc:mga0v1h344/assembly
total 1119686
drwxr-sr-x 2 nmdcda m3408      4096 Apr 11  2022 .
drwxr-sr-x 7 nmdcda m3408      4096 Apr 12  2022 ..
-rw-r--r-- 2 nmdcda m3408      1316 Apr 11  2022 activity.json
-rw-r--r-- 2 nmdcda m3408      1924 Apr 11  2022 data_objects.json
-rw-r--r-- 2 nmdcda m3408       539 Apr 11  2022 nmdc_mga0v1h344_asm_stats.json
-rw-r--r-- 2 nmdcda m3408   1874033 Apr 11  2022 nmdc_mga0v1h344_assembly.agp
-rw-r--r-- 2 nmdcda m3408        33 Apr 11  2022 nmdc_mga0v1h344_assembly.agp.md5
-rw-r--r-- 2 nmdcda m3408  12347117 Apr 11  2022 nmdc_mga0v1h344_contigs.fna
-rw-r--r-- 2 nmdcda m3408        33 Apr 11  2022 nmdc_mga0v1h344_contigs.fna.md5
-rw-r--r-- 2 nmdcda m3408   2163125 Apr 11  2022 nmdc_mga0v1h344_covstats.txt
-rw-r--r-- 2 nmdcda m3408        33 Apr 11  2022 nmdc_mga0v1h344_covstats.txt.md5
-rw-r--r-- 2 nmdcda m3408 551315890 Apr 11  2022 nmdc_mga0v1h344_pairedMapped.sam.gz
-rw-r--r-- 2 nmdcda m3408 566541523 Apr 11  2022 nmdc_mga0v1h344_pairedMapped_sorted.bam
-rw-r--r-- 2 nmdcda m3408        33 Apr 11  2022 nmdc_mga0v1h344_pairedMapped_sorted.bam.md5
-rw-r--r-- 2 nmdcda m3408  12259622 Apr 11  2022 nmdc_mga0v1h344_scaffolds.fna
-rw-r--r-- 2 nmdcda m3408        33 Apr 11  2022 nmdc_mga0v1h344_scaffolds.fna.md5
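
In the listing, nmdc_mga0v1h344_contigs.fna is 12347117 bytes, which matches file_size_bytes in the legacy record. The same sanity check, extended to md5_checksum, could be sketched like this. `matches_record` is a hypothetical helper; the field names are taken from the record shown earlier in this comment.

```python
import hashlib
from pathlib import Path


def matches_record(path, record, chunk_size=1 << 20):
    """Check a file against a DataObject record's size and MD5.

    The size is compared first so that large files with the wrong size
    are rejected without being read.
    """
    path = Path(path)
    if path.stat().st_size != record["file_size_bytes"]:
        return False
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so that multi-gigabyte files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == record["md5_checksum"]
```

Called with the contigs path and the record above, this should return True if the file on disk is the one the legacy record describes.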

mbthornton-lbl commented 1 month ago

Applied database updates.json to Prod: { "ok": 1, "n": 790, "nModified": 790, "upserted": null, "writeErrors": null }
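
A response like this can be checked programmatically before closing the ticket. The sketch below assumes the response is available as a Python dict with the fields shown above; `update_succeeded` is a hypothetical helper, and the expected count would have to come from the number of updates in updates.json.

```python
def update_succeeded(result: dict, expected: int) -> bool:
    """Return True if every expected document was matched and modified."""
    return (
        result.get("ok") == 1
        and result.get("n") == expected
        and result.get("nModified") == expected
        and not result.get("writeErrors")
    )


# Response quoted above, with JSON null written as Python None.
result = {"ok": 1, "n": 790, "nModified": 790, "upserted": None, "writeErrors": None}
print(update_succeeded(result, expected=790))
```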