microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Re-ID, Ingest to `Napa` DB, and Verify Napa compliance for "SPRUCE" study `nmdc:sty-11-33fbta56` #1789

Closed mbthornton-lbl closed 3 months ago

mbthornton-lbl commented 5 months ago

Note: The scope of this work is the Napa database instance. The same steps will need to be repeated in a prod-ready environment.

For the "SPRUCE" study: id nmdc:sty-11-33fbta56, legacy id gold:Gs0110138

mbthornton-lbl commented 5 months ago

delete-old-records continues to have issues when going through the queries:run endpoint. Each deletion can take several seconds, and we are getting HTTP dropouts like so:


INFO:root:Deleting None record: nmdc:58e0fa8e0426f18fd6f5fda52b90a57d
INFO:root:An error occured while running: {'delete': 'data_object_set', 'deletes': [{'q': {'id': 'nmdc:58e0fa8e0426f18fd6f5fda52b90a57d'}, 'limit': 1}]}, response retutrned: HTTPSConnectionPool(host='api-napa.microbiomedata.org', port=443): Max retries exceeded with url: /queries:run (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x105b0ec40>: Failed to resolve 'api-napa.microbiomedata.org' ([Errno 8] nodename nor servname provided, or not known)"))
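One way to ride out transient dropouts like the one above is to retry each deletion with exponential backoff. The following is only a minimal sketch, not the delete-old-records implementation: `run_with_retry`, `delete_payload`, and the injected `send` callable are hypothetical names, and only the payload shape is taken from the log line above. It assumes the HTTP client raises an `OSError` subclass on connection failures.

```python
import time
from typing import Callable, Optional


def delete_payload(record_id: str, collection: str = "data_object_set") -> dict:
    """Build a single-record delete command shaped like the one in the log above."""
    return {"delete": collection, "deletes": [{"q": {"id": record_id}, "limit": 1}]}


def run_with_retry(
    send: Callable[[dict], dict],
    payload: dict,
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send(payload), retrying on OSError with exponential backoff.

    `send` is a hypothetical callable that posts the payload to the API.
    Assumption: the HTTP client signals DNS and connection failures with
    an OSError subclass.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return send(payload)
        except OSError as exc:
            last_error = exc
            # Wait 1s, 2s, 4s, ... between attempts; no wait after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Passing `send` and `sleep` in as arguments keeps the retry logic testable without touching the network.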
mbthornton-lbl commented 5 months ago

@corilo There is one issue with a mags_activity data object:

/Users/MBThornton/Library/Caches/pypoetry/virtualenvs/nmdc-schema-22PpSQmj-py3.9/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
INFO:root:Using SchemaView with im=None
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] Additional properties are not allowed ('detail' was unexpected) in /data_object_set/1678
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] 'id' is a required property in /data_object_set/1678
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] 'name' is a required property in /data_object_set/1678
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] 'description' is a required property in /data_object_set/1678
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] 'nmdc:78f8bf24916f01d053378b1bd464cd8a' does not match '^(nmdc):wfmag-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\\.[A-Za-z0-9]{1,})*(_[A-Za-z0-9_\\.-]+)?$' in /mags_activity_set/0/id
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] 'nmdc:061846685755316cd5f20d4035212ba1' does not match '^(nmdc):wfmag-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\\.[A-Za-z0-9]{1,})*(_[A-Za-z0-9_\\.-]+)?$' in /mags_activity_set/1/id
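To find such legacy IDs before running the validator, the pattern from the error message can be applied directly. This is a rough sketch: the regex is copied from the message above (with the doubled backslashes unescaped), `legacy_ids` is a hypothetical helper, and the conforming ID in the usage example is made up for illustration.

```python
import re

# Pattern copied from the validation error for /mags_activity_set/*/id.
WFMAG_ID = re.compile(
    r"^(nmdc):wfmag-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})"
    r"(\.[A-Za-z0-9]{1,})*(_[A-Za-z0-9_\.-]+)?$"
)


def legacy_ids(ids):
    """Return the IDs that do not match the wfmag pattern."""
    return [i for i in ids if not WFMAG_ID.match(i)]


candidates = [
    "nmdc:78f8bf24916f01d053378b1bd464cd8a",  # from the error output above
    "nmdc:061846685755316cd5f20d4035212ba1",  # from the error output above
    "nmdc:wfmag-11-abc123.1",                 # hypothetical conforming ID
]
print(legacy_ids(candidates))
```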
mbthornton-lbl commented 5 months ago

After fixing a bug in the delete-old-records command:

INFO:root:Using SchemaView with im=None
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] Additional properties are not allowed ('detail' was unexpected) in /data_object_set/1675
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] 'id' is a required property in /data_object_set/1675
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] 'name' is a required property in /data_object_set/1675
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] 'description' is a required property in /data_object_set/1675

mbthornton-lbl commented 5 months ago

linkml-convert fails

mbthornton-lbl commented 4 months ago

Re-extracted this study after the fix in #1809 and re-ran linkml-validate:

(nmdc-schema-py3.9) (base) MBThornton@MBThornton-M92 nmdc-schema % linkml-validate -s ./local/nmdc-schema-v8.0.0.yaml ./local/nmdc:sty-11-33fbta56.yaml 

INFO:root:Using SchemaView with im=None
No issues found
(nmdc-schema-py3.9) (base) MBThornton@MBThornton-M92 nmdc-schema % 
mbthornton-lbl commented 4 months ago

@corilo There are 50 NOM Workflows with DataObject IDs that are not in the DB:

(nmdc-schema-py3.9) (base) MBThornton@MBThornton-M92 nmdc-schema % grep ERROR ./local/nmdc:sty-11-33fbta56.log
11:02:49,703 root ERROR MissingDataObject nmdc:dobj-11-ztbv2p46 for None / nmdc:wfnom-11-smcr9z56not found  in the NMDC database.
11:02:50,326 root ERROR MissingDataObject nmdc:dobj-11-x5xjqd77 for None / nmdc:wfnom-11-pjq1np68not found  in the NMDC database.
11:02:50,905 root ERROR MissingDataObject nmdc:dobj-11-0cbpm036 for None / nmdc:wfnom-11-558bnt86not found  in the NMDC database.
11:02:51,484 root ERROR MissingDataObject nmdc:dobj-11-3g333768 for None / nmdc:wfnom-11-1bppxe39not found  in the NMDC database.
11:02:52,50 root ERROR MissingDataObject nmdc:dobj-11-mqxpdk90 for None / nmdc:wfnom-11-wsstke35not found  in the NMDC database.
11:02:52,685 root ERROR MissingDataObject nmdc:dobj-11-2mf0ck30 for None / nmdc:wfnom-11-sdtmv824not found  in the NMDC database.
11:02:53,260 root ERROR MissingDataObject nmdc:dobj-11-ptn3y828 for None / nmdc:wfnom-11-hd7jqk13not found  in the NMDC database.
11:02:53,855 root ERROR MissingDataObject nmdc:dobj-11-5078ek40 for None / nmdc:wfnom-11-a3btbx96not found  in the NMDC database.
11:02:54,454 root ERROR MissingDataObject nmdc:dobj-11-2es9cb20 for None / nmdc:wfnom-11-18ysx178not found  in the NMDC database.
11:02:55,46 root ERROR MissingDataObject nmdc:dobj-11-yzw19k52 for None / nmdc:wfnom-11-xb2spg83not found  in the NMDC database.
11:02:55,657 root ERROR MissingDataObject nmdc:dobj-11-xey2px87 for None / nmdc:wfnom-11-r5249v25not found  in the NMDC database.
11:02:56,261 root ERROR MissingDataObject nmdc:dobj-11-mm4v3w81 for None / nmdc:wfnom-11-1x95zh14not found  in the NMDC database.
11:02:56,878 root ERROR MissingDataObject nmdc:dobj-11-0wzvat27 for None / nmdc:wfnom-11-ba9mqz70not found  in the NMDC database.
11:02:57,485 root ERROR MissingDataObject nmdc:dobj-11-2pgsp333 for None / nmdc:wfnom-11-8w8t3j79not found  in the NMDC database.
11:02:58,90 root ERROR MissingDataObject nmdc:dobj-11-57yeca15 for None / nmdc:wfnom-11-jzm23e74not found  in the NMDC database.
11:02:58,698 root ERROR MissingDataObject nmdc:dobj-11-k9jxrf86 for None / nmdc:wfnom-11-4yejyf26not found  in the NMDC database.
11:02:59,278 root ERROR MissingDataObject nmdc:dobj-11-d2026a83 for None / nmdc:wfnom-11-yhf3ba83not found  in the NMDC database.
11:02:59,879 root ERROR MissingDataObject nmdc:dobj-11-aj065031 for None / nmdc:wfnom-11-k817cz71not found  in the NMDC database.
11:03:00,476 root ERROR MissingDataObject nmdc:dobj-11-5nxka964 for None / nmdc:wfnom-11-h2w6je98not found  in the NMDC database.
11:03:01,96 root ERROR MissingDataObject nmdc:dobj-11-d8qyvj73 for None / nmdc:wfnom-11-1aevmd70not found  in the NMDC database.
11:03:01,698 root ERROR MissingDataObject nmdc:dobj-11-7v0f8v61 for None / nmdc:wfnom-11-ht2xnt32not found  in the NMDC database.
11:03:02,297 root ERROR MissingDataObject nmdc:dobj-11-gwrd8z56 for None / nmdc:wfnom-11-z51yze33not found  in the NMDC database.
11:03:02,901 root ERROR MissingDataObject nmdc:dobj-11-1hqhzh12 for None / nmdc:wfnom-11-t0v70190not found  in the NMDC database.
11:03:03,503 root ERROR MissingDataObject nmdc:dobj-11-3jgncm19 for None / nmdc:wfnom-11-pjgqr851not found  in the NMDC database.
11:03:04,72 root ERROR MissingDataObject nmdc:dobj-11-5b81zh83 for None / nmdc:wfnom-11-hrh82f26not found  in the NMDC database.
11:03:04,656 root ERROR MissingDataObject nmdc:dobj-11-jf77qb43 for None / nmdc:wfnom-11-0ddtxb36not found  in the NMDC database.
11:03:05,262 root ERROR MissingDataObject nmdc:dobj-11-3093qg91 for None / nmdc:wfnom-11-2qeh9y42not found  in the NMDC database.
11:03:05,848 root ERROR MissingDataObject nmdc:dobj-11-vjz44869 for None / nmdc:wfnom-11-5z6a8f80not found  in the NMDC database.
11:03:06,457 root ERROR MissingDataObject nmdc:dobj-11-kc2pzt82 for None / nmdc:wfnom-11-ktx45570not found  in the NMDC database.
11:03:07,43 root ERROR MissingDataObject nmdc:dobj-11-e00t7r76 for None / nmdc:wfnom-11-x5hffc07not found  in the NMDC database.
11:03:07,653 root ERROR MissingDataObject nmdc:dobj-11-3fj82w22 for None / nmdc:wfnom-11-nevepm94not found  in the NMDC database.
11:03:08,242 root ERROR MissingDataObject nmdc:dobj-11-aw54dn02 for None / nmdc:wfnom-11-j046q055not found  in the NMDC database.
11:03:08,834 root ERROR MissingDataObject nmdc:dobj-11-qaqakp19 for None / nmdc:wfnom-11-xrabz763not found  in the NMDC database.
11:03:09,434 root ERROR MissingDataObject nmdc:dobj-11-0r0g2p17 for None / nmdc:wfnom-11-r14yhp90not found  in the NMDC database.
11:03:10,22 root ERROR MissingDataObject nmdc:dobj-11-35jght91 for None / nmdc:wfnom-11-z80x9z14not found  in the NMDC database.
11:03:10,615 root ERROR MissingDataObject nmdc:dobj-11-7eyhe417 for None / nmdc:wfnom-11-48614c11not found  in the NMDC database.
11:03:11,208 root ERROR MissingDataObject nmdc:dobj-11-7d139n02 for None / nmdc:wfnom-11-2atr5z74not found  in the NMDC database.
11:03:11,830 root ERROR MissingDataObject nmdc:dobj-11-3h5th638 for None / nmdc:wfnom-11-7xtfec25not found  in the NMDC database.
11:03:12,428 root ERROR MissingDataObject nmdc:dobj-11-c513t353 for None / nmdc:wfnom-11-0my70030not found  in the NMDC database.
11:03:13,36 root ERROR MissingDataObject nmdc:dobj-11-bbsq0e70 for None / nmdc:wfnom-11-1yz7tz37not found  in the NMDC database.
11:03:13,634 root ERROR MissingDataObject nmdc:dobj-11-x2am3c25 for None / nmdc:wfnom-11-wv3pt090not found  in the NMDC database.
11:03:14,222 root ERROR MissingDataObject nmdc:dobj-11-xb5n8830 for None / nmdc:wfnom-11-yryh7465not found  in the NMDC database.
11:03:14,834 root ERROR MissingDataObject nmdc:dobj-11-mb3sa895 for None / nmdc:wfnom-11-v0hfgv41not found  in the NMDC database.
11:03:15,414 root ERROR MissingDataObject nmdc:dobj-11-b83b0k89 for None / nmdc:wfnom-11-j6h7q390not found  in the NMDC database.
11:03:16,23 root ERROR MissingDataObject nmdc:dobj-11-bakewf83 for None / nmdc:wfnom-11-669h3e47not found  in the NMDC database.
11:03:16,638 root ERROR MissingDataObject nmdc:dobj-11-ze8mwq93 for None / nmdc:wfnom-11-6z55r709not found  in the NMDC database.
11:03:17,256 root ERROR MissingDataObject nmdc:dobj-11-gw7vh076 for None / nmdc:wfnom-11-zr5g4x66not found  in the NMDC database.
11:03:17,871 root ERROR MissingDataObject nmdc:dobj-11-b09kzm65 for None / nmdc:wfnom-11-8nje0s46not found  in the NMDC database.
11:03:18,481 root ERROR MissingDataObject nmdc:dobj-11-yfh14e47 for None / nmdc:wfnom-11-vh50e238not found  in the NMDC database.
11:03:20,329 root ERROR MissingDataObject nmdc:dobj-11-s67n8g83 for None / nmdc:wfnom-11-f7cgc846not found  in the NMDC database.
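To hand the affected IDs over for a fix, the pairs can be pulled out of the log mechanically. The sketch below assumes the line format shown above, including the missing space before "not found"; `parse_missing` and `SAMPLE_LOG` are hypothetical names, and the sample reuses the first two lines of the output.

```python
import re

# Matches lines like:
#   11:02:49,703 root ERROR MissingDataObject <dobj id> for None / <workflow id>not found ...
# The workflow ID runs straight into "not found", so it is matched lazily.
LINE_RE = re.compile(
    r"ERROR MissingDataObject (?P<dobj>\S+) for \S+ / (?P<wf>\S+?)\s*not found"
)

SAMPLE_LOG = """\
11:02:49,703 root ERROR MissingDataObject nmdc:dobj-11-ztbv2p46 for None / nmdc:wfnom-11-smcr9z56not found  in the NMDC database.
11:02:50,326 root ERROR MissingDataObject nmdc:dobj-11-x5xjqd77 for None / nmdc:wfnom-11-pjq1np68not found  in the NMDC database.
"""


def parse_missing(log_text):
    """Return (data_object_id, workflow_id) pairs for each MissingDataObject line."""
    pairs = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            pairs.append((match.group("dobj"), match.group("wf")))
    return pairs


for dobj, wf in parse_missing(SAMPLE_LOG):
    print(dobj, wf)
```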
mbthornton-lbl commented 4 months ago

Running linkml-validate against schema v10.1.10 reports 2 errors:

INFO:root:Using SchemaView with im=None
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] Additional properties are not allowed ('award_dois' was unexpected) in /study_set/0
[ERROR] [./local/nmdc:sty-11-33fbta56.yaml/0] 'study_category' is a required property in /study_set/0
mbthornton-lbl commented 4 months ago

@corilo has updated the NOM DataObjects. Results of extract-study are now:

Extracted studies: ['gold:Gs0110138', 'nmdc:sty-11-33fbta56'] from the NMDC database in 0:10:13.010635.
No orphaned data objects found.
No missing data objects found.
Writing results to /Users/MBThornton/Documents/code/nmdc-schema/local/nmdc:sty-11-33fbta56.yaml.
mbthornton-lbl commented 4 months ago

Schema v10 compatibility issues will be addressed by: https://github.com/microbiomedata/nmdc_automation/issues/66

aclum commented 4 months ago

{'description':{$regex:/Gp0208377/}} on data_object_set returns 7 legacy data object records
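The same filter can be written as a MongoDB-style find command in JSON, which could be convenient for re-checking after each cleanup pass. This is a sketch under assumptions: `find_by_description` is a hypothetical helper, and whether the queries:run endpoint accepts a find command with `$regex` in this form is not confirmed in this thread.

```python
import json


def find_by_description(collection, pattern):
    """Build a find command that matches documents whose description contains `pattern`."""
    return {"find": collection, "filter": {"description": {"$regex": pattern}}}


payload = find_by_description("data_object_set", "Gp0208377")
print(json.dumps(payload, indent=2))
```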

ssarrafan commented 4 months ago

@aclum should this re-opened issue be moved to the next sprint?

aclum commented 4 months ago

@mbthornton-lbl There are still legacy data objects named "tooShort (< 3kb) filtered contigs fasta file by metaBat2 for $GOLD_PROJECT_ID". I thought we re-added the logic to clean up files based on the GOLD names.