Consolidation of multiple C2M2 Submission Generator issues (2Q2021 submission)

Introduction

The following consolidates multiple C2M2 Submission Generator issues into one.

Troubleshooting details for the 2Q2021 C2M2 submission from which these issues were distilled are at C2M2 Submission Checklist_and_investigation 2Q2021.
All KF artifacts from the 2Q2021 testing are located here.
The final successful submission zip of C2M2 files for 2Q2021 is c2m2_submission_20210610_10.zip.

Further CFDE background information on C2M2 is available at:

C2M2 Wiki: https://github.com/nih-cfde/published-documentation/wiki
Technical Documentation: https://docs.nih-cfde.org/en/latest/

Issues

1. Make corrections to C2M2 Submission Generator README example execution code / instructions:

correct virtual environment syntax
correct cloning syntax for branch cloning
clarify that a Docker container does not need to be used
clarify that the user must be connected to the CHOP network, either on site or over VPN, when executing the code
call out that any errors noted during the prerequisite install need to be addressed. For example, these pypi.org references for additional installs help resolve such errors:
• Click 7.1.1: https://pypi.org/project/click/7.1.1/
• Globus-sdk: https://pypi.org/project/globus-sdk/

Suggested revision of the README example code:

python3 -m venv virtualenv2
source virtualenv2/bin/activate
git clone -b c2m2-loader https://github.com/kids-first/kf-api-study-creator.git
cd kf-api-study-creator
pip install -r requirements.txt
pip install -r dev-requirements.txt
DATASERVICE_URL=https://kf-api-dataservice.kidsfirstdrc.org ./manage.py c2m2_load

C2M2 Submission Generator bug fix required: Code must validate that any subject in the biosample_from_subject.tsv is in the subject.tsv table. Do not include any biosample_from_subject records with orphan subject records.
C2M2 Submission Generator bug fix required: Code must validate that any file in the file_describes_biosample.tsv is in the file.tsv table. Do not include any file_describes_biosample records with orphan file records.
C2M2 Submission Generator bug fix required: Code must only include distinct namespaces in the namespace.tsv. Do not include any duplicate namespace records in the namespace.tsv file.
C2M2 Submission Generator bug fix required: Code must ensure that primary_dcc_contact.tsv contains data. Do not generate a blank primary_dcc_contact.tsv file.
C2M2 Submission Generator bug fix required: Replace the out-of-date JSON copy with the most recent CFDE-approved C2M2_datapackage.json version.
C2M2 Submission Generator bug fix required: Code should incorporate the CFDE frictionless validator, which validates that key constraints are maintained in the C2M2 submission files: https://github.com/nih-cfde/published-documentation/wiki/Quickstart#optional-frictionless
C2M2 Submission Generator bug fix required: Code must validate that file_describes_biosample.tsv includes only records that have corresponding biosample records in biosample.tsv. Do not include any file_describes_biosample.tsv records with no corresponding biosample records.
C2M2 Submission Generator bug fix required: Enhance code to include assay_type in file.tsv as a valid OBI ID, based on the KF "experimental strategy" (Note: OBI lookup service http://www.ontobee.org/ontology/OBI?iri=http://purl.obolibrary.org/obo/OBI_0000070). Include a distinct list of all file.tsv assay_types in assay_type.tsv. Consider how to add new OBI IDs in the future for new experimental strategies. NOTE: there is an option to incorporate the CFDE build term tables code, which will automatically generate the assay_type.tsv table.
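
The orphan-record fixes listed above (biosample_from_subject against subject, file_describes_biosample against file and against biosample) share one pattern: collect the keys present in the parent table, then keep only the child rows whose foreign key is in that set. The following is a minimal Python sketch of that pattern, not the generator's actual code. The helper names read_tsv and drop_orphans are hypothetical, the column names are assumed to follow the C2M2 table layout, and the sample rows are made up for illustration.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def drop_orphans(rows, key_fields, valid_keys):
    """Split rows into (kept, dropped) by whether their foreign key exists.

    key_fields: the columns in each row that form the foreign key.
    valid_keys: set of key tuples present in the parent table.
    """
    kept, dropped = [], []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        (kept if key in valid_keys else dropped).append(row)
    return kept, dropped


# Illustrative data only; real input would be the generated TSV files.
subject_tsv = "id_namespace\tlocal_id\nkidsfirst:\tPT_1\n"
biosample_from_subject_tsv = (
    "biosample_id_namespace\tbiosample_local_id\t"
    "subject_id_namespace\tsubject_local_id\n"
    "kidsfirst:\tBS_1\tkidsfirst:\tPT_1\n"
    "kidsfirst:\tBS_2\tkidsfirst:\tPT_404\n"
)

# Keys that exist in the parent table (subject.tsv).
subject_keys = {
    (row["id_namespace"], row["local_id"]) for row in read_tsv(subject_tsv)
}

kept, dropped = drop_orphans(
    read_tsv(biosample_from_subject_tsv),
    ("subject_id_namespace", "subject_local_id"),
    subject_keys,
)
```

Returning the dropped rows alongside the kept ones lets the generator log which records were excluded rather than discarding them silently. The same call would apply to file_describes_biosample.tsv, once with the file key columns and once with the biosample key columns.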

Current KF experimental strategies and their OBI IDs:
WGS (Whole Genome Sequencing): http://purl.obolibrary.org/obo/OBI_0002117 (OBI:0002117)
WXS (Whole Exome Sequencing): http://purl.obolibrary.org/obo/OBI_0002118 (OBI:0002118)
miRNA ("small RNA sequencing assay"): http://purl.obolibrary.org/obo/OBI_0002112 (OBI:0002112)
No associated OBI ID chosen (blank): Linked-Read WGS (10x Chromium); Targeted Sequencing
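
The mapping above could be kept in a single lookup table so that new experimental strategies are added in one place. The sketch below is one possible shape, not the generator's existing code. The names EXPERIMENTAL_STRATEGY_TO_OBI, assay_type_for, and distinct_assay_types are hypothetical, and the dictionary keys are assumed to match the strategy strings exactly as written in this issue; the values stored in the Data Service may differ.

```python
# Assumed strategy strings, taken from the list in this issue.
EXPERIMENTAL_STRATEGY_TO_OBI = {
    "WGS": "OBI:0002117",
    "WXS": "OBI:0002118",
    "miRNA": "OBI:0002112",
    # No OBI ID chosen yet, so these are deliberately left unmapped:
    # "Linked-Read WGS (10x Chromium)", "Targeted Sequencing"
}


def assay_type_for(experimental_strategy):
    """Return the OBI ID for a strategy, or a blank string if unmapped."""
    return EXPERIMENTAL_STRATEGY_TO_OBI.get(experimental_strategy, "")


def distinct_assay_types(file_rows):
    """Return the sorted distinct non-blank assay_type values in file rows."""
    return sorted({row["assay_type"] for row in file_rows if row["assay_type"]})


# Illustrative file.tsv rows (only the assay_type column is shown).
file_rows = [
    {"assay_type": assay_type_for("WGS")},
    {"assay_type": assay_type_for("WGS")},
    {"assay_type": assay_type_for("WXS")},
    {"assay_type": assay_type_for("Targeted Sequencing")},
]
assay_type_ids = distinct_assay_types(file_rows)
```

Blank values are skipped when building the distinct list, so unmapped strategies never produce an empty row in assay_type.tsv.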

C2M2 Submission Generator bug fix required: Enhance code to include anatomy in biosample.tsv as an Uberon ID (Note: Uberon lookup service https://www.ebi.ac.uk/ols/ontologies/uberon). Include a distinct list of all anatomy entries from biosample.tsv in the anatomy.tsv table. Refer to the list of Uberon ID mappings for the distinct list of separately exported composition descriptions (Uberon Mapping tab / list). Consider how to add new Uberon IDs, or how to build the Uberon ID mappings upstream based on composition, so that the KF biosample uberon_id_anatomical_site is populated. NOTE: there is an option to incorporate the CFDE build term tables code, which will automatically generate the anatomy.tsv table.
C2M2 Submission Generator enhancement: include the Homo sapiens NCBI ID. Add taxonomy_id to subject_role_taxonomy.tsv (all records from subject.tsv, with role_id cfde_subject_role:0 and taxonomy_id NCBI:txid9606). The ncbi_taxonomy.tsv table needs an entry for Homo sapiens. NOTE: there is an option to incorporate the CFDE build term tables code, which will automatically generate the ncbi_taxonomy.tsv table.
C2M2 Submission Generator bug fix required: Resolve issue with orphan records lacking species
C2M2 Submission Generator bug fix required: Replace hard-coded KF study list in globals.py with a dynamic pull of studies on the portal. Define and harden the logic for this to ensure that the studies returned are those that are viewable on the Kids First Portal.
C2M2 Submission Generator: harden code to eliminate redundant manual steps, optimize code execution and limit timeouts, and improve logging to capture step execution progress and the completion status of the export by file. Since this involves bulk data exports, reexamine whether the best approach is to use the Data Service or whether direct queries against the Data Service backend database are preferred.
C2M2 Submission Generator: test and refactor the code that determines what a given file type is and with which EDAM ontology (format::) id the file should be associated. There are cases where files are named [filename].bam.bai, [filename].cram.crai, or [filename].vcf.tbi, but the current code interprets those files as .bam, .cram, and .vcf, respectively. Other examples may also exist with different file extensions.
C2M2 Submission Generator: enhance the Data Service and the code to pull accurate md5 hashes for the files in the md5 field of the file.tsv
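
For the file-type item above, one way to avoid treating an index file such as [filename].bam.bai as a .bam file is to match on the trailing suffix only, trying the longest known suffix first. The sketch below illustrates that idea under stated assumptions: SUFFIX_TO_LABEL and file_format_label are hypothetical names, the suffix list covers only the cases mentioned in this issue, and the returned labels are placeholders that would still need to be mapped to the appropriate EDAM format IDs.

```python
# Hypothetical table: trailing suffix -> format label (not EDAM IDs).
SUFFIX_TO_LABEL = {
    ".bam.bai": "bai",
    ".cram.crai": "crai",
    ".vcf.tbi": "tbi",
    ".bai": "bai",
    ".crai": "crai",
    ".tbi": "tbi",
    ".bam": "bam",
    ".cram": "cram",
    ".vcf": "vcf",
}


def file_format_label(filename):
    """Return a format label based on the end of the filename, or None.

    Suffixes are tried longest first, and only the end of the name is
    checked, so "x.bam.bai" is classified as an index rather than a BAM.
    """
    name = filename.lower()
    for suffix in sorted(SUFFIX_TO_LABEL, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_TO_LABEL[suffix]
    return None
```

Returning None for unrecognized names gives the caller a clear signal to log the file for review instead of guessing a format.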

kids-first / kf-api-study-creator

Consolidation of multiple C2M2 Submission Generator issues(2Q2021 submission) #733
