kids-first / kf-api-study-creator

📂 Powers investigator-driven data staging. Backend for Data Tracker app
https://kids-first.github.io/kf-api-study-creator
Apache License 2.0

Consolidation of multiple C2M2 Submission Generator issues (2Q2021 submission) #733

Open ericwenger-pm opened 3 years ago

ericwenger-pm commented 3 years ago

Introduction

The following consolidates multiple C2M2 Submission Generator issues into one.

Further CFDE background information on C2M2 is accessible at:

Issues

1. Correct the example execution code and instructions in the C2M2 Submission Generator README.

Suggested revision of the README example code:

python3 -m venv virtualenv2
source virtualenv2/bin/activate
git clone -b c2m2-loader https://github.com/kids-first/kf-api-study-creator.git
cd kf-api-study-creator
pip install -r requirements.txt
pip install -r dev-requirements.txt
DATASERVICE_URL=https://kf-api-dataservice.kidsfirstdrc.org ./manage.py c2m2_load
2. C2M2 Submission Generator bug fix required: The code must validate that every subject referenced in biosample_from_subject.tsv is present in the subject.tsv table. Do not include any biosample_from_subject records whose subject record is missing (orphans).

3. C2M2 Submission Generator bug fix required: The code must validate that every file referenced in file_describes_biosample.tsv is present in the file.tsv table. Do not include any file_describes_biosample records whose file record is missing (orphans).

4. C2M2 Submission Generator bug fix required: The code must include only distinct namespaces in namespace.tsv. Do not include any duplicate namespace records in the namespace.tsv file.

5. C2M2 Submission Generator bug fix required: The code must ensure that primary_dcc_contact.tsv includes data. Do not generate a blank primary_dcc_contact.tsv file.

6. C2M2 Submission Generator bug fix required: Replace the out-of-date JSON copy with the most recent CFDE-approved C2M2_datapackage.json version.

7. C2M2 Submission Generator bug fix required: The code should incorporate the CFDE frictionless validator, which validates that key constraints are maintained in the C2M2 submission files: https://github.com/nih-cfde/published-documentation/wiki/Quickstart#optional-frictionless

8. C2M2 Submission Generator bug fix required: The code must validate that file_describes_biosample.tsv includes only records that have corresponding biosample records in biosample.tsv. Do not include any file_describes_biosample.tsv records with no corresponding biosample record.

9. C2M2 Submission Generator bug fix required: Enhance the code to include assay_type in file.tsv as a valid OBI ID, based on the KF "experimental strategy" (note: OBI lookup service http://www.ontobee.org/ontology/OBI?iri=http://purl.obolibrary.org/obo/OBI_0000070). Include a distinct list of all file.tsv assay_types in assay_type.tsv. Consider how to add new OBI IDs in the future for new experimental strategies. NOTE: there is an option to incorporate the CFDE build term tables code, which will automatically generate the assay_type.tsv table.

Current KF experimental strategies and their OBI IDs:

WGS (Whole Genome Sequencing): http://purl.obolibrary.org/obo/OBI_0002117 (OBI:0002117)
WXS (Whole Exome Sequencing): http://purl.obolibrary.org/obo/OBI_0002118 (OBI:0002118)
miRNA ("small RNA sequencing assay"): http://purl.obolibrary.org/obo/OBI_0002112 (OBI:0002112)
No associated OBI ID chosen (left blank): Linked-Read WGS (10x Chromium); Targeted Sequencing
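
The orphan-record, duplicate-record, and assay_type items above share a simple shape, so here is a minimal Python sketch of how they might be approached. It is not the generator's actual code: the helper names (drop_orphans, distinct, obi_for) are hypothetical, the column names are assumed from C2M2 table conventions, and the "miRNA" key spelling is assumed from the mapping quoted in this issue.

```python
# Experimental strategy -> OBI ID, as listed in this issue.
# The exact KF strategy spellings are an assumption.
STRATEGY_TO_OBI = {
    "WGS": "OBI:0002117",
    "WXS": "OBI:0002118",
    "miRNA": "OBI:0002112",
}


def key_of(row, columns):
    """Build a composite key from the named columns of a row."""
    return tuple(row[c] for c in columns)


def drop_orphans(child_rows, fk_columns, parent_rows, pk_columns):
    """Keep only child rows whose foreign key exists in the parent table."""
    parent_keys = {key_of(r, pk_columns) for r in parent_rows}
    return [r for r in child_rows if key_of(r, fk_columns) in parent_keys]


def distinct(rows):
    """Remove exact duplicate rows, preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def obi_for(strategy):
    """Return the OBI ID for a strategy, or "" when none has been chosen."""
    return STRATEGY_TO_OBI.get(strategy, "")


# Illustrative rows only; column names are assumed, not taken from the repo.
subjects = [{"id_namespace": "kf", "local_id": "PT_1"}]
links = [
    {"subject_id_namespace": "kf", "subject_local_id": "PT_1",
     "biosample_local_id": "BS_1"},
    {"subject_id_namespace": "kf", "subject_local_id": "PT_9",
     "biosample_local_id": "BS_2"},  # PT_9 is not in subjects: an orphan
]
kept = drop_orphans(
    links,
    ("subject_id_namespace", "subject_local_id"),
    subjects,
    ("id_namespace", "local_id"),
)
```

The same drop_orphans call would apply to file_describes_biosample.tsv against file.tsv and biosample.tsv by changing the column tuples. Keeping unmapped strategies blank, rather than guessing an OBI ID, mirrors the current handling of Linked-Read WGS and Targeted Sequencing.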

10. C2M2 Submission Generator bug fix required: Enhance the code to include anatomy in biosample.tsv as an Uberon ID (note: Uberon lookup service https://www.ebi.ac.uk/ols/ontologies/uberon). Include a distinct list of all anatomy entries from biosample.tsv in the anatomy.tsv table. Refer to the list of Uberon ID mappings for this distinct list of separately exported composition descriptions (Uberon Mapping tab / list). Consider how to add new Uberon IDs, or how to build the Uberon ID mappings upstream based on composition, so that the KF biosample uberon_id_anatomical_site is populated. NOTE: there is an option to incorporate the CFDE build term tables code, which will automatically generate the anatomy.tsv table.

11. C2M2 Submission Generator: Enhance the code to include the Homo sapiens NCBI ID. Add taxonomy_id to subject_role_taxonomy.tsv (all records from subject.tsv, with role_id cfde_subject_role:0 and taxonomy_id NCBI:txid9606). The ncbi_taxonomy.tsv table needs an entry for Homo sapiens. NOTE: there is an option to incorporate the CFDE build term tables code, which will automatically generate the ncbi_taxonomy.tsv table.

12. C2M2 Submission Generator bug fix required: Resolve the issue with orphan records lacking species.

13. C2M2 Submission Generator bug fix required: Replace the hard-coded KF study list in globals.py with a dynamic pull of the studies on the portal. Define and harden the logic for this to ensure that the studies returned are those that are viewable on the Kids First Portal.

14. C2M2 Submission Generator: Harden the code to eliminate redundant manual steps, optimize execution and limit timeouts, and improve logging to capture step progress and the completion status of the export for each file. Since this involves bulk data exports, reexamine whether the Data Service is the best approach or whether direct queries against the Data Service backend database are preferable.

15. C2M2 Submission Generator: Test and refactor the code that determines a given file's type and the EDAM ontology (format::) ID with which the file should be associated. There are cases where files are named [filename].bam.bai, [filename].cram.crai, or [filename].vcf.tbi, but the current code interprets them as .bam, .cram, and .vcf, respectively. Other examples with different file extensions may also exist.

16. C2M2 Submission Generator: Enhance the Data Service and the code to pull accurate md5 hashes for the files into the md5 field of file.tsv.
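
For the file-type item above (the .bam.bai / .cram.crai / .vcf.tbi case), one possible approach is to match on the longest known suffix instead of checking whether an extension appears anywhere in the name. The sketch below is only an illustration under that assumption: KNOWN_EXTENSIONS and match_extension are hypothetical names, the extension list is not the generator's real list, and the lookup from extension to EDAM ID is deliberately left out.

```python
# Hypothetical extension list for illustration; the generator's real list
# is not shown in this issue and would likely contain more entries.
KNOWN_EXTENSIONS = (
    ".bam.bai", ".cram.crai", ".vcf.tbi",
    ".bam", ".cram", ".vcf",
    ".bai", ".crai", ".tbi",
)


def match_extension(filename, known=KNOWN_EXTENSIONS):
    """Return the longest known extension the filename ends with, or None."""
    name = filename.lower()
    candidates = [ext for ext in known if name.endswith(ext)]
    # Prefer the longest suffix so "x.bam.bai" is an index, not a BAM file.
    return max(candidates, key=len) if candidates else None
```

With this, match_extension("sample.bam.bai") yields ".bam.bai" while match_extension("sample.bam") yields ".bam", so index files and the files they index can be given different EDAM format IDs. Names with no known suffix return None and can be logged for review.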

ericwenger-pm commented 3 years ago

@dankolbman I assigned this to you initially; however, I am also adding @chris-s-friedman as the CFDE C2M2 defined resource. We can discuss further how the work on bug fixes and enhancements can be coordinated.