kids-first / kf-cbioportal-etl

Repository of scripts used to convert dataservice metadata and DRC processed data into cbio portal loading format.

Outline on ETL for converting data from CAVATICA and Data Warehouse to PedcBioportal format

In general, we are creating upload packages, converting our data and metadata to satisfy the requirements outlined here. Further general loading notes can be found in this Notion page. See below for special cases, like publications or collaborative efforts.

I have everything and I know what I am doing

The steps below assume you have already created the necessary tables from dbt

  1. Run commands as outlined in scripts/get_study_metadata.py. Copy/move those files to the cBio loader ec2 instance
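
    For example, to copy the exported files over to the loader instance (a sketch; the file glob, host name, and destination directory are placeholders to adjust for your setup):

    # copy the warehouse exports to the cBio loader ec2 instance
    scp *.txt ubuntu@<cbio-loader-ec2>:~/pbta_all_load/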

  2. Recommended, but not required: run scripts/diff_studies.py. It will give a summary of metadata changes between what is currently loaded and what you plan to load, to potentially flag any suspicious changes

  3. Copy over the appropriate aws account key and download files. Example using pbta_all study:

    python3 scripts/get_files_from_manifest.py -m cbtn_genomics_file_manifest.txt,pnoc_genomics_file_manifest.txt,x01_genomics_file_manifest.txt,dgd_genomics_file_manifest.txt -f RSEM_gene,annofuse_filtered_fusions_tsv,annotated_public_outputs,ctrlfreec_pval,ctrlfreec_info,ctrlfreec_bam_seg,annotated_public -t aws_buckets_key_pairs.txt -s turbo -c cbio_file_name_id.txt -a

    aws_buckets_key_pairs.txt (the -t arg above) is a headerless tsv file with bucket name and aws profile name pairs
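
    A minimal sketch of that file, with made-up bucket and profile names, tab-separated:

    example-genomics-bucket-1	saml
    example-genomics-bucket-2	default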

  4. Copy and edit REFS/data_processing_config.json and REFS/pbta_all_case_meta_config.json as needed
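
    For example, assuming the ETL repo is cloned at ~/tools/kf-cbioportal-etl (the path used in the reference config) and you are working from your study load directory:

    cp ~/tools/kf-cbioportal-etl/REFS/data_processing_config.json .
    cp ~/tools/kf-cbioportal-etl/REFS/pbta_all_case_meta_config.json .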

  5. Run the pipeline script - ignore the manifest section; it is a placeholder for a better download method to come

    scripts/genomics_file_cbio_package_build.py -t cbio_file_name_id.txt -c pbta_all_case_meta_config.json -d data_processing_config.json -f both
  6. Check logs and outputs for errors, especially validator.errs and validator.out (assuming everything else went fine), to see if any ERROR popped up that would prevent the package from loading properly once it is pushed to the bucket and the Jenkins import job is run
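
    A quick way to scan for blocking issues once the run finishes (a sketch, assuming the validator logs are in the current directory):

    grep ERROR validator.errs validator.out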

Final output example

In the end, if you named your output dir processed, you'll end up with this example output from the pbta_all study:

processed
└── pbta_all
    ├── case_lists
    │   ├── cases_3way_complete.txt
    │   ├── cases_RNA_Seq_v2_mRNA.txt
    │   ├── cases_all.txt
    │   ├── cases_cna.txt
    │   ├── cases_cnaseq.txt
    │   ├── cases_sequenced.txt
    │   └── cases_sv.txt
    ├── data_CNA.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/merged_cnvs/pbta_all.discrete_cnvs.txt
    ├── data_clinical_patient.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/datasheets/data_clinical_patient.txt
    ├── data_clinical_sample.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/datasheets/data_clinical_sample.txt
    ├── data_clinical_timeline_clinical_event.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/datasheets/data_clinical_timeline_clinical_event.txt
    ├── data_clinical_timeline_imaging.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/datasheets/data_clinical_timeline_imaging.txt
    ├── data_clinical_timeline_specimen.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/datasheets/data_clinical_timeline_specimen.txt
    ├── data_clinical_timeline_surgery.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/datasheets/data_clinical_timeline_surgery.txt
    ├── data_clinical_timeline_treatment.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/datasheets/data_clinical_timeline_treatment.txt
    ├── data_cna.seg.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/merged_cnvs/pbta_all.merged_seg.txt
    ├── data_linear_CNA.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/merged_cnvs/pbta_all.predicted_cnv.txt
    ├── data_mutations_extended.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/merged_mafs/pbta_all.maf
    ├── data_rna_seq_v2_mrna.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/merged_rsem/pbta_all.rsem_merged.txt
    ├── data_rna_seq_v2_mrna_median_Zscores.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/merged_rsem/pbta_all.rsem_merged_zscore.txt
    ├── data_sv.txt -> /home/ubuntu/volume/PORTAL_LOADS/pbta_all/merged_fusion/pbta_all.fusions.txt
    ├── meta_CNA.txt
    ├── meta_clinical_patient.txt
    ├── meta_clinical_sample.txt
    ├── meta_clinical_timeline_clinical_event.txt
    ├── meta_clinical_timeline_imaging.txt
    ├── meta_clinical_timeline_specimen.txt
    ├── meta_clinical_timeline_surgery.txt
    ├── meta_clinical_timeline_treatment.txt
    ├── meta_cna.seg.txt
    ├── meta_linear_CNA.txt
    ├── meta_mutations_extended.txt
    ├── meta_rna_seq_v2_mrna.txt
    ├── meta_rna_seq_v2_mrna_median_Zscores.txt
    ├── meta_study.txt
    └── meta_sv.txt

Note! Most other studies won't have a timeline set of files.

Details

Use this section as a reference in case your overconfidence got the best of you

REFS

In case you want to use different reference inputs...

Software Prerequisites

Starting file inputs

Most starting files are exported from the D3b Warehouse. An example of raw file exports can be found in scripts/export_clinical.sh; however, we now use scripts/get_study_metadata.py, a python wrapper script that leverages the x_case_meta_config.json, to get the files for each study.

scripts/get_study_metadata.py

usage: get_study_metadata.py [-h] [-d DB_INI] [-p PROFILE] [-c CONFIG_FILE] [-r REF_DIR]

Pull clinical data and genomics file etl support from D3b data warehouse.

optional arguments:
  -h, --help            show this help message and exit
  -d DB_INI, --db-ini DB_INI
                        Database config file - formatting like aws or sbg creds
  -p PROFILE, --profile PROFILE
                        ini profile name
  -c CONFIG_FILE, --config CONFIG_FILE
                        json config file with meta information; see REFS/pbta_all_case_meta_config.json example
  -r REF_DIR, --ref-dir REF_DIR
                        dir name containing template data_clinical* header files
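
Example invocation (a sketch; the ini file name and profile below are placeholders for your own database credentials):

    python3 scripts/get_study_metadata.py -d db_creds.ini -p postgres -c REFS/pbta_all_case_meta_config.json -r REFS/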

From D3b Warehouse

- Genomic files manifest

This is an s3 manifest of all files to be loaded onto the portal. It is generally created by Bix-Ops and loaded into the D3b Warehouse. If the study is combining a KF/PBTA study with DGD, you may need to download a second manifest.

- Data clinical sample sheet

This is the cBioportal-formatted sample sheet that follows guidelines from here

- Data clinical patient sheet

This is the cBioportal-formatted patient sheet that follows guidelines from here

- Genomics metadata file

Seemingly redundant, this file contains the file locations, BS IDs, file types, and cBio-formatted sample IDs of all inputs. It simplifies the process and integrates better with the downstream tools. This is the file that goes in as the -t arg in all of the data-collating tools.

- Sequencing center info resource file

DEPRECATED and will be removed from future releases. This is a simple file with BS IDs, sequencing center IDs, and locations. It is necessary to patch in a required field for the fusion data.

- Data gene matrix - OPTIONAL

This is only required if you have a custom panel - like the DGD does

User-edited

- Data processing config file

This is a json-formatted file that has tool paths, reference paths, and run-time params. An example is given in REFS/data_processing_config.json. This section here:

"file_loc_defs": {
    "_comment": "edit the values based on existing/anticipated source file locations, relative to working directory of the script being run",
    "mafs": {
      "kf": "annotated_public_outputs",
      "header": "/home/ubuntu/tools/kf-cbioportal-etl/REFS/maf_KF_CONSENSUS.txt"
    },
    "cnvs": {
      "pval": "ctrlfreec_pval",
      "info": "ctrlfreec_info",
      "seg": "ctrlfreec_bam_seg"
    },
    "rsem": "RSEM_gene",
    "fusion": "annofuse_filtered_fusions_tsv",
    "fusion_sq_file": ""
  },
  "dl_file_type_list": ["RSEM_gene","annofuse_filtered_fusions_tsv","annotated_public_outputs",
    "ctrlfreec_pval","ctrlfreec_info","ctrlfreec_bam_seg", "DGD_MAF"],

will likely need the most editing based on your inputs, and should only need to be updated if something changes after the initial load.

- Metadata processing config file

This is a json config file with file descriptions and case lists required by cBioportal. An example is given in REFS/pbta_all_case_meta_config.json. Within this file is a _doc section with a decent explanation of the file format and layout. Be sure to review all data types to be loaded by reviewing all meta_* entries to see if they match the incoming data. Personalized edits will most likely be needed in the study-specific fields of that config.

Pipeline script

After downloading the genomic files and the files above, and editing the config files as needed, this script generates and validates the cBioportal load package.

scripts/get_files_from_manifest.py

Currently, file locations are still too volatile to trust making the download part of the pipeline. Using various combinations of bucket and sbg file ID pulls will eventually get you everything.

usage: get_files_from_manifest.py [-h] [-m MANIFEST] [-f FTS] [-p PROFILE] [-s SBG_PROFILE] [-c CBIO] [-a] [-d]

Get all files for a project.

optional arguments:
  -h, --help            show this help message and exit
  -m MANIFEST, --manifest-list MANIFEST
                        csv list of of genomic file location manifests
  -f FTS, --file-types FTS
                        csv list of workflow types to download
  -p PROFILE, --profile PROFILE
                        aws profile name. Leave blank if using sbg instead
  -s SBG_PROFILE, --sbg-profile SBG_PROFILE
                        sbg profile name. Leave blank if using AWS instead
  -c CBIO, --cbio CBIO  Add cbio manifest to limit downloads
  -a, --active-only     Set to grab only active files. Recommended.
  -d, --debug           Just output manifest subset to see what would be grabbed
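
For example, to preview what would be grabbed from a single manifest before downloading anything (a sketch reusing the manifest, file-type, and profile names from the pbta_all example above):

    python3 scripts/get_files_from_manifest.py -m cbtn_genomics_file_manifest.txt -f RSEM_gene,annofuse_filtered_fusions_tsv -s turbo -c cbio_file_name_id.txt -a -d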

scripts/genomics_file_cbio_package_build.py

usage: genomics_file_cbio_package_build.py [-h] [-t TABLE] [-m MANIFEST] [-c CBIO_CONFIG] [-d DATA_CONFIG] [-f [{both,kf,dgd}]] [-l]

Download files (if needed), collate genomic files, organize load package.

optional arguments:
  -h, --help            show this help message and exit
  -t TABLE, --table TABLE
                        Table with cbio project, kf bs ids, cbio IDs, and file names
  -m MANIFEST, --manifest MANIFEST
                        Download file manifest, if needed
  -c CBIO_CONFIG, --cbio-config CBIO_CONFIG
                        cbio case and meta config file
  -d DATA_CONFIG, --data-config DATA_CONFIG
                        json config file with data types and data locations
  -f [{both,kf,dgd}], --dgd-status [{both,kf,dgd}]
                        Flag to determine load will have pbta/kf + dgd(both), kf/pbta only(kf), dgd-only(dgd)
  -l, --legacy          If set, will run legacy fusion output

Check the pipeline log output for any errors that might have occurred.

Upload the final packages

Upload all of the directories named as study short names to s3://kf-cbioportal-studies/public/. You may need to set and/or copy your aws saml key before uploading. Next, edit the file in that bucket called importStudies.txt, located at s3://kf-cbioportal-studies/public/importStudies.txt, with the names of all of the studies you wish to update/upload. Lastly, follow the directions referenced in Software Prerequisites to load the study.
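
A typical upload might look like the sketch below, assuming your output dir is named processed, your aws profile is named saml, and importStudies.txt holds one study short name per line (all assumptions to verify for your setup):

    # push the load package for the study
    aws s3 sync processed/pbta_all s3://kf-cbioportal-studies/public/pbta_all --profile saml
    # add the study to the import list and push it back
    aws s3 cp s3://kf-cbioportal-studies/public/importStudies.txt . --profile saml
    echo "pbta_all" >> importStudies.txt
    aws s3 cp importStudies.txt s3://kf-cbioportal-studies/public/importStudies.txt --profile saml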

Congratulations, you did it!

Collaborative and Publication Workflows

These are highly specialized cases in which all or most of the data come from a third party, and therefore require specific case-by-case protocols.

OpenPedCan

See OpenPedCan README

OpenPBTA

See OpenPBTA README