icgc-argo / workflow-roadmap

Roadmap and management for genomic data processing
GNU Affero General Public License v3.0

Collab Transfer: ICGC DCC RNA data to ICGC ARGO #200

Closed akachru-github closed 2 years ago

akachru-github commented 3 years ago
hknahal commented 3 years ago

Summary of available RNA-Seq data to transfer from Collab to ARGO:

Total Donors: 589
Total Size: 15.91 TB

| ICGC Project Code | Project Name | Total Donors | Size | File Format |
|---|---|---|---|---|
| CLLE-ES | Chronic Lymphocytic Leukemia - Spain | 131 | 2.06 TB | BAM and FASTQ |
| MALY-DE | Malignant Lymphoma - Germany | 114 | 3.91 TB | BAM and FASTQ |
| RECA-EU | Renal Cell Cancer - EU/France | 91 | 1.90 TB | BAM |
| LIRI-JP | Liver Cancer - RIKEN, Japan | 67 | 1.98 TB | BAM |
| OV-AU | Ovarian Cancer - Australia | 71 | 3.35 TB | BAM |
| PACA-AU | Pancreatic Cancer - Australia | 78 | 1.57 TB | BAM |
| PACA-CA | Pancreatic Cancer - Canada | 30 | 1.13 TB | FASTQ |
| ESAD-UK | Esophageal Adenocarcinoma - UK | 7 | 12 GB | BAM |

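
As a quick sanity check, the per-project rows can be summed and compared with the stated totals. This is only a sketch: the numbers are copied from the summary above, and the conversion of 12 GB to 0.012 TB assumes decimal units (1 TB = 1000 GB).

```python
# Per-project figures copied from the summary above: (donors, size in TB).
projects = {
    "CLLE-ES": (131, 2.06),
    "MALY-DE": (114, 3.91),
    "RECA-EU": (91, 1.90),
    "LIRI-JP": (67, 1.98),
    "OV-AU": (71, 3.35),
    "PACA-AU": (78, 1.57),
    "PACA-CA": (30, 1.13),
    "ESAD-UK": (7, 0.012),  # 12 GB, assuming 1 TB = 1000 GB
}

# Sum donors and sizes across all projects.
total_donors = sum(donors for donors, _ in projects.values())
total_size_tb = round(sum(size for _, size in projects.values()), 2)

print(total_donors, total_size_tb)
```

The sums agree with the reported 589 donors and 15.91 TB.
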
lindaxiang commented 3 years ago

@hknahal are these ICGC projects all approved to be carried over to ARGO?

hknahal commented 3 years ago

@lindaxiang ESAD-UK, PACA-CA and PACA-AU confirmed their RNA-Seq data is okay to transfer, but I will confirm about the other projects.

hknahal commented 3 years ago

Example metadata for a BAM file in Collab (https://dcc.icgc.org/repositories/files/FI660142):

Metadata about file in Portal: https://dcc.icgc.org/api/v1/repository/files/FI660142
Metadata from XML file: https://dcc.icgc.org/api/v1/ui/collaboratory/metadata/bb1c0351-e974-5cbe-a53d-b35f2cd9d61f
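
For scripting the same lookup across many files, a minimal sketch is below. It assumes the Portal endpoint linked above returns a JSON document and is reachable from where the script runs; the function names `portal_file_url` and `fetch_file_metadata` are hypothetical helpers, not part of any ICGC tooling.

```python
import json
import urllib.request

# Base URL taken from the Portal link above.
PORTAL_FILES_API = "https://dcc.icgc.org/api/v1/repository/files"


def portal_file_url(file_id: str) -> str:
    """Build the Portal API URL for a file ID such as "FI660142"."""
    return f"{PORTAL_FILES_API}/{file_id}"


def fetch_file_metadata(file_id: str, timeout: float = 30.0) -> dict:
    """Fetch and decode the JSON metadata for one file (requires network access)."""
    with urllib.request.urlopen(portal_file_url(file_id), timeout=timeout) as response:
        return json.load(response)


print(portal_file_url("FI660142"))
```

Calling `fetch_file_metadata("FI660142")` would return the decoded document as a dictionary; the layout of that document is not assumed here.
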

Using ARGO RNA-Seq Metadata Fields Definition:

| Field | Value | Comment |
|---|---|---|
| studyId | PACA-AU | |
| analysisType | "rna_sequencing_experiment" by default | |
| submitterSampleId | 8012195 | |
| matchedNormalSubmitterSampleId | Missing | It's missing in the Portal too ("Matched Control Sample ID" is empty in "Donor" table at https://dcc.icgc.org/repositories/files/FI660142) |
| sampleType | Assuming it would be "Total RNA" | |
| submitterSpecimenId | 8012195 | |
| tumourNormalDesignation | Tumour | |
| specimenTissueSource | Solid Tissue | |
| specimenType | "specimenType" field is incorrect in Portal metadata file | The metadata (https://dcc.icgc.org/api/v1/repository/files/FI660142) says "Primary tumour", but it should actually say "Primary tumour - solid tissue" |
| gender | Missing in metadata | Would need to use Portal API to get gender information |
| submitterDonorId | ICGC_0007 | |
| fileName | PCAWG.319bd156-49a9-4341-a9f7-ca68b02d3ab2.TopHat2.v1.bam | |
| fileSize | 12483121728 | in Portal metadata file |
| fileMd5sum | 1498848054 | in Portal metadata file |
| fileType | BAM | in Portal metadata file |
| fileAccess | controlled | "access" field in Portal metadata file |
| dataType | "Submitted Reads" by default | |
| data_category | "Sequencing Reads" by default | |
| experimental_strategy | RNA-Seq | |
| library_isolation_protocol | Missing | |
| library_preparation_kit | Missing | |
| library_stranded | Missing | |
| rin | Missing | |
| dv200 | Missing | |
| spike_ins_included | Missing | |
| spike_ins_fasta | Missing | |
| spike_ins_concentration | Missing | |
| platform | ILLUMINA | |
| platform_model | Illumina HiSeq 2000 | "Platform" in XML metadata |
| sequencing_center | QCMG | "EXPERIMENT center_name="QCMG"" in XML metadata |
| sequencing_date | 2015-09-20T04:44:19 | "analysis_date" in XML metadata |
| submitter_sequencing_experiment_id | | |
| read_group_count | 4 | |
| file_r1 | | |
| file_r2 | | |
| insert_size | Missing | |
| is_paired_end | Not explicitly stored anywhere but can tell from @PG it is paired | |

Read group (@RG) lines from the BAM header:
@RG ID:QCMG:a54376a4-feee-11e4-a2c2-a013b0e7fe91:130723_7001407_0116_AC2AEVACXX.lane_5.AGTTCC.2 130723_7001407_0116_AC2AEVACXX.lane_5.AGTTCC.1  LB:RNA-Seq:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_5.AGTTCC:RNA-Seq:QCMG:Illumina TruSeq for Library_20130620_T   PL:ILLUMINA PM:Illumina HiSeq 2000  PU:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_5.AGTTCC   SM:8012195
@RG ID:QCMG:55909fb4-ff9e-11e4-afd3-9c6cddfbf094:130723_7001407_0116_AC2AEVACXX.lane_8.AGTTCC.2 130723_7001407_0116_AC2AEVACXX.lane_8.AGTTCC.1  LB:RNA-Seq:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_8.AGTTCC:RNA-Seq:QCMG:Illumina TruSeq for Library_20130620_T   PL:ILLUMINA PM:Illumina HiSeq 2000  PU:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_8.AGTTCC   SM:8012195
@RG ID:QCMG:15787c0c-fef0-11e4-8e24-e504abb0f656:130723_7001407_0116_AC2AEVACXX.lane_6.AGTTCC.2 130723_7001407_0116_AC2AEVACXX.lane_6.AGTTCC.1  LB:RNA-Seq:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_6.AGTTCC:RNA-Seq:QCMG:Illumina TruSeq for Library_20130620_T   PL:ILLUMINA PM:Illumina HiSeq 2000  PU:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_6.AGTTCC   SM:8012195
@RG ID:QCMG:af31b27c-fef1-11e4-b1af-e17eb5105e18:130723_7001407_0116_AC2AEVACXX.lane_7.AGTTCC.2 130723_7001407_0116_AC2AEVACXX.lane_7.AGTTCC.1  LB:RNA-Seq:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_7.AGTTCC:RNA-Seq:QCMG:Illumina TruSeq for Library_20130620_T   PL:ILLUMINA PM:Illumina HiSeq 2000  PU:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_7.AGTTCC   SM:8012195

| Field | Value |
|---|---|
| library_name | Exists in "LB" tag in header above |
| platform_unit | Exists in "PU" tag in header above |
| read_length_r1 | |
| read_length_r2 | |
| sample_barcode | Missing |
| read_group_id_in_bam | Exists in "@RG" tag in header |
| submitter_read_group_id | |

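
The read-group fields that exist in the header (ID, LB, PU) could be pulled out programmatically. The sketch below assumes standard tab-delimited SAM header lines; the example header is a simplified, hypothetical stand-in for the real @RG lines, and `parse_rg_line` and `read_group_fields` are illustrative names.

```python
def parse_rg_line(line):
    """Parse one tab-delimited SAM @RG header line into a {tag: value} dict."""
    fields = line.rstrip("\r\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError(f"not an @RG line: {line!r}")
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only, since values may contain colons.
        tag, sep, value = field.partition(":")
        if sep:
            tags[tag] = value
    return tags


def read_group_fields(header_lines):
    """Map each @RG line to the read-group fields listed in the table above."""
    groups = []
    for line in header_lines:
        if line.startswith("@RG\t"):
            tags = parse_rg_line(line)
            groups.append({
                "read_group_id_in_bam": tags.get("ID"),
                "library_name": tags.get("LB"),
                "platform_unit": tags.get("PU"),
            })
    return groups


# Simplified, hypothetical header (real IDs and library names are much longer).
example_header = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@RG\tID:QCMG:rg1\tLB:lib1\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:run1.lane_5\tSM:8012195",
    "@RG\tID:QCMG:rg2\tLB:lib1\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:run1.lane_6\tSM:8012195",
]

groups = read_group_fields(example_header)
read_group_count = len(groups)  # one entry per @RG line
print(read_group_count, groups[0])
```

Counting the parsed entries also gives `read_group_count`, which is how the value 4 can be confirmed for the file above.
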
hknahal commented 3 years ago

Example metadata for a BAM file in Collab (https://dcc.icgc.org/repositories/files/FI803956):

Metadata about file in Portal: https://dcc.icgc.org/api/v1/repository/files/FI803956
Metadata from XML file: https://dcc.icgc.org/api/v1/ui/collaboratory/metadata/bb1c0351-e974-5cbe-a53d-b35f2cd9d61f

Using ARGO RNA-Seq Metadata Fields Definition:

| Field | Value | Comment |
|---|---|---|
| studyId | PACA-CA | |
| analysisType | "rna_sequencing_experiment" by default | |
| submitterSampleId | MPCC_0008_Pa_C | |
| matchedNormalSubmitterSampleId | Missing | It's missing in the Portal too ("Matched Control Sample ID" is empty in "Donor" table at https://dcc.icgc.org/repositories/files/FI803956) |
| sampleType | Assuming it would be "Total RNA" | |
| submitterSpecimenId | MPCC_0008_Pa_C | |
| tumourNormalDesignation | Tumour | |
| specimenTissueSource | Other | |
| specimenType | Cell line - derived from tumour | |
| gender | Missing in metadata | Would need to use Portal API to get gender information |
| submitterDonorId | ICGC_0007 | |
| fileName | 00e67dcfa0019557073eb0fbf84f8156.SWID_218778_MPCC_0008_Pa_C_PE_263_MR_120529_SN7001205_0082_BC0W00ACXX_NoIndex_L001_R2_001.fastq.gz | |
| fileSize | 21330847733 | in Portal metadata file |
| fileMd5sum | 00e67dcfa0019557073eb0fbf84f8156 | in Portal metadata file |
| fileType | FASTQ | in Portal metadata file |
| fileAccess | controlled | "access" field in Portal metadata file |
| dataType | "Submitted Reads" by default | |
| data_category | "Sequencing Reads" by default | |
| experimental_strategy | RNA-Seq | |
| library_isolation_protocol | Missing | |
| library_preparation_kit | Missing | |
| library_stranded | Missing | |
| rin | Missing | |
| dv200 | Missing | |
| spike_ins_included | Missing | |
| spike_ins_fasta | Missing | |
| spike_ins_concentration | Missing | |
| platform | ILLUMINA | |
| platform_model | Illumina HiSeq 2500 | "Platform" in XML metadata |
| sequencing_center | OICR_ICGC | "EXPERIMENT center_name="OICR_ICGC"" in XML metadata |
| sequencing_date | 2012-05-29T00:00:00 | "analysis_date" in XML metadata |
| submitter_sequencing_experiment_id | | |
| read_group_count | | |
| file_r1 | | |
| file_r2 | | |
| insert_size | 263 | Indicated by PAIRED NOMINAL LENGTH tag in https://dcc.icgc.org/api/v1/ui/collaboratory/metadata/bb52bea7-1f50-5651-9c4d-a56c59aed4b0 |
| is_paired_end | Yes | Indicated by PAIRED NOMINAL LENGTH tag in https://dcc.icgc.org/api/v1/ui/collaboratory/metadata/bb52bea7-1f50-5651-9c4d-a56c59aed4b0 |

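
The `insert_size` and `is_paired_end` values could be derived from the XML automatically. The sketch below assumes the metadata follows the SRA experiment layout, where a `PAIRED` element with a `NOMINAL_LENGTH` attribute sits under `LIBRARY_LAYOUT`; the XML fragment is a hypothetical, trimmed-down example rather than the real document, and `library_layout` is an illustrative name.

```python
import xml.etree.ElementTree as ET


def library_layout(xml_text):
    """Return (is_paired_end, insert_size) from an SRA-style experiment XML string.

    is_paired_end is None when no LIBRARY_LAYOUT element is found, and
    insert_size is None when no nominal length is recorded.
    """
    root = ET.fromstring(xml_text)
    layout = root.find(".//LIBRARY_LAYOUT")
    if layout is None:
        return None, None
    paired = layout.find("PAIRED")
    if paired is None:  # compare with None: childless elements are falsy
        return False, None
    nominal = paired.get("NOMINAL_LENGTH")
    insert_size = int(float(nominal)) if nominal is not None else None
    return True, insert_size


# Hypothetical fragment, assuming the SRA experiment schema.
example_xml = """
<EXPERIMENT_SET>
  <EXPERIMENT center_name="OICR_ICGC">
    <DESIGN>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_LAYOUT>
          <PAIRED NOMINAL_LENGTH="263"/>
        </LIBRARY_LAYOUT>
      </LIBRARY_DESCRIPTOR>
    </DESIGN>
  </EXPERIMENT>
</EXPERIMENT_SET>
"""

is_paired_end, insert_size = library_layout(example_xml)
print(is_paired_end, insert_size)
```

Returning `None` when the layout element is absent keeps "unknown" distinct from "single-end", which matters for BAM files like the PACA-AU example where the layout is not explicitly stored.
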
akachru-github commented 2 years ago

Blocked on DACO access.

akachru-github commented 2 years ago

DACO Access has been granted back to Hardeep. This is now unblocked.

hknahal commented 2 years ago

Preliminary PACA-AU RNA-Seq intermediate SONG payloads at: https://github.com/icgc-argo/argo-meta/tree/paca-au_rna-seq/icgc_song_payloads/APGI-AU/RNA-Seq/batch1

Summary of issues: https://github.com/icgc-argo/argo-meta/issues/37

akachru-github commented 2 years ago

Data transfer to Collab is complete. This ticket can be closed. The next step is to import the data into SONG.