Summary of available RNA-Seq data to transfer from Collab to ARGO:

Total Donors: 589
Total Size: 15.91 TB

ICGC Project Code | Project Name | Total Donors | Size | File Format |
---|---|---|---|---|
CLLE-ES | Chronic Lymphocytic Leukemia - Spain | 131 | 2.06 TB | BAM and FASTQ |
MALY-DE | Malignant Lymphoma - Germany | 114 | 3.91 TB | BAM and FASTQ |
RECA-EU | Renal Cell Cancer - EU/France | 91 | 1.90 TB | BAM |
LIRI-JP | Liver Cancer - RIKEN, Japan | 67 | 1.98 TB | BAM |
OV-AU | Ovarian Cancer - Australia | 71 | 3.35 TB | BAM |
PACA-AU | Pancreatic Cancer - Australia | 78 | 1.57 TB | BAM |
PACA-CA | Pancreatic Cancer - Canada | 30 | 1.13 TB | FASTQ |
ESAD-UK | Esophageal Adenocarcinoma - UK | 7 | 12 GB | BAM |
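As a quick sanity check on the totals quoted above, the per-project rows can be summed. This is only a sketch: the rows are copied by hand from the table, and the 12 GB for ESAD-UK is converted assuming 1 TB = 1000 GB.

```python
# Rows copied from the summary table: (project code, donors, size in TB).
projects = [
    ("CLLE-ES", 131, 2.06),
    ("MALY-DE", 114, 3.91),
    ("RECA-EU", 91, 1.90),
    ("LIRI-JP", 67, 1.98),
    ("OV-AU", 71, 3.35),
    ("PACA-AU", 78, 1.57),
    ("PACA-CA", 30, 1.13),
    ("ESAD-UK", 7, 0.012),  # 12 GB, assuming 1 TB = 1000 GB
]

total_donors = sum(donors for _, donors, _ in projects)
total_size_tb = round(sum(size for _, _, size in projects), 2)

print(f"Total donors: {total_donors}")
print(f"Total size: {total_size_tb} TB")
```

Both sums agree with the stated totals (589 donors, 15.91 TB).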
@hknahal are these ICGC projects all approved to be carried over to ARGO?
@lindaxiang ESAD-UK, PACA-CA and PACA-AU confirmed their RNA-Seq data is okay to transfer, but I will confirm about the other projects.
Example metadata for a BAM file in Collab (https://dcc.icgc.org/repositories/files/FI660142):
Metadata about file in Portal: https://dcc.icgc.org/api/v1/repository/files/FI660142
Metadata from XML file: https://dcc.icgc.org/api/v1/ui/collaboratory/metadata/bb1c0351-e974-5cbe-a53d-b35f2cd9d61f
Using ARGO RNA-Seq Metadata Fields Definition:
Field | Value | Comment |
---|---|---|
studyId | PACA-AU | |
analysisType | "rna_sequencing_experiment" by default | |
submitterSampleId | 8012195 | |
matchedNormalSubmitterSampleId | Missing | It's missing in the Portal too ("Matched Control Sample ID" is empty in "Donor" table at https://dcc.icgc.org/repositories/files/FI660142) |
sampleType | Assuming it would be "Total RNA" | |
submitterSpecimenId | 8012195 | |
tumourNormalDesignation | Tumour | |
specimenTissueSource | Solid Tissue | |
specimenType | Primary tumour - solid tissue | The "specimenType" field is incorrect in the Portal metadata file (https://dcc.icgc.org/api/v1/repository/files/FI660142): it says "Primary tumour", but it should say "Primary tumour - solid tissue" |
gender | Missing in metadata | Would need to use Portal API to get gender information |
submitterDonorId | ICGC_0007 | |
fileName | PCAWG.319bd156-49a9-4341-a9f7-ca68b02d3ab2.TopHat2.v1.bam | |
fileSize | 12483121728 | in Portal metadata file |
fileMd5sum | 1498848054 | in Portal metadata file |
fileType | BAM | in Portal metadata file |
fileAccess | controlled | "access" field in Portal metadata file |
dataType | "Submitted Reads" by default | |
data_category | "Sequencing Reads" by default | |
experimental_strategy | RNA-Seq | |
library_isolation_protocol | Missing | |
library_preparation_kit | Missing | |
library_stranded | Missing | |
rin | Missing | |
dv200 | Missing | |
spike_ins_included | Missing | |
spike_ins_fasta | Missing | |
spike_ins_concentration | Missing | |
platform | ILLUMINA | |
platform_model | Illumina HiSeq 2000 | "Platform" in XML metadata |
sequencing_center | QCMG | "EXPERIMENT center_name="QCMG"" in XML metadata |
sequencing_date | 2015-09-20T04:44:19 | "analysis_date" in XML metadata |
submitter_sequencing_experiment_id | | |
read_group_count | 4 | |
file_r1 | ||
file_r2 | ||
insert_size | Missing | |
is_paired_end | Yes | Not explicitly stored anywhere, but the @PG line in the BAM header shows the data is paired |

Read group (@RG) lines from the BAM header:
@RG ID:QCMG:a54376a4-feee-11e4-a2c2-a013b0e7fe91:130723_7001407_0116_AC2AEVACXX.lane_5.AGTTCC.2 130723_7001407_0116_AC2AEVACXX.lane_5.AGTTCC.1 LB:RNA-Seq:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_5.AGTTCC:RNA-Seq:QCMG:Illumina TruSeq for Library_20130620_T PL:ILLUMINA PM:Illumina HiSeq 2000 PU:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_5.AGTTCC SM:8012195
@RG ID:QCMG:55909fb4-ff9e-11e4-afd3-9c6cddfbf094:130723_7001407_0116_AC2AEVACXX.lane_8.AGTTCC.2 130723_7001407_0116_AC2AEVACXX.lane_8.AGTTCC.1 LB:RNA-Seq:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_8.AGTTCC:RNA-Seq:QCMG:Illumina TruSeq for Library_20130620_T PL:ILLUMINA PM:Illumina HiSeq 2000 PU:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_8.AGTTCC SM:8012195
@RG ID:QCMG:15787c0c-fef0-11e4-8e24-e504abb0f656:130723_7001407_0116_AC2AEVACXX.lane_6.AGTTCC.2 130723_7001407_0116_AC2AEVACXX.lane_6.AGTTCC.1 LB:RNA-Seq:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_6.AGTTCC:RNA-Seq:QCMG:Illumina TruSeq for Library_20130620_T PL:ILLUMINA PM:Illumina HiSeq 2000 PU:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_6.AGTTCC SM:8012195
@RG ID:QCMG:af31b27c-fef1-11e4-b1af-e17eb5105e18:130723_7001407_0116_AC2AEVACXX.lane_7.AGTTCC.2 130723_7001407_0116_AC2AEVACXX.lane_7.AGTTCC.1 LB:RNA-Seq:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_7.AGTTCC:RNA-Seq:QCMG:Illumina TruSeq for Library_20130620_T PL:ILLUMINA PM:Illumina HiSeq 2000 PU:QCMG:QCMG:130723_7001407_0116_AC2AEVACXX.lane_7.AGTTCC SM:8012195
Field | Value |
---|---|
library_name | Exists in "LB" tag in header above |
platform_unit | Exists in "PU" tag in header above |
read_length_r1 | |
read_length_r2 | |
sample_barcode | Missing |
read_group_id_in_bam | Exists in "@RG" tag in header |
submitter_read_group_id | |
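The read-group fields in the table above could be pulled out of the BAM header along these lines. This is a minimal sketch, assuming the header has already been dumped to text (for example with `samtools view -H`) and that fields are tab-separated as in the SAM format. The function name `parse_read_groups` and the shortened example header are hypothetical; the real @RG values are much longer.

```python
def parse_read_groups(header_text):
    """Return one dict per @RG line, keyed by the two-letter SAM tag."""
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        fields = {}
        for item in line.split("\t")[1:]:
            # Split on the first colon only, since values may contain colons.
            tag, _, value = item.partition(":")
            fields[tag] = value
        read_groups.append(fields)
    return read_groups


# Hypothetical, shortened header for illustration only.
example_header = "\n".join([
    "@HD\tVN:1.4\tSO:coordinate",
    "@RG\tID:rg1\tLB:lib1\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:lane_5\tSM:8012195",
    "@RG\tID:rg2\tLB:lib1\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:lane_8\tSM:8012195",
])

read_groups = parse_read_groups(example_header)
read_group_count = len(read_groups)

for rg in read_groups:
    print(rg["ID"], rg["LB"], rg["PU"], rg["SM"])
```

With this mapping, `library_name` would come from `LB`, `platform_unit` from `PU`, `read_group_id_in_bam` from `ID`, and `read_group_count` from the number of @RG lines.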
Example metadata for a BAM file in Collab (https://dcc.icgc.org/repositories/files/FI803956):
Metadata about file in Portal: https://dcc.icgc.org/api/v1/repository/files/FI803956
Metadata from XML file: https://dcc.icgc.org/api/v1/ui/collaboratory/metadata/bb1c0351-e974-5cbe-a53d-b35f2cd9d61f
Using ARGO RNA-Seq Metadata Fields Definition:
Field | Value | Comment |
---|---|---|
studyId | PACA-CA | |
analysisType | "rna_sequencing_experiment" by default | |
submitterSampleId | MPCC_0008_Pa_C | |
matchedNormalSubmitterSampleId | Missing | It's missing in the Portal too ("Matched Control Sample ID" is empty in "Donor" table at https://dcc.icgc.org/repositories/files/FI803956) |
sampleType | Assuming it would be "Total RNA" | |
submitterSpecimenId | MPCC_0008_Pa_C | |
tumourNormalDesignation | Tumour | |
specimenTissueSource | Other | |
specimenType | Cell line - derived from tumour | |
gender | Missing in metadata | Would need to use Portal API to get gender information |
submitterDonorId | ICGC_0007 | |
fileName | 00e67dcfa0019557073eb0fbf84f8156.SWID_218778_MPCC_0008_Pa_C_PE_263_MR_120529_SN7001205_0082_BC0W00ACXX_NoIndex_L001_R2_001.fastq.gz | |
fileSize | 21330847733 | in Portal metadata file |
fileMd5sum | 00e67dcfa0019557073eb0fbf84f8156 | in Portal metadata file |
fileType | FASTQ | in Portal metadata file |
fileAccess | controlled | "access" field in Portal metadata file |
dataType | "Submitted Reads" by default | |
data_category | "Sequencing Reads" by default | |
experimental_strategy | RNA-Seq | |
library_isolation_protocol | Missing | |
library_preparation_kit | Missing | |
library_stranded | Missing | |
rin | Missing | |
dv200 | Missing | |
spike_ins_included | Missing | |
spike_ins_fasta | Missing | |
spike_ins_concentration | Missing | |
platform | ILLUMINA | |
platform_model | Illumina HiSeq 2500 | "Platform" in XML metadata |
sequencing_center | OICR_ICGC | "EXPERIMENT center_name="OICR_ICGC"" in XML metadata |
sequencing_date | 2012-05-29T00:00:00 | "analysis_date" in XML metadata |
submitter_sequencing_experiment_id | | |
read_group_count | | |
file_r1 | | |
file_r2 | | |
insert_size | 263 | Indicated by PAIRED NOMINAL LENGTH tag in https://dcc.icgc.org/api/v1/ui/collaboratory/metadata/bb52bea7-1f50-5651-9c4d-a56c59aed4b0 |
is_paired_end | Yes | Indicated by PAIRED NOMINAL LENGTH tag in https://dcc.icgc.org/api/v1/ui/collaboratory/metadata/bb52bea7-1f50-5651-9c4d-a56c59aed4b0 |
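
For the last two rows, `is_paired_end` and `insert_size` could be derived from the XML metadata roughly as follows. This is a sketch under assumptions: the XML fragment is a hypothetical, trimmed-down stand-in for the real metadata document, the attribute is assumed to be spelled `NOMINAL_LENGTH` on a `PAIRED` element inside `LIBRARY_LAYOUT` (as in SRA-style experiment XML), and `library_layout` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real metadata XML contains many more elements.
example_xml = """
<EXPERIMENT center_name="OICR_ICGC">
  <DESIGN>
    <LIBRARY_DESCRIPTOR>
      <LIBRARY_LAYOUT>
        <PAIRED NOMINAL_LENGTH="263"/>
      </LIBRARY_LAYOUT>
    </LIBRARY_DESCRIPTOR>
  </DESIGN>
</EXPERIMENT>
"""


def library_layout(xml_text):
    """Derive is_paired_end and insert_size from a LIBRARY_LAYOUT element."""
    root = ET.fromstring(xml_text.strip())
    paired = root.find(".//LIBRARY_LAYOUT/PAIRED")
    if paired is not None:
        nominal = paired.get("NOMINAL_LENGTH")
        return {
            "is_paired_end": True,
            "insert_size": int(nominal) if nominal else None,
        }
    if root.find(".//LIBRARY_LAYOUT/SINGLE") is not None:
        return {"is_paired_end": False, "insert_size": None}
    # Layout not recorded: leave both fields unknown rather than guess.
    return {"is_paired_end": None, "insert_size": None}


layout = library_layout(example_xml)
print(layout)
```

Returning `None` when no layout element is present keeps "not recorded" distinct from "single-end", which matters for files like the PACA-AU BAM above where pairing is not stored explicitly.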
Blocked on DACO access.
DACO access has been restored for Hardeep, so this is now unblocked.
Preliminary PACA-AU RNA-Seq intermediate SONG payloads at: https://github.com/icgc-argo/argo-meta/tree/paca-au_rna-seq/icgc_song_payloads/APGI-AU/RNA-Seq/batch1
Summary of issues: https://github.com/icgc-argo/argo-meta/issues/37
Data transfer to Collab is complete. This ticket can be closed. The next step is to import the data into SONG.