icgc-argo / workflow-roadmap

Roadmap and management for genomic data processing
GNU Affero General Public License v3.0

ICGC25k-PCAWG data migrate to OICR Isilon #379

Closed lindaxiang closed 1 year ago

lindaxiang commented 1 year ago

As noted in ticket, we have audited all the ICGC25k data in Collab.

This is the summary of ICGC25k-PCAWG Non-US data across the various repositories: https://docs.google.com/spreadsheets/d/184EVudu9H59RD14zDwQLt2sx1rMHtfvTAXc5sLAv4p8/edit#gid=942384972

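| Data Type | Total #files | Collab only / File size (GB) | Collab/AWS/Azure/EGA | EGA | Missing |
| --- | --- | --- | --- | --- | --- |
| RNA-Seq-BAM | 2207 | 0 | 2207 | 2179 | 0 |
| WGS-BAM | 7560 | 0 | 7560 | 7560 | 0 |
| WGS-VCF | 85669 | 22835 / 129 GB | 62834 | 0 | 0 |
| WGS-minibam | 7560 | 1990 / 341 GB | 4862 | 0 | 708 |
| Validation-BAM | 32 | 32 | 0 | 0 | |
| Pilot50-VCF | 48 | 48 | 0 | 0 | |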

Note: The file status only considers data from PCAWG donors on the Whitelist and Graylist in the latest release (May 2016).
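As a quick sanity check on the summary table above, the per-row counts can be added up. This is a minimal sketch that assumes the "Collab only", "Collab/AWS/Azure/EGA" and "Missing" columns partition the total, which is an interpretation of the table rather than something the audit states. Only rows with all five values are included, and the names `ROWS`, `inconsistent_rows` and `collab_only_total` are hypothetical.

```python
# Rows transcribed from the summary table; only rows with all five
# values present are included.
# Columns: (total, collab_only, multi_repo, ega, missing)
ROWS = {
    "RNA-Seq-BAM": (2207, 0, 2207, 2179, 0),
    "WGS-BAM": (7560, 0, 7560, 7560, 0),
    "WGS-VCF": (85669, 22835, 62834, 0, 0),
    "WGS-minibam": (7560, 1990, 4862, 0, 708),
}


def inconsistent_rows(rows):
    """Return the data types whose counts do not add up to the total.

    Assumption: collab_only + multi_repo + missing should equal total.
    """
    bad = []
    for name, (total, collab_only, multi_repo, _ega, missing) in rows.items():
        if collab_only + multi_repo + missing != total:
            bad.append(name)
    return bad


def collab_only_total(rows):
    """Sum the Collab-only file counts across the given rows."""
    return sum(values[1] for values in rows.values())


if __name__ == "__main__":
    print("Inconsistent rows:", inconsistent_rows(ROWS))
    print("Collab-only files in these rows:", collab_only_total(ROWS))
```

Under that assumption all four rows are consistent, and the Collab-only files in them come from the WGS-VCF and WGS-minibam rows.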

With Collaboratory shutting down, we will migrate all Collab-only Non-US PCAWG files to OICR Isilon storage.

The PCAWG Non-US files that need to be copied are:

lindaxiang commented 1 year ago

Considering that Azure and AWS may also be retired in mid-2024, EGA would be left as the only long-term repository. We decided to copy all files from the ICGC portion of PCAWG that do not have an EGA copy to OICR Isilon. These will include:

To get these into the PCAWG release folder, we need to copy the files to the OICR Isilon storage directory (Instructions: https://wiki.oicr.on.ca/display/icgcargotech/Copying+Files+to+Isilon).

lindaxiang commented 1 year ago

The following controlled tier files are ready to be transferred from dcc-proxy to the Portal:

| operation | file/folder | path on dcc-proxy | path on portal |
| --- | --- | --- | --- |
| add | README.md | /nfs/hadoop/workspace/pcawg/rnaseq_aligned_bams | https://dcc.icgc.org/releases/PCAWG/rnaseq_aligned_bams |
| add | PCAWG.RNA-Seq.icgc.aligned_bam.metadata.txt | /nfs/hadoop/workspace/pcawg/rnaseq_aligned_bams | https://dcc.icgc.org/releases/PCAWG/rnaseq_aligned_bams |
| add | PCAWG.RNA-Seq.ESAD-UK.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/rnaseq_aligned_bams | https://dcc.icgc.org/releases/PCAWG/rnaseq_aligned_bams |

| operation | file/folder | path on dcc-proxy | path on portal |
| --- | --- | --- | --- |
| add | README.md | /nfs/hadoop/workspace/pcawg/broad_calls | https://dcc.icgc.org/releases/PCAWG/broad_calls |
| add | PCAWG.WGS.icgc.broad.metadata.txt | /nfs/hadoop/workspace/pcawg/broad_calls | https://dcc.icgc.org/releases/PCAWG/broad_calls |
| add | PCAWG_BROAD.germline.indel.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/broad_calls | https://dcc.icgc.org/releases/PCAWG/broad_calls |
| add | PCAWG_BROAD.germline.sv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/broad_calls | https://dcc.icgc.org/releases/PCAWG/broad_calls |
| add | PCAWG_BROAD.somatic.indel.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/broad_calls | https://dcc.icgc.org/releases/PCAWG/broad_calls |
| add | PCAWG_BROAD.somatic.snv_mnv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/broad_calls | https://dcc.icgc.org/releases/PCAWG/broad_calls |
| add | PCAWG_BROAD.somatic.sv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/broad_calls | https://dcc.icgc.org/releases/PCAWG/broad_calls |

| operation | file/folder | path on dcc-proxy | path on portal |
| --- | --- | --- | --- |
| add | README.md | /nfs/hadoop/workspace/pcawg/dkfz_embl_calls | https://dcc.icgc.org/releases/PCAWG/dkfz_embl_calls |
| add | PCAWG.WGS.icgc.dkfz_embl.metadata.txt | /nfs/hadoop/workspace/pcawg/dkfz_embl_calls | https://dcc.icgc.org/releases/PCAWG/dkfz_embl_calls |
| add | PCAWG_DKFZ_EMBL.germline.indel.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/dkfz_embl_calls | https://dcc.icgc.org/releases/PCAWG/dkfz_embl_calls |
| add | PCAWG_DKFZ_EMBL.germline.snv_mnv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/dkfz_embl_calls | https://dcc.icgc.org/releases/PCAWG/dkfz_embl_calls |
| add | PCAWG_DKFZ_EMBL.germline.sv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/dkfz_embl_calls | https://dcc.icgc.org/releases/PCAWG/dkfz_embl_calls |
| add | PCAWG_DKFZ_EMBL.somatic.cnv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/dkfz_embl_calls | https://dcc.icgc.org/releases/PCAWG/dkfz_embl_calls |
| add | PCAWG_DKFZ_EMBL.somatic.indel.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/dkfz_embl_calls | https://dcc.icgc.org/releases/PCAWG/dkfz_embl_calls |
| add | PCAWG_DKFZ_EMBL.somatic.snv_mnv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/dkfz_embl_calls | https://dcc.icgc.org/releases/PCAWG/dkfz_embl_calls |
| add | PCAWG_DKFZ_EMBL.somatic.sv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/dkfz_embl_calls | https://dcc.icgc.org/releases/PCAWG/dkfz_embl_calls |

| operation | file/folder | path on dcc-proxy | path on portal |
| --- | --- | --- | --- |
| add | README.md | /nfs/hadoop/workspace/pcawg/sanger_calls | https://dcc.icgc.org/releases/PCAWG/sanger_calls |
| add | PCAWG.WGS.icgc.sanger.metadata.txt | /nfs/hadoop/workspace/pcawg/sanger_calls | https://dcc.icgc.org/releases/PCAWG/sanger_calls |
| add | PCAWG_SANGER.somatic.cnv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/sanger_calls | https://dcc.icgc.org/releases/PCAWG/sanger_calls |
| add | PCAWG_SANGER.somatic.indel.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/sanger_calls | https://dcc.icgc.org/releases/PCAWG/sanger_calls |
| add | PCAWG_SANGER.somatic.snv_mnv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/sanger_calls | https://dcc.icgc.org/releases/PCAWG/sanger_calls |
| add | PCAWG_SANGER.somatic.sv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/sanger_calls | https://dcc.icgc.org/releases/PCAWG/sanger_calls |

| operation | file/folder | path on dcc-proxy | path on portal |
| --- | --- | --- | --- |
| add | README.md | /nfs/hadoop/workspace/pcawg/muse_calls | https://dcc.icgc.org/releases/PCAWG/muse_calls |
| add | PCAWG.WGS.icgc.muse.metadata.txt | /nfs/hadoop/workspace/pcawg/muse_calls | https://dcc.icgc.org/releases/PCAWG/muse_calls |
| add | PCAWG_MUSE.somatic.snv_mnv.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/muse_calls | https://dcc.icgc.org/releases/PCAWG/muse_calls |

| operation | file/folder | path on dcc-proxy | path on portal |
| --- | --- | --- | --- |
| add | README.md | /nfs/hadoop/workspace/pcawg/pilot50_calls | https://dcc.icgc.org/releases/PCAWG/pilot50_calls |
| add | PCAWG.Pilot50.icgc.vcf.metadata.txt | /nfs/hadoop/workspace/pcawg/pilot50_calls | https://dcc.icgc.org/releases/PCAWG/pilot50_calls |
| add | PCAWG_Pilot50.somatic.mutation.icgc.controlled.tgz | /nfs/hadoop/workspace/pcawg/pilot50_calls | https://dcc.icgc.org/releases/PCAWG/pilot50_calls |

| operation | file/folder | path on dcc-proxy | path on portal |
| --- | --- | --- | --- |
| add | README.md | /nfs/hadoop/workspace/pcawg/validation_bams | https://dcc.icgc.org/releases/PCAWG/validation_bams |
| add | PCAWG.Validation.icgc.aligned_bam.metadata.txt | /nfs/hadoop/workspace/pcawg/validation_bams | https://dcc.icgc.org/releases/PCAWG/validation_bams |
| add | PCAWG.Pilot50.validation_bam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/validation_bams | https://dcc.icgc.org/releases/PCAWG/validation_bams |

| operation | file/folder | path on dcc-proxy | path on portal |
| --- | --- | --- | --- |
| add | README.md | /nfs/hadoop/workspace/pcawg/wgs_aligned_bams | https://dcc.icgc.org/releases/PCAWG/wgs_aligned_bams |
| add | PCAWG.WGS.icgc.aligned_bam.metadata.txt | /nfs/hadoop/workspace/pcawg/wgs_aligned_bams | https://dcc.icgc.org/releases/PCAWG/wgs_aligned_bams |

| operation | file/folder | path on dcc-proxy | path on portal |
| --- | --- | --- | --- |
| add | README.md | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | PCAWG.WGS.icgc.minibam.metadata.txt | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | BOCA-UK.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | BRCA-EU.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | BRCA-UK.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | BTCA-SG.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | CLLE-ES.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | CMDI-UK.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | EOPC-DE.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | ESAD-UK.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | GACA-CN.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | LAML-KR.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | LICA-FR.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | LINC-JP.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | LIRI-JP.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | MALY-DE.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | MELA-AU.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | ORCA-IN.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | OV-AU.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | PACA-AU.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | PACA-CA.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | PAEN-AU.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | PAEN-IT.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | PBCA-DE.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | PRAD-CA.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | PRAD-UK.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |
| add | RECA-EU.minibam.icgc.controlled.access (folder) | /nfs/hadoop/workspace/pcawg/minibams | https://dcc.icgc.org/releases/PCAWG/minibams |

lindaxiang commented 1 year ago

Hit a blocker: the staging area at dcc-proxy.res.oicr.on.ca:/nfs/hadoop/workspace/pcawg has run out of space.
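A pre-flight check before copying could catch this kind of blocker earlier. The following is only a sketch, assuming both the source tree and the staging area are mounted as local paths; the function names and the 5% headroom are hypothetical and not part of the actual transfer procedure.

```python
import os
import shutil


def tree_size_bytes(root):
    """Total size in bytes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def has_room(source_dir, staging_dir, headroom=0.05):
    """Return True if staging_dir reports enough free space for source_dir.

    headroom is an extra safety fraction (an assumed value, tune as needed).
    """
    needed = tree_size_bytes(source_dir)
    free = shutil.disk_usage(staging_dir).free
    return needed * (1 + headroom) <= free
```

Note that `shutil.disk_usage` reports free space on the filesystem; it may not reflect per-user or per-directory quotas.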

lindaxiang commented 1 year ago

Thanks to Jared's kind help, the blocker has been removed.

jmimico commented 1 year ago

Hit another blocker. This time the quota on the hadoop fs was tripped @ 10TB. I've asked IT to increase this by 2.5TB. Will resume transfers then.

jmimico commented 1 year ago

All data has been copied to hadoop and is available at https://dcc.icgc.org/releases/PCAWG/ . Please validate the copy when you can.
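One way such a validation could be done is by comparing checksums between the staging tree and the copy. The thread does not say how the copy was actually validated; this is a minimal sketch assuming both trees are reachable as local directories, with hypothetical function names and SHA-256 chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_dir, copy_dir):
    """Compare every file under source_dir against the same path in copy_dir.

    Returns (missing, mismatched): relative paths absent from the copy, and
    relative paths whose contents differ.
    """
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    missing, mismatched = [], []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = copy_dir / rel
        if not dst.is_file():
            missing.append(str(rel))
        elif sha256sum(src) != sha256sum(dst):
            mismatched.append(str(rel))
    return missing, mismatched
```

For example, `compare_trees("/nfs/hadoop/workspace/pcawg/minibams", "/path/to/copy/minibams")` would return two empty lists when the copy is complete and intact; the second path here is a placeholder.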

lindaxiang commented 1 year ago

Double-checked the copied files in the PCAWG release folders. All look good to me. Thanks to Jared! The ticket can be closed.

edsu7 commented 1 year ago

Closing!