mskcc / tempo

CCS research pipeline to process WES and WGS TN pairs
https://cmotempo.netlify.com/

clean up ngi-igenomes folder #1007

Open anoronh4 opened 6 months ago

anoronh4 commented 6 months ago

We have several files in the ngi-igenomes folder on juno that do not exist in the remote reference repository, which makes it difficult to recreate this reference folder in any other environment. Many of these paths are listed in the tempo references configuration file. Here's a list of files that are newer than Nov 16, 2018 (modified within the last 1930 days):

$ find $PWD -mtime -1930 -type f -exec ls -l {} \;
-rw-r----- 1 gongy cmopipeline 242018150 Mar 10  2022 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/hg19/1000G_phase1.indels.hg19.sites.vcf
-rw-r----- 1 gongy cmopipeline 90196895 Mar 10  2022 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
-rw-r----- 1 gongy cmopipeline 1484596 Mar 10  2022 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx
-rw-r----- 1 gongy cmopipeline 12381528 Mar 10  2022 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/hg19/dbsnp_138.hg19.vcf.idx
-rw-r----- 1 gongy cmopipeline 1238920 Mar 10  2022 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/hg19/1000G_phase1.indels.hg19.sites.vcf.idx
-rw-r----- 1 gongy cmopipeline 10796220779 Mar 10  2022 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/hg19/dbsnp_138.hg19.vcf
-rw-r--r-- 1 socci cmopipeline 1517 Mar 18  2019 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Annotation/intervals/human.b37.genome.bed
-rw-r--r-- 1 socci cmopipeline 1360930446 Mar  7  2019 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/WholeGenomeFasta/human_g1k_v37_decoy.fasta.microsatellites.list
-rw-rw-r-- 1 socci cmopipeline 3189750467 Feb 27  2019 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/WholeGenomeFasta/human_g1k_v37_decoy.fasta
-rw-r--r-- 1 socci cmopipeline 67108864 Apr 22  2019 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/WholeGenomeFasta/human_g1k_v37_decoy.fasta.index
-rw-r--r-- 1 noronhaa cmopipeline 16854 Jun 29  2022 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/BWAIndex/human_g1k_v37_decoy.fasta.dict
-rw-r--r-- 1 noronhaa cmopipeline 1176551519 Jun 30  2022 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/BWAIndex/human_g1k_v37_decoy.fasta.gridsscache
-rw-rw-r-- 1 socci cmopipeline 3189750467 Jul  1  2019 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/BWAIndex/human_g1k_v37_decoy.fasta
-rw-r--r-- 1 wooh cmopipeline 2813 Jun  7  2021 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/BWAIndex/human_g1k_v37_decoy.fasta.fai
-rw-r--r-- 1 socci cmopipeline 9040952644 Mar  5  2019 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/b37/dbsnp_137.b37__RmDupsClean__plusPseudo50__DROP_SORT.vcf
-rw-r--r-- 1 socci cmopipeline 1015019014 Mar  5  2019 /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/b37/dbsnp_137.b37__RmDupsClean__plusPseudo50__DROP_SORT.vcf.gz
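A modification-time cutoff only approximates which files are missing from the remote. A more direct check, sketched here under assumptions, is to diff the local tree against a manifest of remote paths. The function name `local_only_files` and the manifest file (one path per line, relative to the igenomes root, produced by whatever listing tool the remote supports) are hypothetical and not part of tempo.

```shell
# Hypothetical helper: print files that exist under LOCAL_ROOT but are
# absent from REMOTE_MANIFEST (one relative path per line).
local_only_files() {
    local_root=$1
    remote_manifest=$2
    tmp_dir=$(mktemp -d) || return 1

    # Relative paths of every local file, sorted in the C locale so that
    # comm sees both inputs in the same order.
    (cd "$local_root" && find . -type f) | sed 's|^\./||' \
        | LC_ALL=C sort > "$tmp_dir/local.txt"
    LC_ALL=C sort "$remote_manifest" > "$tmp_dir/remote.txt"

    # comm -23 keeps only lines unique to the first file (local-only paths).
    LC_ALL=C comm -23 "$tmp_dir/local.txt" "$tmp_dir/remote.txt"

    rm -rf "$tmp_dir"
}
```

Usage would look like `local_only_files /path/to/igenomes remote_manifest.txt`, where both arguments are placeholders.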

Some files, such as human.b37.genome.bed, human_g1k_v37_decoy.fasta.microsatellites.list and dbsnp_137.b37__RmDupsClean__plusPseudo50__DROP_SORT.vcf*, can be relocated somewhere outside of the igenomes directory. The fasta, fai, and dict files in /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/BWAIndex/ can be cleaned up or ignored, because I don't believe they are being used by tempo.

Some of the vcf files are also unzipped in the juno folder, but on igenomes they exist only as zipped files. This might cause confusion as well.
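To see which vcf files are affected, one option is to list every unzipped `.vcf` under a directory and note whether a zipped copy sits next to it. This is only a sketch: the function name `uncompressed_vcfs` is made up for illustration, and it assumes zipped copies use the `.vcf.gz` suffix, as the dbsnp_137 pair in the listing does.

```shell
# Hypothetical helper: list each unzipped .vcf under ROOT, tagged with
# whether a sibling .vcf.gz exists (assumed suffix for the zipped copy).
uncompressed_vcfs() {
    find "$1" -type f -name '*.vcf' | LC_ALL=C sort | while IFS= read -r vcf; do
        if [ -f "$vcf.gz" ]; then
            printf '%s\thas_gz\n' "$vcf"
        else
            printf '%s\tno_gz\n' "$vcf"
        fi
    done
}
```

Files tagged `has_gz` are candidates for removing the unzipped copy, while `no_gz` files would need to be compressed or checked against the remote first.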