lehtiolab / proteogenomics-analysis-workflow

IPAW: a Nextflow workflow for proteogenomics

update IPAW to hg38 genome #3

Open yafeng opened 6 years ago

yafeng commented 6 years ago

The current IPAW pipeline uses hg19-based databases, so the coordinates it reports for novel peptides and SAAV peptides are hg19 genomic coordinates. The goal is to make IPAW compatible with the latest genome assembly, hg38.

yafeng commented 6 years ago
  1. Find corresponding hg38 resources - conservation bigWig files.

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phastCons100way/hg38.phastCons100way.bw

https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg38/latest/PhyloCSF+1.bw
https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg38/latest/PhyloCSF+2.bw
https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg38/latest/PhyloCSF+3.bw
https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg38/latest/PhyloCSF-1.bw
https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg38/latest/PhyloCSF-2.bw
https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg38/latest/PhyloCSF-3.bw
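
The six PhyloCSF tracks share one naming pattern (strand sign followed by frame number), so the URLs can be generated rather than typed out. A minimal sketch, assuming a POSIX shell; `phylocsf_urls` is a hypothetical helper name and not part of IPAW, and the base URL is copied from the links above.

```shell
#!/bin/sh
# Base URL taken from the PhyloCSF links listed above.
PHYLOCSF_BASE="https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg38/latest"

# Print the six PhyloCSF bigWig URLs: frames 1-3 on the + and - strands.
# (phylocsf_urls is a hypothetical helper name, not part of IPAW.)
phylocsf_urls() {
    for strand in + -; do
        for frame in 1 2 3; do
            printf '%s/PhyloCSF%s%s.bw\n' "$PHYLOCSF_BASE" "$strand" "$frame"
        done
    done
}

# List the URLs. To download them, pipe the list into wget:
#   phylocsf_urls | wget -i -
phylocsf_urls
```

Printing the list first makes it easy to check the URLs before starting the downloads.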

yafeng commented 6 years ago
  1. COSMIC and dbSNP in hg38 version

Get the SNP database: https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=682621893_rZAeDI3qkmv2ea9OULNeBo6GjEui&clade=mammal&org=&db=hg38&hgta_group=varRep&hgta_track=snp150Common&hgta_table=snp150CodingDbSnp&hgta_regionType=genome&position=&hgta_outputType=primaryTable&hgta_outFileName=snp150CodingDbSnp.txt

Get the COSMIC database:

    sftp 'your_email_address@example.com'@sftp-cancer.sanger.ac.uk

Download the data:

sftp> get cosmic/grch38/cosmic/v85/CosmicMutantExport.tsv.gz
sftp> exit
yafeng commented 6 years ago
  1. Get the hg38 masked genome sequence
    wget hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFaMasked.tar.gz
    tar -xzf hg38.chromFaMasked.tar.gz
    for chr in {1..22} X Y M; do cat chr$chr.fa.masked >> hg38.chr1-22.X.Y.M.fa.masked; done
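
Note that the loop above appends with `>>`, so running it a second time duplicates every chromosome in the output, and a missing input file only produces a `cat` error while the loop carries on. A slightly more defensive sketch of the same step follows. It assumes the per-chromosome files are named `chr<N>.fa.masked` as in the loop above; `concat_masked` is a hypothetical helper name.

```shell
#!/bin/sh
# Concatenate per-chromosome masked FASTA files in a fixed order.
# concat_masked is a hypothetical helper; it assumes files named
# chr<N>.fa.masked in directory $1 and writes the combined file to $2.
concat_masked() {
    src_dir=$1
    out=$2
    : > "$out"    # truncate first so a rerun does not append duplicates
    for chr in $(seq 1 22) X Y M; do
        f="$src_dir/chr$chr.fa.masked"
        if [ ! -f "$f" ]; then
            echo "missing $f" >&2
            return 1
        fi
        cat "$f" >> "$out"
    done
}

# Usage, with the extracted files in the current directory:
#   concat_masked . hg38.chr1-22.X.Y.M.fa.masked
#   grep -c '^>' hg38.chr1-22.X.Y.M.fa.masked   # one header per chromosome file
```

Counting the FASTA headers afterwards is a quick check that all 25 chromosome files (1-22, X, Y, M) were included exactly once.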
yafeng commented 6 years ago
  1. varDB 2.0 database with latest pseudogene, lncRNA, nsSNPs and COSMIC DB, aiming to include:
    a. GENCODE release 28 pseudogenes, including consensus pseudogenes predicted by the Yale and UCSC pipelines
    b. lncRNAs from LNCipedia v5.1 (hg38)
    c. mutant peptides derived from somatic mutations in COSMIC v85
    d. mutant peptides derived from nsSNPs in dbSNP150
yafeng commented 6 years ago

varDB2.0 database can be downloaded from: wget http://lehtiolab.se/Supplementary_Files/VarDB2.zip

yafeng commented 6 years ago

Add a command-line option, --hg19 or --hg38, so that the workflow can be run against either genome assembly. The following processes need to be modified accordingly: BLATnovel, phastcons, phyloCSF, annovar
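
The two flags are mutually exclusive, so whatever reads them should reject a run that gives both or neither. The sketch below shows that check in plain shell, for example in a wrapper script. It is an illustration only, not the IPAW implementation; `parse_assembly` is a hypothetical helper name. In Nextflow itself, a double-dash option such as `--hg38` is exposed to the script as `params.hg38`.

```shell
#!/bin/sh
# Pick exactly one genome assembly from the command-line flags.
# parse_assembly is a hypothetical helper, not part of IPAW.
parse_assembly() {
    assembly=""
    for arg in "$@"; do
        case "$arg" in
            --hg19|--hg38)
                if [ -n "$assembly" ]; then
                    echo "choose only one of --hg19 or --hg38" >&2
                    return 1
                fi
                assembly=${arg#--}   # strip the leading dashes
                ;;
        esac
    done
    if [ -z "$assembly" ]; then
        echo "one of --hg19 or --hg38 is required" >&2
        return 1
    fi
    echo "$assembly"
}

# Example: resolve the assembly once, then use it to select resources.
selected=$(parse_assembly --hg38) && echo "selected assembly: $selected"
```

Resolving the assembly in one place means each affected process (BLATnovel, phastcons, phyloCSF, annovar) can pick its database from the same value.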

yafeng commented 5 years ago

I made a copy of ipaw for hg38 genome. https://github.com/yafeng/proteogenomics-analysis-workflow/commit/839f18545053145ffca8b36723811f691beb6578

TnakaNY commented 4 years ago

Could you provide a copy of IPAW for the hg38 genome? Or is the latest version already for hg38?

TnakaNY commented 4 years ago

Hi Yafeng,

Could you please also upload varDB2.0 somewhere? I could not download it using the link you suggested above.

Thx.

yafeng commented 4 years ago

@TnakaNY try this link for VarDB2.0 https://drive.google.com/open?id=1G20qIF60xdJ5zrSbt8a8sd0RKutYxQMC

You need to use the IPAW hg38 version, which I uploaded to my GitHub repo. You also need to use conda to set up a local environment so that all the required executables can be found. It takes some effort to set up. Otherwise, I suggest you continue to use hg19, which is better maintained.

https://github.com/yafeng/proteogenomics-analysis-workflow/blob/master/ipaw.local.hg38.nf

TnakaNY commented 4 years ago

Thank you, let me try!