IPAW: a Nextflow workflow for proteogenomics

Integrated proteogenomics analysis workflow


This is a workflow to identify, curate, and validate variant and novel peptides from MS proteomics spectra, using databases that contain novel and variant peptides, such as the VarDB database. VarDB combines entries from COSMIC, PGOHUM, CanProVar and LNCipedia. The workflow takes mzML spectra files as input, is powered by Nextflow, and runs in Docker or Singularity containers.

Searches are run with MSGF+ against a concatenated target/decoy database, and the results are passed to Percolator for statistical evaluation, in which FDR is determined in a class-specific manner: known peptides are filtered out, and novel and variant peptides are divided into separate FDR arms. Thereafter a curation procedure evaluates the resulting peptides against several criteria, depending on the peptide class.
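A minimal sketch of the class-specific split, assuming a hypothetical tab-separated PSM table with the peptide class in column 3 (the pipeline does this internally on the Percolator output):

# Known peptides are filtered out; novel and variant PSMs are
# written to separate tables, each getting its own FDR arm
awk -F'\t' '$3 == "novel"'   psms.tsv > novel_psms.tsv
awk -F'\t' '$3 == "variant"' psms.tsv > variant_psms.tsv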

Please cite the following paper when you use the workflow in a publication :)

Zhu Y, Orre LM, Johansson HJ, Huss M, Boekel J, Vesterlund M, Fernandez-Woodbridge A, Branca RMM, Lehtio J: Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat Commun 2018, 9(1):903. PMID: 29500430

[workflow overview image]

Before running

Detailed pipeline inputs

Prepare once

# Get this repo
git clone https://github.com/lehtiolab/proteogenomics-analysis-workflow
cd proteogenomics-analysis-workflow

# Get Annovar
cd /path/to/your/annovar
wget __link_you_get_from_annovar__
tar xvfz annovar.latest.tar.gz
# This creates a folder with annotate_variation.pl and more files, to be passed to the pipeline with --annovar_dir
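An optional sanity check, assuming the standard ANNOVAR layout:

# annotate_variation.pl should print its usage message when
# invoked without arguments
perl /path/to/your/annovar/annotate_variation.pl 2>&1 | head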

# Download bigwigs, this can take some time
cd /path/to/your/bigwigs  # this dir will be passed to the pipeline with --bigwigs
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons.bw 
wget https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg19/latest/PhyloCSF+0.bw
wget https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg19/latest/PhyloCSF+1.bw
wget https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg19/latest/PhyloCSF+2.bw
wget https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg19/latest/PhyloCSF-0.bw
wget https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg19/latest/PhyloCSF-1.bw
wget https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg19/latest/PhyloCSF-2.bw
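The six PhyloCSF tracks can equivalently be fetched in a loop (same URLs as above):

# Both strands, all three reading frames
for strand in + -; do
  for frame in 0 1 2; do
    wget "https://data.broadinstitute.org/compbio1/PhyloCSFtracks/hg19/latest/PhyloCSF${strand}${frame}.bw"
  done
done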

# In the meantime, download and extract varDB data (Fasta, GTF, BlastP, SNP Fasta) to a good spot
wget -O varDB_data.tar.gz https://ndownloader.figshare.com/files/13358006 
tar xvfz varDB_data.tar.gz

# Get the hg19 masked genome sequence
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFaMasked.tar.gz
tar xvfz chromFaMasked.tar.gz
for chr in {1..22} X Y M; do cat chr$chr.fa.masked >> hg19.chr1-22.X.Y.M.fa.masked; done
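A quick sanity check on the concatenated genome:

# Expect 25 sequences: chromosomes 1-22 plus X, Y and M
grep -c '^>' hg19.chr1-22.X.Y.M.fa.masked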

# Download the Ensembl protein database
wget ftp://ftp.ensembl.org/pub/release-91/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz
gunzip Homo_sapiens.GRCh38.pep.all.fa.gz

# Get the COSMIC database
sftp 'your_email_address@example.com'@sftp-cancer.sanger.ac.uk
# Download the data
sftp> get cosmic/grch37/cosmic/v81/CosmicMutantExport.tsv.gz
sftp> exit
# Extract COSMIC data (gzipped TSV, not a tar archive)
gunzip CosmicMutantExport.tsv.gz
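Optionally, inspect the column names to confirm the export is intact:

# Print the header fields of the mutant export, one per line
head -n 1 CosmicMutantExport.tsv | tr '\t' '\n'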

Analyse your mzML files with VarDB

Example command to search TMT 10-plex labelled data in Docker. Remove the --isobaric parameter if you have label-free data.

nextflow run main.nf --tdb /path/to/VarDB.fasta \
  --mzmldef spectra_file_list.txt \
  --activation hcd \
  --isobaric 'set01:tmt10plex:131 set02:tmt10plex:131 set03:tmt10plex:127N' \
  --gtf /path/to/VarDB.gtf \
  --mods /path/to/tmt_mods.txt \
  --knownproteins /path/to/Homo_sapiens.GRCh38.pep.all.fa \
  --blastdb /path/to/UniProteome+Ensembl94+GENCODE24.proteins.fasta \
  --cosmic /path/to/CosmicMutantExport.tsv \
  --snpfa /path/to/MSCanProVar_ensemblV79.filtered.fasta \
  --genome /path/to/hg19.chr1-22.X.Y.M.fa.masked \
  --dbsnp /path/to/snp142CodingDbSnp.txt \
  --annovar_dir /path/to/your/annovar \
  --bigwigs /path/to/your/bigwigs \
  --bamfiles /path/to/\*.bam \
  --outdir /path/to/results \
  -profile standard,docker # replace docker with singularity if needed
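The file passed to --mzmldef lists the input mzML files. A hypothetical sketch, tab-separated with the sample set name in the second column (consult the detailed pipeline inputs above for the exact column layout your version expects):

/path/to/spectra/set01_fraction01.mzML	set01
/path/to/spectra/set01_fraction02.mzML	set01
/path/to/spectra/set02_fraction01.mzML	set02

If a run is interrupted, adding Nextflow's standard -resume flag to the same command continues from cached results instead of recomputing everything.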