mskcc / tempo

CCS research pipeline to process WES and WGS TN pairs
https://cmotempo.netlify.com/

Viral Integration #102

Open evanbiederstedt opened 5 years ago

evanbiederstedt commented 5 years ago

Here are the viruses we will try:

185 HPV subtypes, all HHVs (including EBV), Merkel cell polyomavirus, HTLV-1, and hepatitis B

We'll ask Clinical Bioinformatics (Anita) for help.

Method we'll try

Download the FASTAs for these viruses. Then concatenate these FASTAs with hg19. Then, use this reference as the reference for Delly and for Manta.

i.e. we do SV calling with this "special" reference.
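A minimal sketch of the concatenation step, assuming both FASTAs are plain text and end with a newline. The function name `build_combined_reference` and the file names are hypothetical; the duplicate-name check is an added safeguard, not something the plan above requires.

```shell
# build_combined_reference HUMAN_FASTA VIRAL_FASTA OUT_FASTA
# Concatenate the two FASTAs, then fail if any contig name occurs twice.
# (Hypothetical helper for illustration only.)
build_combined_reference() {
  cat "$1" "$2" > "$3" || return 1
  # The contig name is the first whitespace-delimited token after ">".
  dups=$(awk '/^>/ { name = substr($1, 2); if (seen[name]++) print name }' "$3")
  if [ -n "$dups" ]; then
    echo "duplicate contig names: $dups" >&2
    return 1
  fi
}
```

The combined file would then be indexed (e.g. with `samtools faidx`) before handing it to the SV callers. Note that the check only compares names, so the same sequence present under two different names would go unnoticed.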

Then, look at the translocation events (TRA, at least that's what Delly calls them). SV callers report viral integration sites as translocations because the read pair maps partly to a human chromosome and partly to a viral contig, so the caller thinks the read must be coming from another chromosome, i.e. "a translocation".
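As a rough sketch of that filtering step, the function below scans a plain-text VCF (for example, the output of `bcftools view` on the caller's BCF) and keeps records where exactly one end of the junction lies on a viral contig. It assumes each record carries a `CHR2` INFO key naming the partner contig, which is how Delly reports it as far as I know. The `viral_breakends` name and the one-name-per-line viral contig list are illustrative assumptions.

```shell
# viral_breakends VIRAL_NAMES_FILE VCF_FILE
# Print CHROM, POS and CHR2 for records joining a human and a viral contig.
# Assumes VIRAL_NAMES_FILE is non-empty, with one contig name per line.
viral_breakends() {
  awk -F'\t' '
    NR == FNR { viral[$1] = 1; next }   # first file: viral contig names
    /^#/ { next }                        # skip VCF header lines
    {
      chr2 = ""
      n = split($8, info, ";")           # column 8 is the INFO field
      for (i = 1; i <= n; i++)
        if (info[i] ~ /^CHR2=/) chr2 = substr(info[i], 6)
      if (chr2 == "") next               # no partner contig recorded
      a = ($1 in viral); b = (chr2 in viral)
      if (a != b) print $1 "\t" $2 "\t" chr2   # exactly one end is viral
    }
  ' "$1" "$2"
}
```

Virus-to-virus and human-to-human junctions are dropped on purpose, since only mixed pairs are candidate integration sites.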

evanbiederstedt commented 5 years ago

https://github.com/mskcc/vaporware/blob/develop/somatic.nf#L49-L112

Here's how to do this

delly call -t BND \
-g /juno/work/taylorlab/cmopipeline/mskcc-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/WholeGenomeFasta/human_g1k_v37_decoy.fasta \
-o output.bcf tumor.bam normal.bam

The TRA type is officially outdated, so we use BND instead.

allanbolipata commented 5 years ago

@evanbiederstedt The viral integration part is literally just the BND output? I can just publish it from the DellyCall process if we keep --exclude ${svCallingExcludeRegions}

evanbiederstedt commented 5 years ago

> The viral integration part is literally just the BND output? I can just publish it from the DellyCall process if we keep --exclude ${svCallingExcludeRegions}

@allanbolipata You'll need to use the special reference FASTA which has viral sequences

I also think it's worth not using --exclude ${svCallingExcludeRegions} here, but @kpjonsson might disagree ferociously.

allanbolipata commented 5 years ago

What is the special reference FASTA? I can add them.

evanbiederstedt commented 5 years ago

> What is the special reference FASTA? I can add them.

It's in /juno/work/taylorlab/cmopipeline/mskcc-igenomes/grch37/viral_reference

This needs to be set in the config file for GRCh37 vs. GRCh38

evanbiederstedt commented 5 years ago

CC @allanbolipata

Let's try Manta.

I believe this will work if we put the viral FASTA in --referenceFasta (in place of hg19.fa below):

${MANTA_INSTALL_PATH}/bin/configManta.py \
--normalBam normal.bam \
--tumorBam tumor.bam \
--referenceFasta hg19.fa \
--runDir ${MANTA_ANALYSIS_PATH}

This is an experimental test.

https://github.com/Illumina/manta/blob/master/docs/userGuide/README.md

But let's try a few things:

-- Tumor-only analysis, with the viral FASTA in --referenceFasta:

${MANTA_INSTALL_PATH}/bin/configManta.py \
--tumorBam HCC1187C.cram \
--referenceFasta hg19.fa \
--runDir ${MANTA_ANALYSIS_PATH}

-- Single diploid sample analysis, with the viral FASTA in --referenceFasta:

${MANTA_INSTALL_PATH}/bin/configManta.py \
--bam NA12878_S1.bam \
--referenceFasta hg19.fa \
--runDir ${MANTA_ANALYSIS_PATH}

Let's run this configuration on the 25 BRCA samples: https://github.com/mskcc/vaporware/blob/master/test_inputs/lsf/WES_25TN.tsv

kpjonsson commented 5 years ago

We're unlikely to catch any viral integration in any of those samples. To prove that it works, we likely need to download some published samples (e.g. TCGA) with viral integration.

allanbolipata commented 5 years ago

@evanbiederstedt Should I not use --exome?

@kpjonsson Do you have TCGA sample bams?

kpjonsson commented 5 years ago

> @kpjonsson Do you have TCGA sample bams?

Not at hand. One could try, for example, a set of the TCGA liver cancers, where viral integration is common. That being said, it'll still be a sparse signal since these are exomes, not genomes. Not sure what to expect.

kpjonsson commented 5 years ago

There are some stomach cancer BAMs in /ifs/tcga/stad/BAMs/ that might work. A fraction of stomach cancers carry EBV, and they picked these up based on exome data, although with a method different from the one we intend to use.

allanbolipata commented 5 years ago

Error running Manta

ERROR ~ Error executing process > 'RunMantaViralFasta (1)'

Caused by:
  Process `RunMantaViralFasta (1)` terminated with an error exit status (1)

Command executed:

  configManta.py     --exome     --referenceFasta SuperReference.fa     --normalBam normal_sample.sorted.md.bqsr.bam     --tumorBam tumor_sample.sorted.md.bqsr.bam     --runDir Manta

  python Manta/runWorkflow.py     --mode local     --jobs 8

Command exit status:
  1

Command output:
  (empty)

Command error:

  CONFIGURATION ERROR:
  Reference genome mismatch: Reference fasta file is missing a chromosome found in the Normal BAM/CRAM file: 'NC_007605'

Work dir:
  /juno/work/pi/cmopipeline/nextflow/vaporware_executes/executor_1/work/c8/3ff57b97d832a6c0b03b6470918cbc

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

The SuperReference.fa file does have a >EBVType1.NC_007605.1 record, though, so the sequence is there under a different contig name.

Gonna try what's referenced at https://github.com/Illumina/manta/issues/93

gongyixiao commented 5 years ago

This is the email I sent back on March 4th:

Just a note of what I have read so far. Keep a record for future use.

Questions:

Useful links:

Initial discussion: https://www.biostars.org/p/227778/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6050683/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4673242/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6283451/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4499804/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4333248/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4580395/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2224419/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3754044/

https://www.biorxiv.org/content/biorxiv/early/2017/10/25/208926.full.pdf

ckandoth commented 5 years ago

Since EBV commonly contaminates human DNA, we include it in the standard GRCh37 reference. You can do a samtools idxstats on all Roslin DMP-WES BAMs aligned to date, to find something with an abundance of EBV DNA. Then use that for your SV caller tests to zero in on the integration site.
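A small sketch of that screening step, assuming the standard four-column `samtools idxstats` output (contig name, length, mapped reads, unmapped reads). The helper name `contig_fraction` is hypothetical, and `NC_007605` is taken from the Manta error message earlier in this thread as the EBV contig name in the BAMs.

```shell
# contig_fraction CONTIG < idxstats.txt
# Print mapped reads on CONTIG, total mapped reads, and their ratio.
# (Hypothetical helper for illustration only.)
contig_fraction() {
  awk -F'\t' -v contig="$1" '
    { total += $3 }                 # column 3 holds mapped read counts
    $1 == contig { hits = $3 }
    END {
      frac = (total > 0) ? hits / total : 0
      printf "%d\t%d\t%.6f\n", hits, total, frac
    }
  '
}

# Example use on one BAM (requires samtools; not run here):
#   samtools idxstats sample.bam | contig_fraction NC_007605
```

Running this over every BAM and sorting by the last column would give a ranked list of candidates for the SV caller tests.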

allanbolipata commented 5 years ago

RE: https://github.com/mskcc/vaporware/issues/102#issuecomment-490531767

It's a reference file issue; gonna wait on a new set of reference files then gonna try a re-run.

kpjonsson commented 5 years ago

Seems to be that SuperReference.fa is built on a different version of the human genome than the one we align against. Maybe that's what you already figured out @allanbolipata?

evanbiederstedt commented 5 years ago

> Seems to be that SuperReference.fa is built on a different version of the human genome than the one we align against. Maybe that's what you already figured out @allanbolipata?

@kpjonsson We're way ahead of you, my friend by the Charles River

We're re-creating the viral FASTA now

allanbolipata commented 5 years ago

@kpjonsson Yeah it's in (JUNO-only) ${params.reference_base}/mskcc-igenomes/grch37/viral_reference/human_g1k_v37_plus_all_viruses.fa

But I ran into another error, which is the inverse of https://github.com/mskcc/vaporware/issues/102#issuecomment-490531767:

  Reference genome mismatch: Normal BAM/CRAM file is missing a chromosome found in the reference fasta file: 'HPVType14D.X74467.1'

Manta doesn't seem to work with a reference file that differs from the one used to make the BAMs.
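Both error messages come down to contig names that appear in one file but not the other. A quick way to list every such name before launching Manta might look like the sketch below. The three helper names are hypothetical, and the assumption that Manta matches contigs by exact name is inferred from the error text, not checked against its source.

```shell
# fasta_contigs FASTA: first whitespace-delimited token of each header line.
fasta_contigs() {
  awk '/^>/ { print substr($1, 2) }' "$1" | sort -u
}

# sam_header_contigs: read SAM header text on stdin, print the @SQ SN values.
sam_header_contigs() {
  awk -F'\t' '$1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }' | sort -u
}

# compare_contigs FASTA_NAMES BAM_NAMES: both files sorted, one name per line.
compare_contigs() {
  comm -23 "$1" "$2" | sed 's/^/only in FASTA: /'
  comm -13 "$1" "$2" | sed 's/^/only in BAM: /'
}

# Example use (requires samtools; not run here):
#   fasta_contigs reference.fa > fasta.names
#   samtools view -H normal.bam | sam_header_contigs > bam.names
#   compare_contigs fasta.names bam.names
```

An empty result would mean the two name sets agree. It says nothing about whether the sequences themselves are identical.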

evanbiederstedt commented 5 years ago

It appears clinbx uses https://github.com/G100DKFZ/gene-is

This looks promising as well: https://github.com/namphuon/ViFi