hall-lab / svtyper

Bayesian genotyper for structural variants
MIT License
126 stars 55 forks source link

SVtyper with long reads #11

Open timp0 opened 9 years ago

timp0 commented 9 years ago

So - I'm trying to generate SV frequencies from long read data (ONT) aligned with a split-read aligner (BWA MEM) using SVtyper, but I keep getting an error:

Error: Traceback (most recent call last): File "/home/timp/Code/svtyper/svtyper", line 1078, in sys.exit(main()) File "/home/timp/Code/svtyper/svtyper", line 1070, in main args.debug) File "/home/timp/Code/svtyper/svtyper", line 767, in sv_genotype sample = Sample(bam_list[i], spl_bam_list[i], num_samp) File "/home/timp/Code/svtyper/svtyper", line 709, in init self.name = bam.header['RG'][0]['SM'] KeyError: 'RG'

Command:

BWAMEM

~/Code/bwa/bwa mem -x ont2d /mithril/Data/NGS/Reference/human/hg19.fa ${fastq}_BC${bc}.fastq.gz >${outdir}/${prefix}.bwa.sam
samtools view -b -S ${outdir}/${prefix}.bwa.sam | samtools sort - ${outdir}/${prefix}.bwa
samtools index ${outdir}/${prefix}.bwa.bam

##From LUMPY readme                                                                                                                                                                                                                                                       

samtools view -b -F 1294 -S ${outdir}/${prefix}.bwa.sam >${outdir}/${prefix}.bwa.discordants.unsorted.bam

# Extract the split-read alignments                                                                                                                                                                                                                                       
samtools view -h -S ${outdir}/${prefix}.bwa.sam \
    | ~/Code/lumpy-sv/scripts/extractSplitReads_BwaMem -i stdin \
    | samtools view -Sb - \
    > ${outdir}/${prefix}.bwa.splitters.unsorted.bam

# Sort both alignments                                                                                                                                                                                                                                                    
samtools sort ${outdir}/${prefix}.bwa.discordants.unsorted.bam ${outdir}/${prefix}.bwa.discordants
samtools sort ${outdir}/${prefix}.bwa.splitters.unsorted.bam ${outdir}/${prefix}.bwa.splitters

samtools index ${outdir}/${prefix}.bwa.splitters.bam

~/Code/lumpy-sv/bin/lumpy \
    -mw 4 \
    -tt 0 \
    -sr id:${prefix}.bwa,bam_file:${outdir}/${prefix}.bwa.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > ${outdir}/${prefix}.bwa.vcf

cat ${outdir}/${prefix}.bwa.vcf | \
    ~/Code/svtyper/svtyper \
    -B ${outdir}/${prefix}.bwa.bam \
    -S ${outdir}/${prefix}.bwa.splitters.bam \
    > ${outdir}/${prefix}.bwa.gt.vcf
cc2qe commented 9 years ago

I have no idea how svtyper will react to long-reads in general, but it looks like that problem is a result of improper read groups in your BAM. The easiest way to add readgroup ID and SM information is with bamaddrg

timp0 commented 9 years ago

Ok - I have added readgroup ID and SM information by adding an -R tag to my bwa command, but now I get a different error: "IndexError: list index out of range"

Error:

Traceback (most recent call last): File "/home/timp/Code/svtyper/svtyper", line 1078, in sys.exit(main()) File "/home/timp/Code/svtyper/svtyper", line 1070, in main args.debug) File "/home/timp/Code/svtyper/svtyper", line 767, in sv_genotype sample = Sample(bam_list[i], spl_bam_list[i], num_samp) File "/home/timp/Code/svtyper/svtyper", line 734, in init self.lib_dict[name].calc_insert_hist() File "/home/timp/Code/svtyper/svtyper", line 689, in calc_insert_hist med = median(valueCounts) File "/home/timp/Code/svtyper/svtyper", line 567, in median v = valueList[i] IndexError: list index out of range

Code:

BWAMEM

~/Code/bwa/bwa mem -R "@RG\tID:${fastq}_BC${bc}\tSM:BC${bc}\tPL:ONT" -x ont2d /mithril/Data/NGS/Reference/human/hg19.fa ${fastq}_BC${bc}.fastq.gz >${outdir}/${prefix}.bwa.sam
samtools view -b -S ${outdir}/${prefix}.bwa.sam | samtools sort - ${outdir}/${prefix}.bwa
samtools index ${outdir}/${prefix}.bwa.bam

##From LUMPY readme                                                                                                                                                                                                                                                       

samtools view -b -F 1294 -S ${outdir}/${prefix}.bwa.sam >${outdir}/${prefix}.bwa.discordants.unsorted.bam

# Extract the split-read alignments                                                                                                                                                                                                                                       
samtools view -h -S ${outdir}/${prefix}.bwa.sam \
    | ~/Code/lumpy-sv/scripts/extractSplitReads_BwaMem -i stdin \
    | samtools view -Sb - \
    > ${outdir}/${prefix}.bwa.splitters.unsorted.bam

# Sort both alignments                                                                                                                                                                                                                                                    
samtools sort ${outdir}/${prefix}.bwa.discordants.unsorted.bam ${outdir}/${prefix}.bwa.discordants
samtools sort ${outdir}/${prefix}.bwa.splitters.unsorted.bam ${outdir}/${prefix}.bwa.splitters

samtools index ${outdir}/${prefix}.bwa.splitters.bam

~/Code/lumpy-sv/bin/lumpy \
    -mw 4 \
    -tt 0 \
    -sr id:${prefix}.bwa,bam_file:${outdir}/${prefix}.bwa.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > ${outdir}/${prefix}.bwa.vcf

cat ${outdir}/${prefix}.bwa.vcf | \
    ~/Code/svtyper/svtyper \
    -B ${outdir}/${prefix}.bwa.bam \
    -S ${outdir}/${prefix}.bwa.splitters.bam \
    > ${outdir}/${prefix}.bwa.gt.vcf
cc2qe commented 9 years ago

Ok, this may be a larger problem. Are you reads single end or paired end? SVTyper needs paired reads to calculate an insert size distribution

timp0 commented 9 years ago

They are single end reads. I put (perhaps naively) the fully aligned bam into the -B, and the split reads (extracted with "extractSplitReads_BwaMem" from lumpy) into -S

What I'm trying to get out is the frequency of SV versus a background wild-type from a cancer sample - why do I need paired end? From a quick look at your code, it looks like you are calculating coverage in a region and determining number of breakpoints detected vs. depth of coverage?

cc2qe commented 9 years ago

Right, that's an accurate description of the tool, but unfortunately SVTyper will not work for your data without modifications. It expects paired-end reads because it was designed for Illumina short data, but you could probably fork it to operate on single-ends alone by removing the calls to calc_insert_hist() and count_pairedend()

timp0 commented 9 years ago

I actually plan to just be simpler than that for now - and use bedtools just to estimate coverage in the region - I'm using amplicons so I know where I'm looking.

That said, it may be an advantage given the recent work on long single read stuff (PacBio, ONT) to have an option or the ability to deal with this in SVtyper? Especially as long read offers many advantages for SV identification/typing.

W

On Tue, Jun 9, 2015 at 3:34 PM, Colby Chiang notifications@github.com wrote:

Right, that's an accurate description of the tool, but unfortunately SVTyper will not work for your data without modifications. It expects paired-end reads because it was designed for Illumina short data, but you could probably fork it to operate on single-ends alone by removing the calls to calc_insert_hist() and count_pairedend()

— Reply to this email directly or view it on GitHub https://github.com/cc2qe/svtyper/issues/11#issuecomment-110478621.

Winston Timp (410)-417-8467 Assistant Professor of Biomedical Engineering Johns Hopkins University

osowiecki commented 4 years ago

Ok, this may be a larger problem. Are you reads single end or paired end? SVTyper needs paired reads to calculate an insert size distribution

Same error [IndexError: list index out of range] with paired-end data. I'm trying to annotate structural variants vcf from "smrtsv2" (PacBio) with illumina reads for same sample aligned with BWA-mem.

Second sample also throws : ValueError: invalid literal for int() with base 10: '\xf0_\x05\x0cM\xb3\x02\x1c\xd9\x95\xc7'