tecangenomics / nudup

NuDup -- Marks/removes duplicate molecules based on the molecular tagging technology used in Tecan products.
http://www.tecangenomics.com
GNU Lesser General Public License v3.0
14 stars 9 forks source link

Paired end deduping fails instantly #1

Closed mjafin closed 8 years ago

mjafin commented 8 years ago

Hi there, Thanks for releasing this code out there, much appreciated. The code works OK for single end reads but for paired end reads I get the following error right away:

2016-05-11 08:23:22,114 [     INFO] - Deduplicating NuGEN Ovation Target Enrichment paired end reads...
2016-05-11 08:23:22,114 [     INFO] - Processing sorted SAM/BAM with N6 sequence in qname (assumes sorted)
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
2016-05-11 08:23:22,330 [    ERROR] -

Usually once it's been through a few thousand reads.

In the output the last read is this

NB501188:16:H775JAFXX:4:21610:6568:17982:ACTCTT 369     chr1    1301249 0       71H31M41H       chr2    215386987       0
       TGGGGGTGTCTTTTTTTTTTTTTTTTTTTTT /////6//A/AEEE6EEEEEEEEEEEEEEEE NM:i:0  MD:Z:31 AS:i:31 XS:i:30 SA:Z:chr6,47380982,+,54M89S,13,2;chr5,20555409,-,24M3I26M90S,0,4;       XA:Z:chr18,-3448709,72S30M41S,0;chr1,-77922997,74S29M40S,0;chr12,+2673466,43S27M73S,0;  RG:Z:DNA02_S1_umi

In the input the read following the above is

NB501188:16:H775JAFXX:1:11104:20249:12726:TTCTCT        163     chr1    1310768 3       40S39M  =       1310768 39      CTGCAAGGATCGCTCTCTTCAGGGCAGGTGGGGTGAGAATTGGGCGGCTCTGGGTCACAGGTACGGAGGATGACGGCTG AAAAAEEEEEEE<AAEEEEEEEEEAEEEEEEEEEEEEEAEEEEEEEEEA<EEEEE<EE/EEEE/EEEE<E/E<EEE/EE NM:i:0  MD:Z:39 AS:i:39 XS:i:38 SA:Z:chr1,47219626,-,27S36M16S,3,0;     XA:Z:chr1,+1310821,41S38M,0;    RG:Z:DNA02_S1_umi

I don't know if there is anything special about these two reads

shuelga commented 8 years ago

Hello @mjafin, Can you please provide us with the command you used to generate the BAM file (ie: your alignment command)? Thanks!

mjafin commented 8 years ago

Here is the whole command:

bwa mem  -c 250 -M -t 8  -R '@RG\tID:DNA38_S4_umi\tPL:illumina\tPU:4_2016-05-10_bcbio_align\tSM:DNA38_S4_umi' -v 1 /ngs/reference_data/genomes/Hsapiens/hg38/bwa/hg38.fa /ngs/oncology/datasets/NextSeq500/TS_UK_0013_Nugen/Unalign/DNA38_S4_umi_R1_001.fastq.gz /ngs/oncology/datasets/NextSeq500/TS_UK_0013_Nugen/Unalign/DNA38_S4_umi_R2_001.fastq.gz  | /apps/bcbio-nextgen/0.9.7/rhel6-x64/anaconda/share/bwakit-0.7.12-0/k8 /apps/bcbio-nextgen/0.9.7/rhel6-x64/anaconda/share/bwakit-0.7.12-0/bwa-postalt.js -p /ngs/oncology/analysis/translation/TS_UK_0013_Nugen/bcbio_align/work/align/DNA38_S4_umi/hla/DNA38_S4_umi-sort.bam.hla /ngs/reference_data/genomes/Hsapiens/hg38/bwa/hg38.fa.alt | /apps/bcbio-nextgen/0.9.7/rhel6-x64/galaxy/../anaconda/bin/samtools sort -@ 8 -m 1G  -T /ngs/oncology/analysis/translation/TS_UK_0013_Nugen/bcbio_align/work/align/DNA38_S4_umi/tx/tmp7QZZvY/DNA38_S4_umi-sort-sorttmp -o /ngs/oncology/analysis/translation/TS_UK_0013_Nugen/bcbio_align/work/align/DNA38_S4_umi/tx/tmp7QZZvY/DNA38_S4_umi-sort.bam /dev/stdin

Note that I also tried hg19 so this is not a problem caused by bwakit/hg38

mjafin commented 8 years ago

I managed to rewrite the script using PySam (instead of calls to samtools via shell) so I'm all set for now but you may still want to look into this.

If you want I can make a pull request to deposit the pysam version in your repo.

shuelga commented 8 years ago

The main issue is that you are allowing multi-mapped reads. The tool does not allow for reads that are secondary alignments. You will see in the read you posted above that the flag is 369, which is a secondary alignment. You can either change your alignment command to have -c 1, I believe for unique alignments only, or filter your BAM to remove any secondary alignments. I'm curious what you changed though to make it work?