jts / sga

de novo sequence assembler using string graphs
http://genome.cshlp.org/content/22/3/549

Error: Duplicate read ID - sga-bam2de.pl #136

Open a-lud opened 7 years ago

a-lud commented 7 years ago

Hi,

I'm trying to build scaffolds from three mate-pair libraries with 3 kb, 5 kb and 8 kb inserts (currently in BAM format). The person who generated these libraries followed every step of the example scripts you provide, up to the scaffolding stage.

When I run the sga-bam2de.pl script on the libraries, using the same settings as the "Scaffolding multiple libraries" wiki page, each of the three files produces an error like the one below; only the duplicate read ID differs.

abyss-fixmate -h KLS0691b.matepair.3kb.sorted.tmp.hist /localscratch/path/to/data/pe/KLS0691b.matepair.5kb.sorted.bam | samtools view -Sb - > KLS0691b.matepair.3kb.sorted.diffcontigs.bam
error: duplicate read ID `HWI-ST1408:124:CA3J7ANXX:4:1309:4959:8862/1'
[samopen] SAM header is present: 2455895 sequences.
[sam_read1] reference 'ID:bwa   PN:bwa  VN:0.7.13-r1126 CL:bwa mem -t 8 bwa_contigs1_index/index ../1_trimmed_AdapterRemoval/KLS0691b_5KB_GCCAAT_R1_t.fastq.gz ../1_trimmed_AdapterRemoval/KLS0691b_5KB_GCCAAT_R2_t.fastq.gz
contig-1172471  LN:289
@SQ     SN:contig-1223316       LN:242
@SQ     SN:contig-9458!' is recognized as '*'.
[main_samview] truncated file.
awk '$2 >= 3' KLS0691b.matepair.3kb.sorted.tmp.hist > KLS0691b.matepair.3kb.sorted.hist
awk: cmd. line:1: fatal: cannot open file `KLS0691b.matepair.3kb.sorted.tmp.hist' for reading (No such file or directory)
samtools sort KLS0691b.matepair.3kb.sorted.diffcontigs.bam KLS0691b.matepair.3kb.sorted.diffcontigs.sorted
DistanceEst -s 200 --mind -99 -n 5 -k 99 -j 1 -o KLS0691b.matepair.3kb.sorted.de KLS0691b.matepair.3kb.sorted.hist -l 100 KLS0691b.matepair.3kb.sorted.diffcontigs.sorted.bam
error: the histogram `KLS0691b.matepair.3kb.sorted.hist' is empty

The duplicate read ID seems to be what triggers the error, but I am unsure how to solve it. Any help or insight would be appreciated.

Cheers

brisk022 commented 7 years ago

I had the same problem. I fixed it by removing all secondary and supplementary alignments from the BAM files, e.g. samtools view -b -F 0x800 -o filtered.bam input.bam (or -F 0x100 if you used the -M flag when aligning with bwa mem, since -M marks shorter split hits as secondary instead of supplementary).
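For anyone who wants to see exactly what that filter removes, here is a minimal sketch that does the same thing on SAM text using only awk. It assumes tab-separated SAM records on stdin with the FLAG in column 2; drop_non_primary is a made-up helper name, not part of samtools or ABySS. It drops records with either the 0x100 (secondary) or the 0x800 (supplementary) bit set, which should match what -F 0x900 does in samtools view.

```shell
# Hypothetical helper: keep SAM header lines and primary records only.
# Reads SAM text on stdin, writes filtered SAM text to stdout.
drop_non_primary() {
    awk -F '\t' '
        /^@/ { print; next }                        # header lines pass through
        {
            flag = $2
            secondary     = int(flag / 256)  % 2    # FLAG bit 0x100
            supplementary = int(flag / 2048) % 2    # FLAG bit 0x800
            if (!secondary && !supplementary) print
        }
    '
}
```

With a recent samtools this could be used as samtools view -h input.bam | drop_non_primary | samtools view -b -o filtered.bam - although the plain samtools -F filter is simpler and faster on real data.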

I didn't dig too deeply, but the problem seems to be this: when only one read of a pair has a secondary or supplementary alignment, abyss-fixmate treats that extra record as another primary alignment and reports the read ID as a duplicate.
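To check whether this is what is happening in a given file, a rough sketch like the following lists every read that appears in more than one record for the same mate. It is only a diagnostic built on the guess above, not a reimplementation of abyss-fixmate's logic; list_multi_record_reads is a made-up name, and the "/1" and "/2" suffixes are derived from FLAG bits 0x40 and 0x80 to resemble the ID in the error message.

```shell
# Hypothetical diagnostic: print "readID/mate<TAB>count" for every read
# that has more than one record for the same mate in SAM text on stdin.
list_multi_record_reads() {
    awk -F '\t' '
        /^@/ { next }                               # skip header lines
        {
            # mate 1 if FLAG bit 0x40 is set, mate 2 if 0x80 is set, else 0
            mate = (int($2 / 64) % 2) ? 1 : ((int($2 / 128) % 2) ? 2 : 0)
            count[$1 "/" mate]++
        }
        END {
            for (id in count)
                if (count[id] > 1) print id "\t" count[id]
        }
    ' | sort
}
```

If the output is empty after filtering out secondary and supplementary alignments but not before, those extra records are the likely cause.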

You can try reporting it at https://github.com/bcgsc/abyss

a-lud commented 7 years ago

I came across a similar solution in this Google Groups thread. The secondary/supplementary alignments were indeed causing the problem.