maickrau / GraphAligner

MIT License
256 stars 30 forks source link

src/ReadCorrection.cpp:44: Assertion 'i == 0 || corrections[i].startIndex >= corrections[i-1].startIndex' failed. Read: read-name. Seed: 0+,0,0,0 #42

Closed jelber2 closed 3 years ago

jelber2 commented 3 years ago

Hi,

I was trying GraphAligner to error correct some PacBio CLR reads using a non-standard graph that I made with prophasm using it's simplitigs concept (https://github.com/prophyle/prophasm) instead of using bcalm2 (just to try how simplitigs might compare to unitigs). I know that prophasm was not designed for this, but I wanted to try it with -k31 and Illumina paired-end reads that I concantenated into stdin and read into prophasm.

Prophasm command (I know that it is not taking advantage of paired-end reads here)

seqtk -A <(zcat R1.fastq.gz R2.fastq.gz) | \ # use seqtk (https://github.com/lh3/seqtk) to convert gunzipped concatenated fastqs to fasta
prophasm -k 31 -i - -o simplitigs.fa # use prophasm to make simplitigs with k=31 on stdin

With GraphAligner a lot of reads are discarded according to the size of subreads.corr.fasta and stderr. prophasm will generate just a FASTA file, so I merely converted the FASTA to GFA as only GFA "S Lines".

seqtk seq -l 0 simplitigs.fa | \ # remove line wrapping of FASTA sequences
perl -pe "s/>//g" |\ # get rid of ">" in FASTA header
paste - - | \ # put FASTA header and FASTA sequence on the same line
awk '{print "S",$1,$2;}' OFS='\t'| \ # make into a proper GFA "S Line" format
sed '1i H\tVN:Z:1.0' > simplitigs.gfa # add GFA header

GraphAligner Command

GraphAligner -t 24 -g simplitigs.gfa \
-f subreads.fasta \
--corrected-out subreads.corr.fasta \
 --preset dbg > graphaligner.stdout 2>graphaligner.stderr

graphaligner.stdout

GraphAligner bioconda 1.0.13-
Load graph from simplitigs.gfa
Build alignment graph
Build minimizer seeder from the graph
Minimizer seeds, length 19, window size 30, density 5
Seed cluster size 1
Extend up to best 0.002 fraction of seeds
Alignment bandwidth 5, tangle effort 10000
Clip alignment ends with identity < 66%
X-drop DP score cutoff 14705
write corrected reads to subreads.corr.fasta
Align
Alignment finished
Input reads: 1773715 (24832304602bp)
Seeds found: 6121046451
Seeds extended: 58322419
Reads with a seed: 1758516 (24828679773bp)
Reads with an alignment: 1758274 (19447754506bp)
Alignments: 36406405 (33271777321bp) (21916285 additional alignments discarded)
End-to-end alignments: 27894 (51991976bp)
Alignment broke with some reads. Look at stderr output.

graphaligner.stderr

GraphAligner bioconda 1.0.13-
src/ReadCorrection.cpp:44: Assertion 'i == 0 || corrections[i].startIndex >= corrections[i-1].startIndex' failed. Read: SRR7637702.23 23/1. Seed: 0+,0,0,0
previous line with different read number repeated about 1 miilion times

My question is if there is anyway to recover the reads using some setting in GraphAligner while it is is running other than parsing stderr

jelber2 commented 3 years ago

I just wanted to add that it does not take too terribly long to just parse the stderr

Commands to add missing "uncorrected" reads after running GraphAligner

grep "Seed" graphaligner.stderr|perl -pe "s/.+(SRR\S+ \S+)\. Seed:.+/\1/" > uncorrected-reads

# filterbyname.sh from BBMAP 38.90 https://sourceforge.net/projects/bbmap/
filterbyname.sh in=subreads.fasta names=uncorrected-reads include=t out=subreads.corr2.fasta ow=t > filterbyname.log 2>&1
cat subreads.corr.fasta subreads.corr2.fasta > subreads.corr3.fasta
maickrau commented 3 years ago

This crashed because of a bug in read correction introduced in v1.0.13. Commit 02c8e26 should fix this.

jelber2 commented 3 years ago

Ok it worked now.