sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

A bug in Ray v2.0.0-rc7 creates duplicated contigs in rare cases #55

Closed sebhtml closed 12 years ago

sebhtml commented 12 years ago

The problem is that, in some cases, the same sequence is duplicated in more than one contig.

Running Ray on the problematic datasets with -debug-fusions shows that the problem occurs once redundant graph paths are available.

PSL output:

37545 0 0 0 0 0 0 0 - contig-1000002 70693 33148 70693 contig-6000017 332701 295156 332701 1 37545, 0, 295156,

The alignment:

contig-1000002 70693-33149 length=70693

contig-6000017.fasta 295157-332701 length=332701

```
<------------------------------------------------
------------------------------------------------------------->
```

So the two contigs share 37545 bases, and a stretch that long is unlikely to be perfectly maintained in a genome.

The problem seems to be that the offsets on paths are sometimes not correctly defined.

From the Ray output:

test8-b0ec943bad120ee539d49934f471a8a1f3265f00.1.02:JoinerWorker hit selfPath= 1000002 selfStrand=1 selfLength= 70663 MinSelf=0 MaxSelf=66208 Path=6000017 matches= 37564 length= 332671 minPosition= 295156 maxPosition= 325313

Because of that, these paths are not grouped together...
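To make the numbers in the PSL record easier to read, here is a minimal C++ sketch that parses the first 17 columns of a PSL line and checks whether the hit is an exact, gap-free duplicate. It assumes the standard PSL column order. The names `PslRecord`, `parsePslLine`, `querySpan` and `isExactDuplicate` are hypothetical helpers written for this illustration and are not part of Ray.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// First 17 columns of a PSL record; the block columns are ignored here.
struct PslRecord {
    long matches = 0, misMatches = 0, repMatches = 0, nCount = 0;
    long qNumInsert = 0, qBaseInsert = 0, tNumInsert = 0, tBaseInsert = 0;
    std::string strand;
    std::string qName;
    long qSize = 0, qStart = 0, qEnd = 0;
    std::string tName;
    long tSize = 0, tStart = 0, tEnd = 0;
};

// Parses a whitespace-separated PSL line. Returns false on malformed input.
bool parsePslLine(const std::string& line, PslRecord& r) {
    std::istringstream in(line);
    in >> r.matches >> r.misMatches >> r.repMatches >> r.nCount
       >> r.qNumInsert >> r.qBaseInsert >> r.tNumInsert >> r.tBaseInsert
       >> r.strand
       >> r.qName >> r.qSize >> r.qStart >> r.qEnd
       >> r.tName >> r.tSize >> r.tStart >> r.tEnd;
    return !in.fail();
}

// Length of the aligned region on the query (PSL starts are 0-based,
// ends are exclusive).
long querySpan(const PslRecord& r) {
    assert(r.qEnd >= r.qStart);
    return r.qEnd - r.qStart;
}

// True when the whole aligned region matches with no mismatch, no N and no gap.
bool isExactDuplicate(const PslRecord& r) {
    return r.misMatches == 0 && r.nCount == 0
        && r.qNumInsert == 0 && r.tNumInsert == 0
        && r.matches + r.repMatches == querySpan(r);
}
```

Applied to the record above, the query span is 70693 - 33148 = 37545 and the target span is 332701 - 295156 = 37545, both equal to the match count, so the hit is a perfect duplicate that runs to the end of both contigs.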

Reported-by: Mitchell Stanton-Cook m.stantoncook@gmail.com

Reported-by: Maxime Déraspe maximilien1er@GMAIL.COM

sebhtml commented 12 years ago

The solution still does not store all probed pairs, because that would be awful. Instead, I added a state that holds, for each peer path, the number of matches and the last probed position. The Ray council of wise algorithmicians was consulted to devise a robust solution.
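The idea of keeping a small summary per peer path, rather than every probed pair, can be sketched roughly as follows. This is only an illustration under assumptions: `PeerPathState` and `PeerPathTracker` are hypothetical names, the update rule is the simplest one possible, and the actual change is the one in the commit referenced in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical per-peer summary: two integers instead of a list of probed pairs.
struct PeerPathState {
    int matches = 0;              // number of probes that hit this peer path
    int lastProbedPosition = -1;  // position on the peer path of the latest hit
};

class PeerPathTracker {
public:
    // Records one hit on a peer path at the given position.
    void recordHit(std::uint64_t peerPath, int positionOnPeer) {
        assert(positionOnPeer >= 0);
        PeerPathState& state = m_states[peerPath];  // created on first hit
        state.matches++;
        state.lastProbedPosition = positionOnPeer;
    }

    // Number of hits seen so far for a peer path (0 if never seen).
    int matches(std::uint64_t peerPath) const {
        std::map<std::uint64_t, PeerPathState>::const_iterator it = m_states.find(peerPath);
        return it == m_states.end() ? 0 : it->second.matches;
    }

    // Last probed position for a peer path (-1 if never seen).
    int lastProbedPosition(std::uint64_t peerPath) const {
        std::map<std::uint64_t, PeerPathState>::const_iterator it = m_states.find(peerPath);
        return it == m_states.end() ? -1 : it->second.lastProbedPosition;
    }

    // Memory grows with the number of distinct peer paths, not with probes.
    std::size_t trackedPeers() const { return m_states.size(); }

private:
    std::map<std::uint64_t, PeerPathState> m_states;
};
```

With this layout, probing a path against peer 6000017 thousands of times still costs a single map entry, which is the point of not storing all probed pairs.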

sebhtml commented 12 years ago

7c52141ec786b26adb2eadde340bb84e4a21a973