sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Overlapping contigs - ratio < 10% #62

Open zorino opened 12 years ago

zorino commented 12 years ago

When two contigs overlap and the ratio of the matching region is < 10%, Ray won't merge them.

Example:

                                      Overlap=9142/9142 (100%)

-----------------------------------------------------> contig-28 length= 197810
                                  <----------------------------------- contig-45 length= 94175
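
To put numbers on the example: a minimal sketch of the arithmetic, assuming the ratio is the overlap length divided by the contig length (the thread does not spell out the exact definition Ray uses). The names `overlapRatio` and `passesRatioRule`, and the choice of `>=` for the comparison, are hypothetical and not taken from Ray's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: fraction of a contig that is covered by the overlap.
double overlapRatio(std::uint64_t overlapLength, std::uint64_t contigLength) {
    assert(contigLength > 0);
    return static_cast<double>(overlapLength) / static_cast<double>(contigLength);
}

// Hypothetical predicate for a "10% rule": true when the overlap covers at
// least minimumRatio of the contig. Whether Ray uses > or >= is an assumption.
bool passesRatioRule(std::uint64_t overlapLength,
                     std::uint64_t contigLength,
                     double minimumRatio = 0.10) {
    return overlapRatio(overlapLength, contigLength) >= minimumRatio;
}
```

Under that assumption, 9142/94175 is about 9.7% and 9142/197810 is about 4.6%, so the overlap falls below 10% for both contigs.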

sebhtml commented 12 years ago

Hello Zorino,

The file ParallelPaths.txt contains all the paths that were computed in parallel.

Ray removes the redundancy by eliminating paths included in other longer paths.
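
As an illustration only, here is a minimal sketch of that kind of containment filter. It assumes paths are plain nucleotide strings and ignores reverse complements. Ray itself works on paths of k-mers in the graph, so `removeContainedPaths` is a hypothetical name and not Ray's implementation.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch: drop every path that is fully contained in a kept path
// of equal or greater length. Exact duplicates are also reduced to one copy.
std::vector<std::string> removeContainedPaths(std::vector<std::string> paths) {
    // Longest first, so a containing path is always seen before what it contains.
    std::stable_sort(paths.begin(), paths.end(),
                     [](const std::string& a, const std::string& b) {
                         return a.size() > b.size();
                     });

    std::vector<std::string> kept;
    for (const std::string& candidate : paths) {
        bool contained = false;
        for (const std::string& longer : kept) {
            if (longer.find(candidate) != std::string::npos) {
                contained = true;
                break;
            }
        }
        if (!contained) {
            kept.push_back(candidate);
        }
    }
    return kept;
}
```

Note that two paths that only partially overlap, such as "AAAACCCC" and "CCCCGGGG", both survive this filter, since neither one is included in the other.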

But sometimes paths overlap, because path traversability in a non-bidirectional de Bruijn subgraph is a non-symmetric property for the algorithms used by Ray. This means that sometimes you cannot cross region B starting from region A, but you can cross region B starting from region C (see drawing below).


region A region B region C

In those cases, there will be overlapping paths (or contigs). In numerous cases, the overlap is rather long, so the 10% rule is not a limiting factor.

But it seems that in your case it is.

If you are skilled in C++, the relevant code is in:

plugin: JoinerTaskCreator
class: JoinerWorker
file: code/plugin_JoinerTaskCreator/JoinerWorker.cpp
line: 411

Just lowering the threshold will solve the problem, but will probably not be safe.

In my opinion there should be additional testing to check if the overlap is a repeat.
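
One very rough way to picture such a test, as a sketch under stated assumptions: treat the overlap as suspicious if its sequence occurs more than once in either contig. This is only a crude proxy on plain strings. A real check would more likely look at coverage or branching in the graph, and the names `countOccurrences` and `overlapLooksRepeated` are hypothetical, not part of Ray.

```cpp
#include <cstddef>
#include <string>

// Count occurrences of needle in haystack, including overlapping ones.
std::size_t countOccurrences(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) {
        return 0;
    }
    std::size_t count = 0;
    for (std::size_t pos = haystack.find(needle);
         pos != std::string::npos;
         pos = haystack.find(needle, pos + 1)) {
        ++count;
    }
    return count;
}

// Hypothetical proxy: the overlap "looks repeated" if it appears more than
// once in either contig, in which case merging on it would be risky.
bool overlapLooksRepeated(const std::string& contigA,
                          const std::string& contigB,
                          const std::string& overlap) {
    return countOccurrences(contigA, overlap) > 1 ||
           countOccurrences(contigB, overlap) > 1;
}
```

A lowered threshold could then be combined with such a check, merging only when the overlap does not look repeated.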

Some documentation should you want to work on this:

Documentation/CodingStyle.txt
Documentation/Submit-a-patch.txt