sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Duplicated contigs #153

Closed fredericraymond closed 11 years ago

fredericraymond commented 11 years ago

Ran SilverRay with k=61. Obtained one big contig that was duplicated. I don't want duplication of contigs.

For example :

contig-1332000049 contig-1285000051 100.00 130591 1 0 116521 247111 137960 7370 0.0e+00 255155.0
contig-1332000049 contig-1285000051 100.00 115293 1 0 1 115293 254480 139188 0.0e+00 225326.0

This run is found here : /rap/nne-790-ab/projects/Project_CQDM2/CQDM_Run1/Sample_CQDM2-3-61-SilverRay-2013-02-04

sebhtml commented 11 years ago

see #141

sebhtml commented 11 years ago

/home/boisver1/issue-153

sebhtml commented 11 years ago

use blat 34 and

/rap/nne-790-ab/software/blatAligner/LastBuild/blat Contigs.fasta Contigs.fasta self.psl -fastMap

~/git-clones/Ray-TestSuite/scripts/dumpPsl.py

sebhtml commented 11 years ago

filtered PSL:

[boisver1@cp2567 Sample_CQDM2-3-1]$ ~/git-clones/Ray-TestSuite/scripts/dumpPsl.py
33838 0 0 0 0 0 0 0 - contig-1000013 58026 0 33838 contig-4 125302 91454 125292 1 33838, 24188, 91454,
132248 2 0 0 8 8 8 8 + contig-2000013 207507 19862 152120 contig-19 139462 0 132258 9 275,38,923,36,181,43,130591,26,137, 19862,20138,20177,21101,21138,21320,21364,151956,151983, 0,276,315,1239,1276,1458,1502,132094,132121,
132248 2 0 0 8 8 8 8 + contig-19 139462 0 132258 contig-2000013 207507 19862 152120 9 275,38,923,36,181,43,130591,26,137, 0,276,315,1239,1276,1458,1502,132094,132121, 19862,20138,20177,21101,21138,21320,21364,151956,151983,
75123 29 0 0 6 16 6 16 - contig-1000020 106983 31815 106983 contig-21 106317 0 75168 7 74274,26,365,251,50,25,161, 0,74275,74302,74668,74920,74971,75007, 0,74275,74302,74668,74920,74971,75007,
75123 29 0 0 6 16 6 16 - contig-21 106317 0 75168 contig-1000020 106983 31815 106983 7 161,25,50,251,365,26,74274, 31149,31321,31347,31398,31650,32016,32043, 31815,31987,32013,32064,32316,32682,32709,
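For readers unfamiliar with the PSL layout, here is a minimal sketch of how one row of such a self-alignment could be parsed and filtered. It is not the actual dumpPsl.py, whose contents are not shown in this thread; the names `PslRow`, `parse_psl_line`, `is_suspicious` and the `min_matches` cutoff are hypothetical and chosen only for illustration. It assumes the standard 21-column PSL format.

```python
from typing import List, NamedTuple


class PslRow(NamedTuple):
    matches: int
    mismatches: int
    strand: str
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    t_name: str
    t_size: int
    t_start: int
    t_end: int
    block_sizes: List[int]
    q_starts: List[int]
    t_starts: List[int]


def parse_int_list(field: str) -> List[int]:
    # PSL list columns end with a trailing comma, e.g. "275,38,923,"
    return [int(x) for x in field.split(",") if x]


def parse_psl_line(line: str) -> PslRow:
    f = line.split()
    if len(f) != 21:
        raise ValueError("expected 21 PSL columns, got %d" % len(f))
    return PslRow(
        matches=int(f[0]), mismatches=int(f[1]), strand=f[8],
        q_name=f[9], q_size=int(f[10]), q_start=int(f[11]), q_end=int(f[12]),
        t_name=f[13], t_size=int(f[14]), t_start=int(f[15]), t_end=int(f[16]),
        block_sizes=parse_int_list(f[18]),
        q_starts=parse_int_list(f[19]),
        t_starts=parse_int_list(f[20]),
    )


def is_suspicious(row: PslRow, min_matches: int = 20000) -> bool:
    # Hypothetical filter: a long alignment between two *different* contigs.
    return row.q_name != row.t_name and row.matches >= min_matches


row = parse_psl_line(
    "33838 0 0 0 0 0 0 0 - contig-1000013 58026 0 33838 "
    "contig-4 125302 91454 125292 1 33838, 24188, 91454,"
)
print(row.q_name, row.t_name, row.matches, is_suspicious(row))
```

Dropping rows where the query and target names are equal matters here because aligning Contigs.fasta against itself always reports each contig matching itself.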

sebhtml commented 11 years ago

probably a duplicate of #62

sebhtml commented 11 years ago

4 odd relations:

contig-1000013 (58026) & contig-4 (125302) 33838--1 & 91455--125292 33838/33838

contig-1000013 (58026) & contig-2 (52572) 33140--58026 & 27686--52572 24887/24887

contig-2000013 (207507) & contig-19 (139462) 21365--151955 & 1503--132093 130590/130591

contig-1000020 (106983) & contig-21 () 106983--32710 & 1--74274 74270/74274
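The coordinates in these relations appear to come from the largest block of each PSL row, converted to 1-based inclusive positions. The sketch below shows that conversion under the standard PSL conventions (0-based block starts, and query starts given on the reverse-complemented query for the `-` strand). The function name `block_to_one_based` is an assumption made for illustration, not something taken from Ray or dumpPsl.py.

```python
def block_to_one_based(strand, q_size, block_size, q_start, t_start):
    """Convert one PSL block to 1-based inclusive (first, last) pairs.

    For a '-' strand hit the query pair is returned high-to-low,
    which matches the '33838--1' style used in the comment above.
    """
    t_first = t_start + 1
    t_last = t_start + block_size
    if strand == "+":
        q_first = q_start + 1
        q_last = q_start + block_size
    else:
        # Map reverse-complement coordinates back to the forward strand.
        q_first = q_size - q_start
        q_last = q_size - (q_start + block_size) + 1
    return (q_first, q_last), (t_first, t_last)


# contig-1000013 vs contig-4, single block of 33838 on the '-' strand
print(block_to_one_based("-", 58026, 33838, 24188, 91454))
# contig-2000013 vs contig-19, the 130591-long block on the '+' strand
print(block_to_one_based("+", 207507, 130591, 21364, 1502))
# contig-1000020 vs contig-21, the 74274-long block on the '-' strand
print(block_to_one_based("-", 106983, 74274, 0, 0))
```

With the values from the filtered PSL, these three calls reproduce 33838--1 & 91455--125292, 21365--151955 & 1503--132093, and 106983--32710 & 1--74274.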

sebhtml commented 11 years ago

needs 2 features in Ray Cloud Browser to understand this faster:

1- link to a location
2- multi-path

sebhtml commented 11 years ago

/home/boisver1/issue-153

sebhtml commented 11 years ago

/rap/nne-790-ab/projects/Ray-Cloud-Browser/issue-153

+++ /rap/nne-790-ab/projects/Ray-Cloud-Browser/issue-153/Sample_CQDM2-3-2013-02-19-1

sebhtml commented 11 years ago

http://browser.cloud.raytrek.com/client/

$ python2.7 ~/git-clones/Ray-TestSuite/scripts/dumpPsl.py
132248 2 0 0 8 8 8 8 + contig-5 207507 19862 152120 contig-51 139462 0 132258 9 275,38,923,36,181,43,130591,26,137, 19862,20138,20177,21101,21138,21320,21364,151956,151983, 0,276,315,1239,1276,1458,1502,132094,132121,
28384 0 0 0 0 0 0 0 - contig-43 52572 0 28384 contig-1000096 123145 94761 123145 1 28384, 24188, 94761,
132248 2 0 0 8 8 8 8 + contig-51 139462 0 132258 contig-5 207507 19862 152120 9 275,38,923,36,181,43,130591,26,137, 0,276,315,1239,1276,1458,1502,132094,132121, 19862,20138,20177,21101,21138,21320,21364,151956,151983,
33848 0 0 0 0 0 0 0 - contig-54 58036 24188 58036 contig-105 125302 0 33848 1 33848, 0, 0,
92022 0 0 0 0 0 0 0 - contig-1000096 123145 3438 95460 contig-105 125302 33149 125171 1 92022, 27685, 33149,
92022 0 0 0 0 0 0 0 - contig-105 125302 33149 125171 contig-1000096 123145 3438 95460 1 92022, 131, 3438,

sebhtml commented 11 years ago

Lengths are in k-mers; positions are 1-based.

contig-51 139402
contig-5 207447
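The k-mer lengths follow from the nucleotide lengths reported by blat, since a sequence of n nucleotides contains n - k + 1 k-mers. A small sketch with k = 61 as used in this run (the helper name `kmer_count` is only illustrative):

```python
def kmer_count(nucleotides: int, k: int) -> int:
    """Number of k-mers in a sequence of the given nucleotide length."""
    if nucleotides < k:
        return 0
    return nucleotides - k + 1


K = 61
print("contig-51", kmer_count(139462, K))  # 139402
print("contig-5", kmer_count(207507, K))   # 207447
```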

contig-5:19863 ....... contig-51:1
http://browser.cloud.raytrek.com/client/?map=3&section=0&region=9&location=0&zoom=1.2255452109421872

contig-5:151294 .......... contig-51:139402
http://browser.cloud.raytrek.com/client/?map=3&section=0&region=9&location=139401&zoom=1.2255452109421872

contig-51 is redundant because contig-5 includes it.

However, blat says 132248 / 139462 nucleotides match.

New link: http://genome.ulaval.ca:10111/client

So the obvious question is: where are the differences?

sebhtml commented 11 years ago

This bug is not really a bug. This happens because the sample is highly polymorphic.

However, it will be fixed regardless by lowering the required matches.

sebhtml commented 11 years ago

However, blat says 132248 / 139462 nucleotides match.

New link: http://genome.ulaval.ca:10111/client

So the obvious question is: where are the differences?

Auto-play link:

http://genome.ulaval.ca:10111/client/?map=0&section=0&region=1&location=16&play=forward&speed=8

Answer:

On http://genome.ulaval.ca:10111/client/

contig-5 is 207k
contig-51 is 139k

MUMmer analyses

/home/boiseb01/Hathor/data/Sample_CQDM2-3-2013-02-19-1 http://mummer.sourceforge.net/manual/#snpdetection

/home/boiseb01/Hathor/data/Sample_CQDM2-3-2013-02-19-1/5.fasta /home/boiseb01/Hathor/data/Sample_CQDM2-3-2013-02-19-1/51.fasta NUCMER

[P1]  [SUB]  [P2]      |   [BUFF]   [DIST]  |  [LEN R]  [LEN Q]  | [FRM]  [TAGS]

20138 A T 276 | 39 276 | 207507 139462 | 1 1 contig-5 contig-51
20177 T G 315 | 39 315 | 207507 139462 | 1 1 contig-5 contig-51
21009 T C 1147 | 92 1147 | 207507 139462 | 1 1 contig-5 contig-51
21101 A G 1239 | 37 1239 | 207507 139462 | 1 1 contig-5 contig-51
21138 T C 1276 | 37 1276 | 207507 139462 | 1 1 contig-5 contig-51
21320 T C 1458 | 44 1458 | 207507 139462 | 1 1 contig-5 contig-51
21364 C T 1502 | 44 1502 | 207507 139462 | 1 1 contig-5 contig-51
21462 T G 1600 | 98 1600 | 207507 139462 | 1 1 contig-5 contig-51
151956 A G 132094 | 27 7369 | 207507 139462 | 1 1 contig-5 contig-51
151983 G A 132121 | 27 7342 | 207507 139462 | 1 1 contig-5 contig-51

SNP @ contig-5:20138 and contig-51:276 http://genome.ulaval.ca:10111/client/?map=0&section=0&region=1&location=214&zoom=0.7833022538927872
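As a cross-check, the SNP positions reported by MUMmer can be compared with the blat alignment of contig-5 against contig-51. The sketch below assumes standard PSL semantics (0-based block starts) and uses a hypothetical helper, `uncovered_positions`, to list the 1-based contig-51 positions that fall between consecutive blat blocks.

```python
def uncovered_positions(block_starts, block_sizes):
    """1-based positions lying between consecutive alignment blocks."""
    positions = []
    for i in range(len(block_starts) - 1):
        # 0-based exclusive end == 1-based last covered position
        last_covered = block_starts[i] + block_sizes[i]
        # 0-based start of the next block; its 1-based first position is +1
        next_start = block_starts[i + 1]
        positions.extend(range(last_covered + 1, next_start + 1))
    return positions


# blat blocks on contig-51 (tStarts and blockSizes of the contig-5 row)
t_starts = [0, 276, 315, 1239, 1276, 1458, 1502, 132094, 132121]
sizes = [275, 38, 923, 36, 181, 43, 130591, 26, 137]

# contig-51 positions ([P2]) from the MUMmer SNP report
snps = [276, 315, 1147, 1239, 1276, 1458, 1502, 1600, 132094, 132121]

gaps = uncovered_positions(t_starts, sizes)
inside_blocks = sorted(set(snps) - set(gaps))
print("between blocks:", gaps)
print("inside blocks:", inside_blocks)
```

Eight of the ten SNPs sit exactly on the one-base gaps between blat blocks, and the remaining two (1147 and 1600) fall inside blocks, which agrees with the misMatches value of 2 in the PSL row.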

sebhtml commented 11 years ago

run a job with -debug-fusions 10244720 10244722

sebhtml commented 11 years ago

The 139k contig belongs to Rank 51.

Sample_CQDM2-3-2013-03-14-2.1.051:FusionWorker worker 0 path 51 strand= 0 is Done, analyzed 139402 position length is 139402

FusionWorker path 1000005 matches= 132171 length= 207447

In code/plugin_FusionTaskCreator/FusionWorker.cpp, a maximum of 1024 kmers can be lost.

Here, about 7k are not matching.
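The arithmetic behind "about 7k" can be spelled out as follows. This is only a sketch of the comparison as described in this comment; the real check in FusionWorker.cpp is not reproduced here, and the names `MAX_LOST_KMERS`, `lost_kmers` and `within_tolerance` are assumptions made for illustration.

```python
MAX_LOST_KMERS = 1024  # tolerance quoted in the comment above


def lost_kmers(path_length: int, matches: int) -> int:
    """K-mers of the path that were not found in the other path."""
    return path_length - matches


def within_tolerance(path_length: int, matches: int,
                     max_lost: int = MAX_LOST_KMERS) -> bool:
    # Assumed form of the test: the path counts as redundant only if
    # at most max_lost of its k-mers are missing from the other path.
    return lost_kmers(path_length, matches) <= max_lost


# Values from the FusionWorker log lines for path 51
print(lost_kmers(139402, 132171))        # 7231
print(within_tolerance(139402, 132171))  # False
```

With 7231 k-mers unmatched against a tolerance of 1024, path 51 would not be flagged as redundant under this assumed test, which is consistent with both contigs being kept.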

But the matches start at contig-51:1 http://genome.ulaval.ca:10111/client/?map=0&section=0&region=1&location=0
They end at contig-51: http://genome.ulaval.ca:10111/client/?map=0&section=0&region=1&location=139401

So the roughly 7k k-mers that Ray reports as not matching lie between the start and the end.

MUMmer analysis with nucmer:

/home/boiseb01/Hathor/data/Sample_CQDM2-3-2013-02-19-1/5.fasta /home/boiseb01/Hathor/data/Sample_CQDM2-3-2013-02-19-1/51.fasta NUCMER

{{{

[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]

19863 152120 | 1 132258 | 132258 132258 | 99.99 | 207507 139462 | 63.74 94.83 | contig-5 contig-51
150618 151354 | 138726 139462 | 737 737 | 98.78 | 207507 139462 | 0.36 0.53 | contig-5 contig-51

}}}
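The coverage columns and the size of the unaligned stretch in contig-51 can be recomputed from the two alignments above. This is a small sketch; `coverage_percent` and `gap_between` are illustrative helper names, and coverage is taken to be aligned length divided by sequence length.

```python
def coverage_percent(aligned_length: int, sequence_length: int) -> float:
    """Aligned fraction of a sequence, as a percentage with 2 decimals."""
    return round(100.0 * aligned_length / sequence_length, 2)


def gap_between(end_of_first: int, start_of_second: int) -> int:
    """Number of positions strictly between two 1-based alignments."""
    return start_of_second - end_of_first - 1


# First alignment: 132258 nucleotides on both contigs
print(coverage_percent(132258, 207507))  # contig-5, [COV R]
print(coverage_percent(132258, 139462))  # contig-51, [COV Q]

# contig-51 positions between the two alignments (132258 and 138726)
print(gap_between(132258, 138726))
```

Under these definitions the unaligned stretch of contig-51 is 6467 nucleotides, the same order of magnitude as the roughly 7k k-mers that Ray reports as not matching.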

At contig-51:132258 http://genome.ulaval.ca:10111/client/?map=0&section=0&region=1&location=132207&zoom=0.5347674032378408

At contig-51:138726 http://genome.ulaval.ca:10111/client/?map=0&section=0&region=1&location=138829&zoom=0.7486437028696127

How is that even possible ?

It is probably due to seeds that should not exist.

sebhtml commented 11 years ago

Will fix #136, then check if this issue gets fixed.

sebhtml commented 11 years ago

From the first message:

contig-1332000049 contig-1285000051 100.00 130591 1 0 116521 247111 137960 7370 0.0e+00 255155.0
contig-1332000049 contig-1285000051 100.00 115293 1 0 1 115293 254480 139188 0.0e+00 225326.0

/rap/nne-790-ab/projects/Project_CQDM2/CQDM_Run1/Sample_CQDM2-3-61-SilverRay-2013-02-04

contig-1332000049 has 302666 nucleotides
contig-1285000051 has 254480 nucleotides

with http://mummer.sourceforge.net/manual/#aligningdraft

/home/boiseb01/issue-153/contig-1285000051.fasta /home/boiseb01/issue-153/contig-1332000049.fasta
NUCMER

    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
    7205   254480  |   247276        1  |   247276   247276  |    99.99  |   254480   302666  |    97.17    81.70  | contig-1285000051  contig-1332000049
sebhtml commented 11 years ago

Trying to reproduce the problem: mp2 /netmount/ip03_home/boisver1/issue-153

Ray fbc3a9c859c72340df3d061de5f8e553326597c9
RayPlatform 7eece8a3cb2eb4e132f76f854f00d3f9640f12da

sebhtml commented 11 years ago

This duplication occurs during the extension of seeds, or before it.

contig-2000013 207507 => 207447 kmers

Rank 13 reached 207447 vertices from seed 91, flow 2

Spawned from:

Rank 13 starts on seed 91, length is 105, flow 0 [91/639]

contig-19 139462 => 139402 kmers

Rank 19 reached 139402 vertices from seed 1, flow 2

Spawned from:

Rank 19 starts on seed 1, length is 6007, flow 0 [1/676]

-k 61

The question is:

What is the seed mode coverage for these extensions?

sebhtml commented 11 years ago

ls30 /home/boiseb01/issue-153

sebhtml commented 11 years ago

/home/boisver1/issue-153/Sample_CQDM2-3-10/logs

contig-2000013 207507 19862 152120 contig-19 139462

=> 207447 objects
Rank 13 reached 207447 vertices from seed 91, flow 2
Rank 13 starts on seed 91, length is 105, flow 0 [91/639]

rank = 13 id = 91

Seed # 91000013

=> 139402 objects
Rank 19 starts on seed 1, length is 6007, flow 0 [1/676]
Rank 19 reached 139402 vertices from seed 1, flow 2

-k 61

sebhtml commented 11 years ago

Simply abort the whole thing if the resolution does not allow this to be done.

sebhtml commented 11 years ago

33838 0 0 0 0 0 0 0 - contig-1000013 58026 0 33838 contig-4 125302 91454 125292 1 33838, 24188, 91454,
75123 29 0 0 6 16 6 16 - contig-1000020 106983 31815 106983 contig-21 106317 0 75168 7 74274,26,365,251,50,25,161, 0,74275,74302,74668,74920,74971,75007, 0,74275,74302,74668,74920,74971,75007,
75123 29 0 0 6 16 6 16 - contig-21 106317 0 75168 contig-1000020 106983 31815 106983 7 161,25,50,251,365,26,74274, 31149,31321,31347,31398,31650,32016,32043, 31815,31987,32013,32064,32316,32682,32709,

Fixed.

7598d4b2998067aed82e30cc61ad7b530c928d0d