sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net
Other
65 stars 12 forks source link

Missing region in assembly #101

Closed sebhtml closed 11 years ago

sebhtml commented 11 years ago

Path: /speedSAS/users/boiseb01/for-max

Problem: A region is in the graph, but does not appear in assembly k=31. It appears in assembly k=51.

Agenda:

Reproduce experiments at k=31 and k=51 with latest build

sebhtml commented 11 years ago

make HAVE_LIBZ=y HAVE_LIBBZ2=y -j 10 ASSERT=y PREFIX=/speedSAS/users/boiseb01/for-max/Ray-build

ray a248ac9b572eaecc3f56a5155f50bb4bad62b25f RayPlatform ebfde10907d7cce48ed466dc44192e0bd2a281c4

sample is /rap/nne-790-ab/projects/Paul-H-Roy-pseudomonases/Project_PHR/Sample_Paeru2

The missing region has reads (bwa + samtools) The missing region is not missing at k=51

ref. is pao303

On Colosse: /rap/nne-790-ab/projects/Paul-H-Roy-pseudomonases/Project_PHR/MissingRegion

Working directory (ls28): /speedSAS/users/boiseb01/for-max/seb-work

sebhtml commented 11 years ago

Also a test with this patch:

diff --git a/code/plugin_SeedingData/SeedingData.cpp b/code/plugin_SeedingData/SeedingData.cpp index 79e9b04..74b071c 100644 --- a/code/plugin_SeedingData/SeedingData.cpp +++ b/code/plugin_SeedingData/SeedingData.cpp @@ -135,7 +135,7 @@ void SeedingData::call_RAY_SLAVE_MODE_START_SEEDING(){ Kmer lastVertex=seed[seed.size()-1]; Kmer firstReverse=m_parameters->_complementVertex(&lastVertex);

sebhtml commented 11 years ago
diff --git a/code/plugin_SeedingData/SeedingData.cpp b/code/plugin_SeedingData/SeedingData.cpp
index 79e9b04..74b071c 100644
--- a/code/plugin_SeedingData/SeedingData.cpp
+++ b/code/plugin_SeedingData/SeedingData.cpp
@@ -135,7 +135,7 @@ void SeedingData::call_RAY_SLAVE_MODE_START_SEEDING(){
                Kmer lastVertex=seed[seed.size()-1];
                Kmer firstReverse=m_parameters->_complementVertex(&lastVertex);

-               if(firstVertex            
sebhtml commented 11 years ago

3 tests started:

[boiseb01@ls28 seb-work]$ bash Tests.sh Will use 24 cores Wed Nov 14 10:42:31 EST 2012 starting ray-k31-seb.1

sebhtml commented 11 years ago

In this runs done by Maxime Déraspe (zorino):

Path: /speedSAS/users/boiseb01/for-max/Ray-kmer31/Ray-Cronus-2012-11-06.1

/software/blat/bin/x86_64/blat Contigs.fasta ../Parts/gap2.fasta Hits-for-gap2-in-contigs.psl Best hits: 538 502 467 460 435

/software/blat/bin/x86_64/blat All-Seeds.fasta ../Parts/gap2.fasta Hits-for-gap2-in-seeds.psl Best hits: 3603 2611 2385 2365 2068

Path: /speedSAS/users/boiseb01/for-max/Ray-kmer51/Ray-Cronus-2012-11-08.1

/software/blat/bin/x86_64/blat Contigs.fasta ../Parts/gap2.fasta Hits-for-gap2-in-contigs.psl

Best hit: 35296 0 0 0 0 0 1 1 + NC_002516.2 36001 0 35296 contig-7000004 137649 102352 137649

/software/blat/bin/x86_64/blat All-Seeds.fasta ../Parts/gap2.fasta Hits-for-gap2-in-seeds.psl Best hits:

3623 2068 2068 1480 1207

Finally, as expected, the patched Ray produces more seeds:

-rw-rw-r--. 1 boiseb01 boiseb01 13M Nov 14 11:08 All-Seeds.fasta

As Maxime Déraspe said (I reviewed his tests and confirm), the patch changes nothing:

Path: /speedSAS/users/boiseb01/for-max/Ray-kmer31-patched/Ray-Cronus-2012-11-13

/software/blat/bin/x86_64/blat Contigs.fasta ../Parts/gap2.fasta Hits-for-gap2-in-contigs.psl 699 502 467 460 435

/software/blat/bin/x86_64/blat All-Seeds.fasta ../Parts/gap2.fasta Hits-for-gap2-in-seeds.psl 3603 3603 2611 2611 2385

sebhtml commented 11 years ago

So it seems that the seeds get discarded during the extension. Otherwise, there would be contigs at least the size of the seeds in which we can see parts of the missing region.

sebhtml commented 11 years ago

https://github.com/sebhtml/ray/issues/101

Ah !

I found your logs (thanks by the way for all these nicely done tests!).

I don't need to run my own tests.

I confirmed all your claims with blat, but it seems to me that your missing region is "seeded" correctly in all 3 tests (k31, k51, k31+patch).

The problem is therefore not with seeds at all (we had concluded that together Friday I think).

My next stop is to check your work with kmers.txt

In /speedSAS/users/boiseb01/for-max/Ray-kmer31/Ray-Cronus-2012-11-06.1

kmer-map-gap-real.fa.txt is the latest, but columns are 0 and it's for k=51 in the k=31 run (?)

You should use ../../Ray-kmer51/Ray-Cronus-2012-11-08.1/kmers.txt for k=51, right ?

kmer-map-gap2.txt

=> Looks nice. But I think I will be easier to see in HTML5.

But I'll fix this bug without the HTML5 thing because the HTML5 viewer is not ready.

So we should find an answer to this question:

For Maxime Déraspe's k=31 assembly of Sample_Paeru2, the missing region has several good hits in the seeds (with matches 3603, 2611, 2385, 2365, 2068). Yet, these seeds don't appear in the assembly.

In particular, let's focus on this sequence:

[boiseb01@ls28 Ray-Cronus-2012-11-06.1]$ ls -lh RaySeed-7000015.fasta -rw-rw-r--. 1 boiseb01 boiseb01 3.6K Nov 14 11:28 RaySeed-7000015.fasta

/software/blat/bin/x86_64/blat Contigs.fasta RaySeed-7000015.fasta RaySeed-7000015.fasta-in-Contigs.psl Loaded 6198741 letters in 99 sequences Searched 3603 bases in 1 sequences

So the question is:

Why is the sequence (k-mers) of seed RaySeed-7000015 not in the contigs ?

If Seed-7000015 is well put, then there is not reason for it to be skipped.

There is probably something special about its topology.

I'll put together a kmers.txt-RaySeed-7000015 that will include only k-mers touching this seed.

This will allow me to fix the skipping code, probably by waiting at least 100s k-mers instead of just 1.

I'll post the link to the viewer for this.

The missing region bug will be fixed before my HTML5 viewer hits a first good release.

zorino commented 11 years ago

Indeed I made a mistake in Ray-kmer31-patched/ when I ran kmer-mapping.rb I've used k=51 when trying to map the missing region in the graph so it's normal that it didn'd find anything. I've redone it correctly and the region appeared in the graph so the new one is fine. /speedSAS/users/boiseb01/for-max/Ray-kmer31-patched/Ray-Cronus-2012-11-13/kmer-map-gap-real.fa.txt

We should stick with gap-real.fa instead of gap-2 because gap-real is the one that came out of the k51 assembly and there's a deletion (of 1 nt) in gap 2 see at position ~ 10000 in the subject (gap-2) /speedSAS/users/boiseb01/for-max/Ray-kmer51/Parts/gap2.fasta.align

Now that we've found out what happens we're a step closer to a fix.

sebhtml commented 11 years ago

Hey,

On 11/14/2012 12:16 PM, Zorino wrote:

Indeed I made a mistake in Ray-kmer31-patched/ when I ran kmer-mapping.rb I've used k=51 when trying to map the missing region in the graph so it's normal that it didn'd find anything. I've redone it correctly and the region appeared in the graph so the new one is fine. /speedSAS/users/boiseb01/for-max/Ray-kmer31-patched/Ray-Cronus-2012-11-13/kmer-map-gap-real.fa.txt

Thanks.

Now I will concentrate my efforts on fetching a subgraph using the seed sequence described in this issue.

We should stick with gap-real.fa instead of gap-2 because gap-real is the one that came out of the k51 assembly and there's a deletion (of 1 nt) in gap 2 see at position ~ 10000 in the subject (gap-2) /speedSAS/users/boiseb01/for-max/Ray-kmer51/Parts/gap2.fasta.align

Thanks to your tests, I don't need anything anymore from the reference sequence although the reference was handy for fishing the fish. ;-)

I have found a piece that appears in seeds, but not in contigs (see above). Fixing that will fix the whole thing.

Now that we've found out what happens we're a step closer to a fix.

Your wording reminds me this: "One Step Closer"

http://www.youtube.com/watch?v=4qlCC1GOwFw

— Reply to this email directly or view it on GitHub https://github.com/sebhtml/ray/issues/101#issuecomment-10374691.

sebhtml commented 11 years ago

Dev is on ls28/git-clones/Ray-Cloud-Browser/tooling

sebhtml commented 11 years ago

Is the seed in a extension pre-merging ?

sebhtml commented 11 years ago

With -write-extensions Ray-Cronus-2012-11-20-extensions

sebhtml commented 11 years ago

The fix will probably to skip a seed only when at least the first 100 vertices were visited. This must be observed in the first flow.

sebhtml commented 11 years ago

/ls28/speedSAS/users/boiseb01/for-max/k31-with-extensions/Ray-Cronus-2012-11-20-extensions

6.3M Nov 21 08:12 All-Seeds.fasta /software/blat/bin/x86_64/blat All-Seeds.fasta ../../seb-work/AssemblyGap/gap2.fasta in-seeds.psl awk '{print $1}' in-seeds.psl |sort -r -n|head -n1 3603

18M Nov 21 08:12 All-Extensions.fasta /software/blat/bin/x86_64/blat All-Extensions.fasta ../../seb-work/AssemblyGap/gap2.fasta in-extensions.psl awk '{print $1}' in-extensions.psl |sort -r -n|head -n5 35296 <**** 538 538 512 502

6.1M Nov 21 00:52 Contigs.fasta

/software/blat/bin/x86_64/blat Contigs.fasta ../../seb-work/AssemblyGap/gap2.fasta in-contigs.psl awk '{print $1}' in-contigs.psl |sort -r -n|head -n1 538

new story:

The missing region is seeded correctly. It is also extended correctly. But it disappear during the merging !

So it is discarded either because the 35296-nt hunk is included in another one, or because it overlaps and the friend took its extra part.

sebhtml commented 11 years ago

The extension is RayExtension-7000015.fasta.

141 k, appears to be a merging problem

sebhtml commented 11 years ago

Fixed.

[boiseb01@ls28 Ray-Cronus-2012-11-20-extensions.2]$ /software/blat/bin/x86_64/blat Contigs.fasta ../seb-work/AssemblyGap/gap2.fasta in-contigs.psl Loaded 6374356 letters in 100 sequences Searched 36001 bases in 1 sequences [boiseb01@ls28 Ray-Cronus-2012-11-20-extensions.2]$ awk '{print $1}' in-contigs.psl |sort -r -n|head -n1 35296

713ea4479cad704eaf95c5dce8528c32d7b62d21

sebhtml commented 11 years ago

Running system tests too:

$ bash main-on-super-computer.sh

sebhtml commented 11 years ago

$ less SystemTests-2012-11-21-16-23-32.validationEntries

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

coli.Ray & 72 & 66 & 4603592 & 63938 & 118479 & 222309 & 0.9943 & 4 & 0 & 7 & 4 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

coli-interleaved.Ray & 72 & 65 & 4603150 & 63932 & 118479 & 222309 & 0.9943 & 4 & 0 & 7 & 4 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

ecoli-MiSeq.Ray & 83 & 78 & 4613123 & 55579 & 95817 & 269165 & 0.9904 & 1 & 1 & 23 & 43 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

L1-L3.Ray & 10 & 3 & 4661003 & 466100 & 475401 & 1765419 & 1.0000 & 2 & 0 & 80 & 31 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

multiple-peaks.Ray & 5 & 4 & 881713 & 176342 & 614024 & 614024 & 0.9993 & 1 & 0 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

pg2.Ray & 6 & 6 & 871450 & 145241 & 612159 & 612159 & 0.9976 & 0 & 0 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

phix.Ray & 1 & 1 & 5386 & 5386 & 5386 & 5386 & 1.0000 & 0 & 0 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

pseudomonas-interleaved-bz2.Ray & 74 & 56 & 6232047 & 84216 & 233788 & 572438 & 0.9974 & 3 & 1 & 1 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

pseudomonas-p.Ray & 103 & 100 & 6220431 & 60392 & 248130 & 580140 & 0.9958 & 0 & 1 & 45 & 2 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

pseudomonas-p-bz2-gz.Ray & 103 & 99 & 6220431 & 60392 & 248130 & 580140 & 0.9958 & 0 & 1 & 45 & 2 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

pseudomonas-s.Ray & 1951 & 1951 & 6032298 & 3091 & 4758 & 29440 & 0.9627 & 1 & 0 & 1 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

strept.Ray & 104 & 97 & 2000650 & 19237 & 36431 & 128039 & 0.9845 & 0 & 1 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

strept-200.Ray & 104 & 97 & 2000643 & 19236 & 36431 & 128039 & 0.9845 & 0 & 1 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

strept-300.Ray & 71 & 61 & 2008039 & 28282 & 55479 & 182993 & 0.9839 & 0 & 0 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

strept-amos.Ray & 104 & 97 & 2000678 & 19237 & 36431 & 128039 & 0.9846 & 0 & 1 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

strept-bz2.Ray & 105 & 98 & 2000722 & 19054 & 36431 & 128039 & 0.9846 & 0 & 1 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

strept-interleaved.Ray & 105 & 98 & 2001380 & 19060 & 36431 & 128039 & 0.9846 & 0 & 1 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

strept-interleaved-bz2.Ray & 105 & 98 & 2001352 & 19060 & 36431 & 128039 & 0.9845 & 0 & 1 & 0 & 0 \

    %  & numberOfContigs &scaffolds & bases & meanSize  & n50  & max   & coverage   & misassembledContigs & misassembledScaffolds & mismatches & indels

strept-simulated-errors.Ray & 99 & 82 & 1997709 & 20178 & 37594 & 127936 & 0.9856 & 0 & 0 & 1 & 0 \