Closed sebhtml closed 11 years ago
make HAVE_LIBZ=y HAVE_LIBBZ2=y -j 10 ASSERT=y PREFIX=/speedSAS/users/boiseb01/for-max/Ray-build
ray a248ac9b572eaecc3f56a5155f50bb4bad62b25f RayPlatform ebfde10907d7cce48ed466dc44192e0bd2a281c4
sample is /rap/nne-790-ab/projects/Paul-H-Roy-pseudomonases/Project_PHR/Sample_Paeru2
The missing region has reads (bwa + samtools) The missing region is not missing at k=51
ref. is pao303
On Colosse: /rap/nne-790-ab/projects/Paul-H-Roy-pseudomonases/Project_PHR/MissingRegion
Working directory (ls28): /speedSAS/users/boiseb01/for-max/seb-work
Also a test with this patch:
diff --git a/code/plugin_SeedingData/SeedingData.cpp b/code/plugin_SeedingData/SeedingData.cpp index 79e9b04..74b071c 100644 --- a/code/plugin_SeedingData/SeedingData.cpp +++ b/code/plugin_SeedingData/SeedingData.cpp @@ -135,7 +135,7 @@ void SeedingData::call_RAY_SLAVE_MODE_START_SEEDING(){ Kmer lastVertex=seed[seed.size()-1]; Kmer firstReverse=m_parameters->_complementVertex(&lastVertex);
diff --git a/code/plugin_SeedingData/SeedingData.cpp b/code/plugin_SeedingData/SeedingData.cpp index 79e9b04..74b071c 100644 --- a/code/plugin_SeedingData/SeedingData.cpp +++ b/code/plugin_SeedingData/SeedingData.cpp @@ -135,7 +135,7 @@ void SeedingData::call_RAY_SLAVE_MODE_START_SEEDING(){ Kmer lastVertex=seed[seed.size()-1]; Kmer firstReverse=m_parameters->_complementVertex(&lastVertex); - if(firstVertex
3 tests started:
[boiseb01@ls28 seb-work]$ bash Tests.sh Will use 24 cores Wed Nov 14 10:42:31 EST 2012 starting ray-k31-seb.1
In this runs done by Maxime Déraspe (zorino):
Path: /speedSAS/users/boiseb01/for-max/Ray-kmer31/Ray-Cronus-2012-11-06.1
/software/blat/bin/x86_64/blat Contigs.fasta ../Parts/gap2.fasta Hits-for-gap2-in-contigs.psl Best hits: 538 502 467 460 435
/software/blat/bin/x86_64/blat All-Seeds.fasta ../Parts/gap2.fasta Hits-for-gap2-in-seeds.psl Best hits: 3603 2611 2385 2365 2068
Path: /speedSAS/users/boiseb01/for-max/Ray-kmer51/Ray-Cronus-2012-11-08.1
/software/blat/bin/x86_64/blat Contigs.fasta ../Parts/gap2.fasta Hits-for-gap2-in-contigs.psl
Best hit: 35296 0 0 0 0 0 1 1 + NC_002516.2 36001 0 35296 contig-7000004 137649 102352 137649
/software/blat/bin/x86_64/blat All-Seeds.fasta ../Parts/gap2.fasta Hits-for-gap2-in-seeds.psl Best hits:
3623 2068 2068 1480 1207
Finally, as expected, the patched Ray produces more seeds:
-rw-rw-r--. 1 boiseb01 boiseb01 13M Nov 14 11:08 All-Seeds.fasta
As Maxime Déraspe said (I reviewed his tests and confirm), the patch changes nothing:
Path: /speedSAS/users/boiseb01/for-max/Ray-kmer31-patched/Ray-Cronus-2012-11-13
/software/blat/bin/x86_64/blat Contigs.fasta ../Parts/gap2.fasta Hits-for-gap2-in-contigs.psl 699 502 467 460 435
/software/blat/bin/x86_64/blat All-Seeds.fasta ../Parts/gap2.fasta Hits-for-gap2-in-seeds.psl 3603 3603 2611 2611 2385
So it seems that the seeds get discarded during the extension. Otherwise, there would be contigs at least the size of the seeds in which we can see parts of the missing region.
https://github.com/sebhtml/ray/issues/101
Ah !
I found your logs (thanks by the way for all these nicely done tests!).
I don't need to run my own tests.
I confirmed all your claims with blat, but it seems to me that your missing region is "seeded" correctly in all 3 tests (k31, k51, k31+patch).
The problem is therefore not with seeds at all (we had concluded that together Friday I think).
My next stop is to check your work with kmers.txt
In /speedSAS/users/boiseb01/for-max/Ray-kmer31/Ray-Cronus-2012-11-06.1
kmer-map-gap-real.fa.txt is the latest, but columns are 0 and it's for k=51 in the k=31 run (?)
You should use ../../Ray-kmer51/Ray-Cronus-2012-11-08.1/kmers.txt for k=51, right ?
kmer-map-gap2.txt
=> Looks nice. But I think I will be easier to see in HTML5.
But I'll fix this bug without the HTML5 thing because the HTML5 viewer is not ready.
So we should find an answer to this question:
For Maxime Déraspe's k=31 assembly of Sample_Paeru2, the missing region has several good hits in the seeds (with matches 3603, 2611, 2385, 2365, 2068). Yet, these seeds don't appear in the assembly.
In particular, let's focus on this sequence:
[boiseb01@ls28 Ray-Cronus-2012-11-06.1]$ ls -lh RaySeed-7000015.fasta -rw-rw-r--. 1 boiseb01 boiseb01 3.6K Nov 14 11:28 RaySeed-7000015.fasta
/software/blat/bin/x86_64/blat Contigs.fasta RaySeed-7000015.fasta RaySeed-7000015.fasta-in-Contigs.psl Loaded 6198741 letters in 99 sequences Searched 3603 bases in 1 sequences
So the question is:
Why is the sequence (k-mers) of seed RaySeed-7000015 not in the contigs ?
If Seed-7000015 is well put, then there is not reason for it to be skipped.
There is probably something special about its topology.
I'll put together a kmers.txt-RaySeed-7000015 that will include only k-mers touching this seed.
This will allow me to fix the skipping code, probably by waiting at least 100s k-mers instead of just 1.
I'll post the link to the viewer for this.
The missing region bug will be fixed before my HTML5 viewer hits a first good release.
Indeed I made a mistake in Ray-kmer31-patched/ when I ran kmer-mapping.rb I've used k=51 when trying to map the missing region in the graph so it's normal that it didn'd find anything. I've redone it correctly and the region appeared in the graph so the new one is fine. /speedSAS/users/boiseb01/for-max/Ray-kmer31-patched/Ray-Cronus-2012-11-13/kmer-map-gap-real.fa.txt
We should stick with gap-real.fa instead of gap-2 because gap-real is the one that came out of the k51 assembly and there's a deletion (of 1 nt) in gap 2 see at position ~ 10000 in the subject (gap-2) /speedSAS/users/boiseb01/for-max/Ray-kmer51/Parts/gap2.fasta.align
Now that we've found out what happens we're a step closer to a fix.
Hey,
On 11/14/2012 12:16 PM, Zorino wrote:
Indeed I made a mistake in Ray-kmer31-patched/ when I ran kmer-mapping.rb I've used k=51 when trying to map the missing region in the graph so it's normal that it didn'd find anything. I've redone it correctly and the region appeared in the graph so the new one is fine. /speedSAS/users/boiseb01/for-max/Ray-kmer31-patched/Ray-Cronus-2012-11-13/kmer-map-gap-real.fa.txt
Thanks.
Now I will concentrate my efforts on fetching a subgraph using the seed sequence described in this issue.
We should stick with gap-real.fa instead of gap-2 because gap-real is the one that came out of the k51 assembly and there's a deletion (of 1 nt) in gap 2 see at position ~ 10000 in the subject (gap-2) /speedSAS/users/boiseb01/for-max/Ray-kmer51/Parts/gap2.fasta.align
Thanks to your tests, I don't need anything anymore from the reference sequence although the reference was handy for fishing the fish. ;-)
I have found a piece that appears in seeds, but not in contigs (see above). Fixing that will fix the whole thing.
Now that we've found out what happens we're a step closer to a fix.
Your wording reminds me this: "One Step Closer"
http://www.youtube.com/watch?v=4qlCC1GOwFw
— Reply to this email directly or view it on GitHub https://github.com/sebhtml/ray/issues/101#issuecomment-10374691.
Dev is on ls28/git-clones/Ray-Cloud-Browser/tooling
Is the seed in a extension pre-merging ?
With -write-extensions Ray-Cronus-2012-11-20-extensions
The fix will probably to skip a seed only when at least the first 100 vertices were visited. This must be observed in the first flow.
/ls28/speedSAS/users/boiseb01/for-max/k31-with-extensions/Ray-Cronus-2012-11-20-extensions
6.3M Nov 21 08:12 All-Seeds.fasta /software/blat/bin/x86_64/blat All-Seeds.fasta ../../seb-work/AssemblyGap/gap2.fasta in-seeds.psl awk '{print $1}' in-seeds.psl |sort -r -n|head -n1 3603
18M Nov 21 08:12 All-Extensions.fasta /software/blat/bin/x86_64/blat All-Extensions.fasta ../../seb-work/AssemblyGap/gap2.fasta in-extensions.psl awk '{print $1}' in-extensions.psl |sort -r -n|head -n5 35296 <**** 538 538 512 502
6.1M Nov 21 00:52 Contigs.fasta
/software/blat/bin/x86_64/blat Contigs.fasta ../../seb-work/AssemblyGap/gap2.fasta in-contigs.psl awk '{print $1}' in-contigs.psl |sort -r -n|head -n1 538
new story:
The missing region is seeded correctly. It is also extended correctly. But it disappear during the merging !
So it is discarded either because the 35296-nt hunk is included in another one, or because it overlaps and the friend took its extra part.
The extension is RayExtension-7000015.fasta.
141 k, appears to be a merging problem
Fixed.
[boiseb01@ls28 Ray-Cronus-2012-11-20-extensions.2]$ /software/blat/bin/x86_64/blat Contigs.fasta ../seb-work/AssemblyGap/gap2.fasta in-contigs.psl Loaded 6374356 letters in 100 sequences Searched 36001 bases in 1 sequences [boiseb01@ls28 Ray-Cronus-2012-11-20-extensions.2]$ awk '{print $1}' in-contigs.psl |sort -r -n|head -n1 35296
713ea4479cad704eaf95c5dce8528c32d7b62d21
Running system tests too:
$ bash main-on-super-computer.sh
$ less SystemTests-2012-11-21-16-23-32.validationEntries
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
coli.Ray & 72 & 66 & 4603592 & 63938 & 118479 & 222309 & 0.9943 & 4 & 0 & 7 & 4 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
coli-interleaved.Ray & 72 & 65 & 4603150 & 63932 & 118479 & 222309 & 0.9943 & 4 & 0 & 7 & 4 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
ecoli-MiSeq.Ray & 83 & 78 & 4613123 & 55579 & 95817 & 269165 & 0.9904 & 1 & 1 & 23 & 43 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
L1-L3.Ray & 10 & 3 & 4661003 & 466100 & 475401 & 1765419 & 1.0000 & 2 & 0 & 80 & 31 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
multiple-peaks.Ray & 5 & 4 & 881713 & 176342 & 614024 & 614024 & 0.9993 & 1 & 0 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
pg2.Ray & 6 & 6 & 871450 & 145241 & 612159 & 612159 & 0.9976 & 0 & 0 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
phix.Ray & 1 & 1 & 5386 & 5386 & 5386 & 5386 & 1.0000 & 0 & 0 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
pseudomonas-interleaved-bz2.Ray & 74 & 56 & 6232047 & 84216 & 233788 & 572438 & 0.9974 & 3 & 1 & 1 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
pseudomonas-p.Ray & 103 & 100 & 6220431 & 60392 & 248130 & 580140 & 0.9958 & 0 & 1 & 45 & 2 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
pseudomonas-p-bz2-gz.Ray & 103 & 99 & 6220431 & 60392 & 248130 & 580140 & 0.9958 & 0 & 1 & 45 & 2 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
pseudomonas-s.Ray & 1951 & 1951 & 6032298 & 3091 & 4758 & 29440 & 0.9627 & 1 & 0 & 1 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
strept.Ray & 104 & 97 & 2000650 & 19237 & 36431 & 128039 & 0.9845 & 0 & 1 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
strept-200.Ray & 104 & 97 & 2000643 & 19236 & 36431 & 128039 & 0.9845 & 0 & 1 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
strept-300.Ray & 71 & 61 & 2008039 & 28282 & 55479 & 182993 & 0.9839 & 0 & 0 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
strept-amos.Ray & 104 & 97 & 2000678 & 19237 & 36431 & 128039 & 0.9846 & 0 & 1 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
strept-bz2.Ray & 105 & 98 & 2000722 & 19054 & 36431 & 128039 & 0.9846 & 0 & 1 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
strept-interleaved.Ray & 105 & 98 & 2001380 & 19060 & 36431 & 128039 & 0.9846 & 0 & 1 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
strept-interleaved-bz2.Ray & 105 & 98 & 2001352 & 19060 & 36431 & 128039 & 0.9845 & 0 & 1 & 0 & 0 \
% & numberOfContigs &scaffolds & bases & meanSize & n50 & max & coverage & misassembledContigs & misassembledScaffolds & mismatches & indels
strept-simulated-errors.Ray & 99 & 82 & 1997709 & 20178 & 37594 & 127936 & 0.9856 & 0 & 0 & 1 & 0 \
Path: /speedSAS/users/boiseb01/for-max
Problem: A region is in the graph, but does not appear in assembly k=31. It appears in assembly k=51.
Agenda:
Reproduce experiments at k=31 and k=51 with latest build