sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing

Redundant sequences with bigger k #141

Closed fredericraymond closed 11 years ago

fredericraymond commented 11 years ago

The data are in: /rap/nne-790-ab/projects/Project_CQDM2/CQDM_Run1/Sample_CQDM2-3-K51-SilverRay-2013-01-22/Assembly

Example BLAST output:

contig-21000007 contig-20000007 99.98 227992 37 1 31788 259773 227992 1 0.0e+00 445214.0
contig-20000007 contig-21000007 99.98 227992 37 1 1 227992 259773 31788 0.0e+00 445214.0
contig-23000019 contig-19000012 100.00 156020 0 0 19943 175962 1 156020 0.0e+00 304664.0
contig-19000012 contig-23000019 100.00 156020 0 0 1 156020 19943 175962 0.0e+00 304664.0
contig-47 contig-17 100.00 116408 5 0 1 116408 1 116408 0.0e+00 227355.0
contig-17 contig-47 100.00 116408 5 0 1 116408 1 116408 0.0e+00 227355.0
contig-19000012 contig-16000061 100.00 106628 0 0 122434 229061 106628 1 0.0e+00 208438.0
contig-16000061 contig-19000012 100.00 106628 0 0 1 106628 229061 122434 0.0e+00 208438.0
contig-23000019 contig-21000007 100.00 104746 0 0 142376 247121 1 104746 0.0e+00 204696.0
contig-21000007 contig-23000019 100.00 104746 0 0 1 104746 142376 247121 0.0e+00 204696.0
contig-60 contig-47 100.00 90318 1 0 163462 253779 1 90318 0.0e+00 176776.0
contig-60 contig-17 100.00 90318 1 0 163462 253779 1 90318 0.0e+00 176776.0
contig-47 contig-60 100.00 90318 1 0 1 90318 163462 253779 0.0e+00 176776.0
contig-17 contig-60 100.00 90318 1 0 1 90318 163462 253779 0.0e+00 176776.0
contig-23000019 contig-20000007 99.94 72965 37 1 174163 247121 227992 155028 0.0e+00 142398.0
contig-20000007 contig-23000019 99.94 72965 37 1 155028 227992 247121 174163 0.0e+00 142398.0
contig-23000019 contig-19000012 100.00 71159 0 0 175963 247121 229062 300220 0.0e+00 138984.0
contig-21000007 contig-19000012 100.00 71159 0 0 33588 104746 229062 300220 0.0e+00 138984.0
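To list the redundant contig pairs without reading the hits by eye, something like the following sketch may help. It assumes the rows use the standard 12-column BLAST tabular layout (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score), which is what the values above look like. The names `parse_hit` and `redundant_pairs` and the thresholds (99% identity, 50000 nt) are made up for illustration and are not part of Ray or BLAST.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    query: str
    subject: str
    identity: float
    length: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def parse_hit(line: str) -> BlastHit:
    """Parse one row, assuming the standard 12-column tabular layout."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    return BlastHit(f[0], f[1], float(f[2]), int(f[3]),
                    int(f[6]), int(f[7]), int(f[8]), int(f[9]))


def redundant_pairs(lines, min_identity=99.0, min_length=50000):
    """Return (pair, longest alignment) for long, near-identical hits.

    Reciprocal hits (A vs B and B vs A) are folded into one unordered pair.
    The thresholds are arbitrary illustration values.
    """
    longest = {}
    for line in lines:
        if not line.strip():
            continue
        hit = parse_hit(line)
        if hit.query == hit.subject:
            continue  # ignore self-hits
        if hit.identity < min_identity or hit.length < min_length:
            continue
        pair = tuple(sorted((hit.query, hit.subject)))
        longest[pair] = max(longest.get(pair, 0), hit.length)
    return sorted(longest.items(), key=lambda item: item[1], reverse=True)


# Three rows copied from the BLAST output in this issue.
SAMPLE = """\
contig-21000007 contig-20000007 99.98 227992 37 1 31788 259773 227992 1 0.0e+00 445214.0
contig-20000007 contig-21000007 99.98 227992 37 1 1 227992 259773 31788 0.0e+00 445214.0
contig-23000019 contig-19000012 100.00 156020 0 0 19943 175962 1 156020 0.0e+00 304664.0
"""

for pair, length in redundant_pairs(SAMPLE.splitlines()):
    print(pair[0], pair[1], length)
```

Folding reciprocal hits into one unordered pair keeps the report short, since every overlap shows up twice in an all-against-all BLAST.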

sebhtml commented 11 years ago


mpiexec -n 64 Ray \
 -o Assembly \
 -k 51 \
 -p Sample/CQDM2-3_Lane5_R1_1.fastq.gz Sample/CQDM2-3_Lane5_R2_1.fastq.gz \
 -p Sample/CQDM2-3_Lane5_R1_2.fastq.gz Sample/CQDM2-3_Lane5_R2_2.fastq.gz \
 -p Sample/CQDM2-3_Lane5_R1_3.fastq.gz Sample/CQDM2-3_Lane5_R2_3.fastq.gz \
 -p Sample/CQDM2-3_Lane5_R1_4.fastq.gz Sample/CQDM2-3_Lane5_R2_4.fastq.gz \
 -p Sample/CQDM2-3_Lane5_R1_5.fastq.gz Sample/CQDM2-3_Lane5_R2_5.fastq.gz \
 -p Sample/CQDM2-3_Lane5_R1_6.fastq.gz Sample/CQDM2-3_Lane5_R2_6.fastq.gz \
 -search /rap/nne-790-ab/genomes/EMBL_CDS+GO/EMBL_CDS_Sequences \
 -gene-ontology /rap/nne-790-ab/genomes/EMBL_CDS+GO/000-Ontologies.txt /rap/nne-790-ab/genomes/EMBL_CDS+GO/000-Annotations.txt \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/ARDB \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/Bacteria-Genomes \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/HumanChromosomes \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/NCBI-Bacteria_DRAFT \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/Viruses-Genomes \
 -with-taxonomy /rap/nne-790-ab/genomes/taxonomy/last-build/Genome-to-Taxon.tsv /rap/nne-790-ab/genomes/taxonomy/last-build/TreeOfLife-Edges.tsv /rap/nne-790-ab/genomes/taxonomy/last-build/Taxon-Names.tsv

42127784 reads


$ grep contig-21000007 Assembly/Contigs.fasta
contig-21000007 265935 nucleotides
$ grep contig-20000007 Assembly/Contigs.fasta
contig-20000007 258265 nucleotides

contig-21000007 (265935 letters)
  259773                                                                32710
    |                                                                              |
    |                                                                              |
    1                                                                        227064

contig-20000007 (258265 letters)

sebhtml commented 11 years ago

$ msub
10196530
$ pwd
/rap/nne-790-ac/projects/Ray-Issue-141

sebhtml commented 11 years ago
fredericraymond commented 11 years ago

I'm not sure, but probably. To a lesser extent, that's for sure.


From: Sébastien Boisvert. Sent: January 24, 2013, 10:49. To: sebhtml/ray. Cc: Frederic Raymond. Subject: Re: [ray] Redundant sequences with bigger k (#141)


sebhtml commented 11 years ago

./Sample_CQDM2-3-SilverRay-2012-12-20/Assembly/OutputNumbers.txt (-k 31)

Scaffolds >= 500 nt Number: 95 Total length: 4115254 Average: 43318 N50: 71438 Median: 33623 Largest: 231887


Scaffolds >= 500 nt Number: 70 Total length: 4570993 Average: 65299 N50: 90796 Median: 61553 Largest: 296414


Scaffolds >= 500 nt Number: 45 Total length: 5086488 Average: 113033 N50: 209165 Median: 91321 Largest: 338057


Scaffolds >= 500 nt Number: 32 Total length: 4172116 Average: 130378 N50: 310035 Median: 89956 Largest: 521968


Scaffolds >= 500 nt Number: 35 Total length: 4368521 Average: 124814 N50: 233631 Median: 89243 Largest: 493924


Scaffolds >= 500 nt Number: 44 Total length: 4219489 Average: 95897 N50: 145661 Median: 74685 Largest: 453193


Scaffolds >= 500 nt Number: 643 Total length: 4592452 Average: 7142 N50: 10929 Median: 4968 Largest: 49827


Scaffolds >= 500 nt Number: 2464 Total length: 3175881 Average: 1288 N50: 1465 Median: 1014 Largest: 10908
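For reading the N50 column in these summaries, here is a minimal sketch of the usual definition: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the overall total. This is the common textbook definition, not code taken from Ray, and Ray's own handling of ties or rounding is not shown in this thread. The function name `n50` is just for illustration.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (common definition)."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy example: total is 20, and 6 + 5 = 11 is the first running sum >= 10.
print(n50([2, 3, 4, 5, 6]))
```

With this definition, a few very long but redundant scaffolds can raise both the total length and the N50, which is worth keeping in mind when comparing the summaries.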

sebhtml commented 11 years ago

In the map: Sample-2-3 => Contigs => contig-21000019 => 174190

[sebhtml@ip-10-122-59-216 server]$ ./RayCloudBrowser.probePath debug Ray-Cloud-Browser-data/Sample-2-3/Contigs.dat |head
--- Ray Technologies ---

Magic: 2345678989
Objects: 1177
[0] 37680 34 37714 408310 name=contig-21000019 408310 nucleotides sequence=CTGCAAGGATGGAAATGTCAAATGGATATCA...
[1] 446024 34 446058 338057 name=contig-19000041 338057 nucleotides sequence=GCGCCTAGCGGCGCACGTTTCTAATGGGTGA...
[2] 784115 34 784149 315777 name=contig-19000012 315777 nucleotides sequence=TCGATGTCCTCCGGCGATGTCGGCACAAACT...
[3] 1099926 34 1099960 300220 name=contig-20000012 300220 nucleotides sequence=GATACCATGTATTACCCGGAGGGAGTCACCG...
[4] 1400180 34 1400214 258265 name=contig-17000007 258265 nucleotides sequence=CGGCGGGGAAATTGGTGCAGAAAGCTGCGTC...
[5] 1658479 28 1658507 255343 name=contig-60 255343 nucleotides sequence=TACATGGAGTCTGTACTTTGATCGTCATATA...

No seed is that long:

[sebhtml@ip-10-122-59-216 server]$ ./RayCloudBrowser.probePath debug Ray-Cloud-Browser-data/Sample-2-3/Seeds.dat |head
--- Ray Technologies ---

Magic: 2345678989
Objects: 2913
[0] 93232 10 93242 31095 name=RaySeed-40 sequence=ACTTCACACCGTACGGTACCAGCCGGGACGG...
[1] 124337 9 124346 19948 name=RaySeed-6 sequence=CCAGCCGCCCCCGCCGATCAGCCCGCAGTAC...
[2] 144294 10 144304 18515 name=RaySeed-14 sequence=CCGCCGATAGCGGCACCGATAAATGCAAGGC...
[3] 162819 10 162829 16365 name=RaySeed-27 sequence=AGCTTGACGCAGCTTTCTGCACCAATTTCCC...
[4] 179194 15 179209 15241 name=RaySeed-1000014 sequence=CTGCCGTATCCGTGAGTTGCCTTCGTCACTT...
[5] 194450 10 194460 15029 name=RaySeed-60 sequence=GGTGTACACATAGCTGCGTGTTTTTTACTGA...

And no extension is that long either.

[sebhtml@ip-10-122-59-216 server]$ ./RayCloudBrowser.probePath debug Ray-Cloud-Browser-data/Sample-2-3/Extensions.dat |head
--- Ray Technologies ---

Magic: 2345678989
Objects: 1288
[0] 41232 20 41252 279064 name=RayExtension-6000043 sequence=TCGATGTCCTCCGGCGATGTCGGCACAAACT...
[1] 320316 15 320331 279064 name=RayExtension-32 sequence=AAAAATGTGTATGGTCATTTTTGCGCAATGA...
[2] 599395 21 599416 277413 name=RayExtension-19000044 sequence=CGCCGACGTCGCCGAAGGTCTTCCCCTGGAG...
[3] 876829 20 876849 255343 name=RayExtension-6000060 sequence=TACATGGAGTCTGTACTTTGATCGTCATATA...
[4] 1132192 20 1132212 255309 name=RayExtension-1000056 sequence=CTCCAATCCCCGAAGCGGTCCCTCATCGATA...
[5] 1387521 14 1387535 213744 name=RayExtension-6 sequence=GTGTTATAATCTCCTTTTTCTTATTCTGCCA...

sebhtml commented 11 years ago

227915 matches

27 0 0 6 16 6 16 -

contig1 -> contig-17000007 258265 0 227958

contig2 -> contig-21000019 408310 174190 4021487

sebhtml commented 11 years ago

Check JoinerWorker: something is wrong. It seems that the algorithm performs the join even when there is sequence on both sides of the match that differs. Probably hitting a hard limit.
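A rough way to test for this symptom from alignment coordinates alone is sketched below. This is not the JoinerWorker code, and the names `Hit`, `overhangs` and `is_clean_join` are hypothetical. The idea is that two contigs form a clean end-to-end overlap only if, on each side of the aligned region, at most one of them has unaligned sequence left over. The numbers come from the first BLAST row and the `grep` output earlier in this issue, and the coordinates are assumed to be 1-based and inclusive, with subject start > subject end meaning a reverse-strand hit.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    q_len: int    # full length of the query contig
    q_start: int  # 1-based, inclusive
    q_end: int
    s_len: int    # full length of the subject contig
    s_start: int  # s_start > s_end is taken to mean reverse strand
    s_end: int


def overhangs(hit):
    """Unaligned lengths as ((q_left, s_left), (q_right, s_right)),
    expressed in the orientation of the query."""
    q_left = hit.q_start - 1
    q_right = hit.q_len - hit.q_end
    if hit.s_start <= hit.s_end:
        s_left = hit.s_start - 1
        s_right = hit.s_len - hit.s_end
    else:
        # Reverse strand: the subject is flipped relative to the query.
        s_left = hit.s_len - hit.s_start
        s_right = hit.s_end - 1
    return (q_left, s_left), (q_right, s_right)


def is_clean_join(hit, tolerance=0):
    """True if, on each side, at least one contig ends within `tolerance`
    nucleotides of the aligned region."""
    left, right = overhangs(hit)
    return min(left) <= tolerance and min(right) <= tolerance


# contig-21000007 (265935 nt) against contig-20000007 (258265 nt).
hit = Hit(q_len=265935, q_start=31788, q_end=259773,
          s_len=258265, s_start=227992, s_end=1)
print(overhangs(hit))      # ((31787, 30273), (6162, 0))
print(is_clean_join(hit))  # False
```

For this pair, both contigs carry roughly 30 kb of unaligned sequence on the left of the match, so under these assumptions the match would not count as a simple end-to-end overlap.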

sebhtml commented 11 years ago


Contigs >= 100 nt Number: 1173 Total length: 4349486 Average: 3708 N50: 175543 Median: 108 Largest: 337729
Contigs >= 500 nt Number: 46 Total length: 4213647 Average: 91601 N50: 175543 Median: 63754 Largest: 337729
Scaffolds >= 100 nt Number: 1168 Total length: 4350426 Average: 3724 N50: 175543 Median: 108 Largest: 337729
Scaffolds >= 500 nt Number: 41 Total length: 4214587 Average: 102794 N50: 176987 Median: 82210 Largest: 337729

Rank 0 wrote /ltmp/boisver1/Sample_CQDM2-3-Ray-27/Contigs.fasta
Rank 0 wrote /ltmp/boisver1/Sample_CQDM2-3-Ray-27/Scaffolds.fasta
Check for /ltmp/boisver1/Sample_CQDM2-3-Ray-27/*