sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Matrix generation must be reproducible and deterministic #216

Closed sebhtml closed 10 years ago

sebhtml commented 10 years ago

1 graph

    3e49e9641c1c92c06dfdcce8a38037a35925fe80 RaySurveyorResults/Surveyor/DistanceMatrix.tsv
    ec502377ef50660bf49c363586e3d5730de030e5 RaySurveyorResults/Surveyor/SimilarityMatrix.tsv

2 graphs

    [boiseb01@ls30 Surveyor]$ sha1sum RaySurveyorResults/Surveyor/*
    3e49e9641c1c92c06dfdcce8a38037a35925fe80 RaySurveyorResults/Surveyor/DistanceMatrix.tsv
    ec502377ef50660bf49c363586e3d5730de030e5 RaySurveyorResults/Surveyor/SimilarityMatrix.tsv

(same)

With 2 graphs:

    ed1ffcda8aee9e5e0405c80308c8139a7d3118fc RaySurveyorResults/Surveyor/DistanceMatrix.tsv
    4f1ba7bbcc2ca56be68fe072e34259722fbcb50a RaySurveyorResults/Surveyor/SimilarityMatrix.tsv

    ed1ffcda8aee9e5e0405c80308c8139a7d3118fc RaySurveyorResults/Surveyor/DistanceMatrix.tsv
    4f1ba7bbcc2ca56be68fe072e34259722fbcb50a RaySurveyorResults/Surveyor/SimilarityMatrix.tsv

With Run.sh (> 30 graphs):

    3112fa506ae0424f11877c1f75ddb3eb64d67305 RaySurveyorResults/Surveyor/DistanceMatrix.tsv
    8a07e4efcdc757d652d521874688a95d9e44e669 RaySurveyorResults/Surveyor/SimilarityMatrix.tsv

Run 2 is ongoing.

    0b8085e816e9fbb6258de681fcbb2e4383cafc2f RaySurveyorResults/Surveyor/DistanceMatrix.tsv
    b48b1c4fea70222ff07ab97b2bcb6e394baa39ca RaySurveyorResults/Surveyor/SimilarityMatrix.tsv

sebhtml commented 10 years ago

Check the number of vertices received.

sebhtml commented 10 years ago

The number of received objects should be the same between runs:

    [boiseb01@ls30 Surveyor]$ grep received run1/RaySurveyorResults.1.*|grep total
    run1/RaySurveyorResults.1.00:/actors/75 -> (StoreKeeper) received 76102969 objects in total
    run1/RaySurveyorResults.1.01:/actors/76 -> (StoreKeeper) received 76147458 objects in total
    run1/RaySurveyorResults.1.02:/actors/77 -> (StoreKeeper) received 76082322 objects in total
    run1/RaySurveyorResults.1.03:/actors/78 -> (StoreKeeper) received 76086170 objects in total
    run1/RaySurveyorResults.1.04:/actors/79 -> (StoreKeeper) received 76064578 objects in total
    run1/RaySurveyorResults.1.05:/actors/80 -> (StoreKeeper) received 76144246 objects in total
    run1/RaySurveyorResults.1.06:/actors/81 -> (StoreKeeper) received 76079442 objects in total
    run1/RaySurveyorResults.1.07:/actors/82 -> (StoreKeeper) received 76091075 objects in total
    run1/RaySurveyorResults.1.08:/actors/83 -> (StoreKeeper) received 76083800 objects in total
    run1/RaySurveyorResults.1.09:/actors/84 -> (StoreKeeper) received 76055417 objects in total
    run1/RaySurveyorResults.1.10:/actors/85 -> (StoreKeeper) received 76078244 objects in total
    run1/RaySurveyorResults.1.11:/actors/86 -> (StoreKeeper) received 76099717 objects in total
    run1/RaySurveyorResults.1.12:/actors/87 -> (StoreKeeper) received 76084294 objects in total
    run1/RaySurveyorResults.1.13:/actors/88 -> (StoreKeeper) received 76063054 objects in total
    run1/RaySurveyorResults.1.14:/actors/89 -> (StoreKeeper) received 76107152 objects in total
    run1/RaySurveyorResults.1.15:/actors/90 -> (StoreKeeper) received 76071637 objects in total
    run1/RaySurveyorResults.1.16:/actors/91 -> (StoreKeeper) received 76080798 objects in total
    run1/RaySurveyorResults.1.17:/actors/92 -> (StoreKeeper) received 76078240 objects in total
    run1/RaySurveyorResults.1.18:/actors/93 -> (StoreKeeper) received 76100700 objects in total
    run1/RaySurveyorResults.1.19:/actors/94 -> (StoreKeeper) received 76095176 objects in total
    run1/RaySurveyorResults.1.20:/actors/95 -> (StoreKeeper) received 76128060 objects in total
    run1/RaySurveyorResults.1.21:/actors/96 -> (StoreKeeper) received 76035585 objects in total
    run1/RaySurveyorResults.1.22:/actors/97 -> (StoreKeeper) received 76058090 objects in total
    run1/RaySurveyorResults.1.23:/actors/98 -> (StoreKeeper) received 76114137 objects in total
    run1/RaySurveyorResults.1.24:/actors/99 -> (StoreKeeper) received 76103227 objects in total

sebhtml commented 10 years ago

Run 2:

    [boiseb01@ls30 Surveyor]$ grep received run2/RaySurveyorResults.1.*|grep total
    run2/RaySurveyorResults.1.00:/actors/75 -> (StoreKeeper) received 76102969 objects in total
    run2/RaySurveyorResults.1.01:/actors/76 -> (StoreKeeper) received 76147458 objects in total
    run2/RaySurveyorResults.1.02:/actors/77 -> (StoreKeeper) received 76082322 objects in total
    run2/RaySurveyorResults.1.03:/actors/78 -> (StoreKeeper) received 76086170 objects in total
    run2/RaySurveyorResults.1.04:/actors/79 -> (StoreKeeper) received 76064578 objects in total
    run2/RaySurveyorResults.1.05:/actors/80 -> (StoreKeeper) received 76144246 objects in total
    run2/RaySurveyorResults.1.06:/actors/81 -> (StoreKeeper) received 76079442 objects in total
    run2/RaySurveyorResults.1.07:/actors/82 -> (StoreKeeper) received 76091075 objects in total
    run2/RaySurveyorResults.1.08:/actors/83 -> (StoreKeeper) received 76083800 objects in total
    run2/RaySurveyorResults.1.09:/actors/84 -> (StoreKeeper) received 76055417 objects in total
    run2/RaySurveyorResults.1.10:/actors/85 -> (StoreKeeper) received 76078244 objects in total
    run2/RaySurveyorResults.1.11:/actors/86 -> (StoreKeeper) received 76099717 objects in total
    run2/RaySurveyorResults.1.12:/actors/87 -> (StoreKeeper) received 76084294 objects in total
    run2/RaySurveyorResults.1.13:/actors/88 -> (StoreKeeper) received 76063054 objects in total
    run2/RaySurveyorResults.1.14:/actors/89 -> (StoreKeeper) received 76107152 objects in total
    run2/RaySurveyorResults.1.15:/actors/90 -> (StoreKeeper) received 76071637 objects in total
    run2/RaySurveyorResults.1.16:/actors/91 -> (StoreKeeper) received 76080798 objects in total
    run2/RaySurveyorResults.1.17:/actors/92 -> (StoreKeeper) received 76078240 objects in total
    run2/RaySurveyorResults.1.18:/actors/93 -> (StoreKeeper) received 76100700 objects in total
    run2/RaySurveyorResults.1.19:/actors/94 -> (StoreKeeper) received 76095176 objects in total
    run2/RaySurveyorResults.1.20:/actors/95 -> (StoreKeeper) received 76128060 objects in total
    run2/RaySurveyorResults.1.21:/actors/96 -> (StoreKeeper) received 76035585 objects in total
    run2/RaySurveyorResults.1.22:/actors/97 -> (StoreKeeper) received 76058090 objects in total
    run2/RaySurveyorResults.1.23:/actors/98 -> (StoreKeeper) received 76114137 objects in total
    run2/RaySurveyorResults.1.24:/actors/99 -> (StoreKeeper) received 76103227 objects in total

sebhtml commented 10 years ago

The counts received by the StoreKeeper actors are exactly the same:

    [boiseb01@ls30 Surveyor]$ cat run1/RaySurveyorResults.1.*|grep total|sha1sum
    2e0c6046e3fe79f0466a9d03df3415e76cdc5eca -
    [boiseb01@ls30 Surveyor]$ cat run2/RaySurveyorResults.1.*|grep total|sha1sum
    2e0c6046e3fe79f0466a9d03df3415e76cdc5eca -
    [boiseb01@ls30 Surveyor]$ ls -ld run1 run2
    drwxrwxr-x. 2 boiseb01 boiseb01 4096 Nov 8 10:54 run1
    drwxrwxr-x. 2 boiseb01 boiseb01 4096 Nov 11 14:20 run2

sebhtml commented 10 years ago

Actor #88 has a different final count:

    [boiseb01@ls30 Surveyor]$ diff -u run1.final run2.final
    --- run1.final 2013-11-11 14:35:11.649892132 -0500
    +++ run2.final 2013-11-11 14:35:18.829899210 -0500
    @@ -11,7 +11,7 @@
     /actors/85 -> has 52597086 Kmer objects in MyHashTable instance (final)
     /actors/86 -> has 52594412 Kmer objects in MyHashTable instance (final)
     /actors/87 -> has 52606996 Kmer objects in MyHashTable instance (final)
    -/actors/88 -> has 52617078 Kmer objects in MyHashTable instance (final)
    +/actors/88 -> has 52617200 Kmer objects in MyHashTable instance (final)
     /actors/89 -> has 52602368 Kmer objects in MyHashTable instance (final)
     /actors/90 -> has 52577152 Kmer objects in MyHashTable instance (final)
     /actors/91 -> has 52587710 Kmer objects in MyHashTable instance (final)
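The same per-actor comparison can be done programmatically. The following is only a minimal sketch, assuming the log line format shown in the diff above; `parseFinalCounts` and `differingActors` are hypothetical helper names and are not part of Ray.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Parse lines such as
//   /actors/88 -> has 52617078 Kmer objects in MyHashTable instance (final)
// into a map from actor number to final Kmer count.
// Lines that do not carry a "(final)" count are ignored.
std::map<int, long long> parseFinalCounts(const std::string& log) {
    std::map<int, long long> counts;
    std::istringstream lines(log);
    std::string line;
    while (std::getline(lines, line)) {
        int actor = 0;
        long long count = 0;
        if (line.find("(final)") != std::string::npos &&
            std::sscanf(line.c_str(), "/actors/%d -> has %lld", &actor, &count) == 2) {
            counts[actor] = count;
        }
    }
    return counts;
}

// Return the actors whose final count differs between two runs,
// or that appear in only one of the two runs.
std::vector<int> differingActors(const std::map<int, long long>& a,
                                 const std::map<int, long long>& b) {
    std::vector<int> result;
    for (const auto& entry : a) {
        auto other = b.find(entry.first);
        if (other == b.end() || other->second != entry.second) {
            result.push_back(entry.first);
        }
    }
    for (const auto& entry : b) {
        if (a.find(entry.first) == a.end()) {
            result.push_back(entry.first);
        }
    }
    return result;
}
```

Feeding the contents of `run1.final` and `run2.final` to `parseFinalCounts` and passing both maps to `differingActors` would list the actors to investigate.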

sebhtml commented 10 years ago

Run 1 and run 2:

    [boiseb01@ls30 Surveyor]$ tail run1/RaySurveyorResults.1.13
    /actors/88 -> has 50000000 Kmer objects in MyHashTable instance
    /actors/88 -> (StoreKeeper) received 73000000 objects so far !
    /actors/88 -> has 51000000 Kmer objects in MyHashTable instance
    /actors/88 -> (StoreKeeper) received 74000000 objects so far !
    /actors/88 -> (StoreKeeper) received 75000000 objects so far !
    /actors/88 -> has 52000000 Kmer objects in MyHashTable instance
    /actors/88 -> (StoreKeeper) received 76000000 objects so far !
    /actors/88 -> (StoreKeeper) received 76063054 objects in total
    /actors/88 -> has 52617078 Kmer objects in MyHashTable instance (final)
    /actors/88 -> will now die (StoreKeeper)

    [boiseb01@ls30 Surveyor]$ tail run2/RaySurveyorResults.1.13
    /actors/88 -> has 50000000 Kmer objects in MyHashTable instance
    /actors/88 -> (StoreKeeper) received 73000000 objects so far !
    /actors/88 -> has 51000000 Kmer objects in MyHashTable instance
    /actors/88 -> (StoreKeeper) received 74000000 objects so far !
    /actors/88 -> (StoreKeeper) received 75000000 objects so far !
    /actors/88 -> has 52000000 Kmer objects in MyHashTable instance
    /actors/88 -> (StoreKeeper) received 76000000 objects so far !
    /actors/88 -> (StoreKeeper) received 76063054 objects in total
    /actors/88 -> has 52617200 Kmer objects in MyHashTable instance (final)
    /actors/88 -> will now die (StoreKeeper)

Difference:

    run2: 52617200
    run1: 52617078

Run 2 ends with 122 more Kmer objects in the hash table, although the same number of objects was received.

This only happens on MPI rank 13 (actor #88).

sebhtml commented 10 years ago

Final counts:

    [boiseb01@ls30 Surveyor]$ cat run1/*|grep final|sha1sum
    53f8a6345020acc56b1e6cb90880e88ee0483ff4 -
    [boiseb01@ls30 Surveyor]$ cat run2/*|grep final|sha1sum
    4d9cc1f6b9eb7141494f019f590849967de701e3 -
    [boiseb01@ls30 Surveyor]$ cat run3/*|grep final|sha1sum
    4d9cc1f6b9eb7141494f019f590849967de701e3 -

The totals are the same, however:

    [boiseb01@ls30 Surveyor]$ cat run1/*|grep total|sha1sum
    2e0c6046e3fe79f0466a9d03df3415e76cdc5eca -
    [boiseb01@ls30 Surveyor]$ cat run2/*|grep total|sha1sum
    2e0c6046e3fe79f0466a9d03df3415e76cdc5eca -
    [boiseb01@ls30 Surveyor]$ cat run3/*|grep total|sha1sum
    2e0c6046e3fe79f0466a9d03df3415e76cdc5eca -

sebhtml commented 10 years ago

On mp2, do a -run-surveyor with the same input 5 times.

sebhtml commented 10 years ago

Transferring data to mp2.

$ rsync -avzP /rap/nne-790-ab/projects/Legionella/Assemblage_echantillons_individuel boisver1@corbeil-mp2.rqchp.ca:/mnt/lustre03/corbeil/corbeil_group/nne-790-ab/projects

sebhtml commented 10 years ago

Will submit 5 jobs here:

mp2 /mnt/lustre03/corbeil/corbeil_group/nne-790-ab/projects/Legionella-Gramian-Matrix

Check if this is reproducible:

    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ pwd
    /mnt/lustre03/corbeil/corbeil_group/nne-790-ab/projects/Legionella-Gramian-Matrix
    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ showq | grep boisv
    40136 boisver1 Running 8 23:58:47 Tue Nov 12 11:54:14
    40137 boisver1 Running 8 23:59:39 Tue Nov 12 11:55:06
    40138 boisver1 Running 8 1:00:00:00 Tue Nov 12 11:55:27
    40139 boisver1 Running 8 1:00:00:00 Tue Nov 12 11:55:27

sebhtml commented 10 years ago

Not reproducible; the bug is not fixed yet. The matrices have the same number of lines in every run, but their checksums all differ:

    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ wc -l $(find .|grep ticket|grep Matrix)
    44 ./ticket-216-1/Surveyor/DistanceMatrix.tsv
    44 ./ticket-216-1/Surveyor/SimilarityMatrix.tsv
    44 ./ticket-216-4/Surveyor/DistanceMatrix.tsv
    44 ./ticket-216-4/Surveyor/SimilarityMatrix.tsv
    44 ./ticket-216-3/Surveyor/DistanceMatrix.tsv
    44 ./ticket-216-3/Surveyor/SimilarityMatrix.tsv
    44 ./ticket-216-2/Surveyor/DistanceMatrix.tsv
    44 ./ticket-216-2/Surveyor/SimilarityMatrix.tsv
    352 total
    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ sha1sum $(find .|grep ticket|grep Matrix)
    464f8d4baefac8f782097fa79b4a5888698df3b4 ./ticket-216-1/Surveyor/DistanceMatrix.tsv
    51cd2c651fe477370713a7d8d517b8758100f890 ./ticket-216-1/Surveyor/SimilarityMatrix.tsv
    a3b3d9f1edfa06c1d92e8e11c50da2b12c7b1603 ./ticket-216-4/Surveyor/DistanceMatrix.tsv
    20eb24faa14d03ab4e702b0d93b07188fe11b54b ./ticket-216-4/Surveyor/SimilarityMatrix.tsv
    f6c3c9bda029b14ed94098ea694b04b1d3bbcebb ./ticket-216-3/Surveyor/DistanceMatrix.tsv
    57879e74a97518abac2533eab3c820e9db4d965c ./ticket-216-3/Surveyor/SimilarityMatrix.tsv
    f1f411aa6a949a5621ec8182dd1465cd7d150cb9 ./ticket-216-2/Surveyor/DistanceMatrix.tsv
    1e7b7a7800dff6bf3345c35765ac698d2a12d388 ./ticket-216-2/Surveyor/SimilarityMatrix.tsv

sebhtml commented 10 years ago

For the 4 jobs:

The StoreKeeper actors received the same numbers of objects:

    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ grep received ticket-216-1.o|grep total|sort|sha1sum
    64bb3e13212895bac2a1896236b9480cc654b7ff -
    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ grep received ticket-216-2.o|grep total|sort|sha1sum
    64bb3e13212895bac2a1896236b9480cc654b7ff -
    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ grep received ticket-216-3.o|grep total|sort|sha1sum
    64bb3e13212895bac2a1896236b9480cc654b7ff -
    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ grep received ticket-216-4.o|grep total|sort|sha1sum
    64bb3e13212895bac2a1896236b9480cc654b7ff -

The counts in the hash table are the same too:

    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ grep final ticket-216-1.o|sort|sha1sum
    49464e0fea9fd84c0095ec64ce1ba417964ef97f -
    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ grep final ticket-216-2.o|sort|sha1sum
    49464e0fea9fd84c0095ec64ce1ba417964ef97f -
    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ grep final ticket-216-3.o|sort|sha1sum
    49464e0fea9fd84c0095ec64ce1ba417964ef97f -
    [boisver1@ip03-mp2 Legionella-Gramian-Matrix]$ grep final ticket-216-4.o|sort|sha1sum
    49464e0fea9fd84c0095ec64ce1ba417964ef97f -

sebhtml commented 10 years ago

Number of payloads is the same.

    [boisver1@cp2359-mp2 Surveyor]$ grep payload run-1/RaySurveyorResults.1.00
    /actors/150 -> MatrixOwner received 32400 payloads
    [boisver1@cp2359-mp2 Surveyor]$ grep payload run-2/RaySurveyorResults.1.00
    /actors/150 -> MatrixOwner received 32400 payloads

sebhtml commented 10 years ago

Found the problem; testing a patch on colosse:

    $ msub Run.Job.sh

    10547784
    $ pwd
    /home/sboisver12/git-clones/Ray-TestSuite/Ray-Technology-Research/Surveyor
    $ hostname
    colosse1

sebhtml commented 10 years ago

Summary:

To check:

sebhtml commented 10 years ago

Found the problem.

sebhtml commented 10 years ago

(I think)

sebhtml commented 10 years ago

To include in patch:

In Searcher.cpp (this code works):

    VirtualKmerColorHandle newVirtualColor=m_colorSet.getVirtualColorFrom(virtualColorHandle,color);

    node->setVirtualColor(newVirtualColor);

    m_colorSet.incrementReferences(newVirtualColor);
    m_colorSet.decrementReferences(virtualColorHandle);

In StoreKeeper.cpp:

    VirtualKmerColorHandle newVirtualColor= m_colorSet.getVirtualColorFrom(oldVirtualColor, sampleColor);

    graphVertex->setVirtualColor(newVirtualColor);

    m_colorSet.decrementReferences(oldVirtualColor);
    m_colorSet.incrementReferences(newVirtualColor);

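The problem with the second snippet is that oldVirtualColor is purged during decrementReferences if its number of references drops to 0, thereby losing some data. The Searcher.cpp snippet increments the references of the new virtual color before it decrements those of the old one.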
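To make the ordering hazard concrete, here is a minimal sketch of a reference-counted color set. It is not Ray's ColorSet: `ToyColorSet`, `create`, `colorCount` and `colorsAfterReAdding` are hypothetical, and the sketch assumes that `getVirtualColorFrom` hands back the same handle when the sample color is already present, which is only one way the purge could lose data.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical handle type; stands in for VirtualKmerColorHandle.
using Handle = std::size_t;

// Toy reference-counted store of virtual colors (sets of sample colors).
class ToyColorSet {
public:
    // Create a virtual color with the given sample colors and 0 references.
    Handle create(const std::set<int>& colors) {
        m_colors.push_back(colors);
        m_references.push_back(0);
        return m_colors.size() - 1;
    }

    // Return a handle for (colors of `old`) plus `color`.
    // Assumption: if `color` is already present, the same handle is returned.
    Handle getVirtualColorFrom(Handle old, int color) {
        if (m_colors[old].count(color) > 0) {
            return old;
        }
        std::set<int> merged = m_colors[old];
        merged.insert(color);
        return create(merged);
    }

    void incrementReferences(Handle h) { ++m_references[h]; }

    // Purge the stored colors as soon as the reference count reaches 0.
    void decrementReferences(Handle h) {
        if (--m_references[h] == 0) {
            m_colors[h].clear();
        }
    }

    std::size_t colorCount(Handle h) const { return m_colors[h].size(); }

private:
    std::vector<std::set<int>> m_colors;
    std::vector<int> m_references;
};

// Re-add an already present sample color to a virtual color that is
// referenced by a single vertex, then report how many colors survive.
std::size_t colorsAfterReAdding(bool incrementFirst) {
    ToyColorSet set;
    Handle old = set.create({1, 2});
    set.incrementReferences(old);  // one vertex uses this virtual color

    Handle updated = set.getVirtualColorFrom(old, 2);  // same handle as `old`
    if (incrementFirst) {
        set.incrementReferences(updated);  // order of the Searcher.cpp snippet
        set.decrementReferences(old);
    } else {
        set.decrementReferences(old);      // order of the StoreKeeper.cpp snippet
        set.incrementReferences(updated);
    }
    return set.colorCount(updated);
}
```

In this toy model, `colorsAfterReAdding(true)` returns 2 and `colorsAfterReAdding(false)` returns 0: decrementing first lets the count touch 0, the colors are purged, and the later increment cannot bring them back.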

sebhtml commented 10 years ago

The problem is fixed. I staged the changes and am running some more tests just to be sure.

ColorSet.cpp remains unmodified!

sebhtml commented 10 years ago

The test passes:

    [boiseb01@ls30 Surveyor]$ sha1sum RaySurveyorResults-1/Surveyor/SimilarityMatrix.tsv
    e93f9f9a7a525e916b616c21509f3f7d461a7739 RaySurveyorResults-1/Surveyor/SimilarityMatrix.tsv
    [boiseb01@ls30 Surveyor]$ sha1sum RaySurveyorResults-2/Surveyor/SimilarityMatrix.tsv
    e93f9f9a7a525e916b616c21509f3f7d461a7739 RaySurveyorResults-2/Surveyor/SimilarityMatrix.tsv
    [boiseb01@ls30 Surveyor]$ sha1sum RaySurveyorResults-1/Surveyor/DistanceMatrix.tsv
    9c1c5bd18911679db78341b99829ca9493e57e70 RaySurveyorResults-1/Surveyor/DistanceMatrix.tsv
    [boiseb01@ls30 Surveyor]$ sha1sum RaySurveyorResults-2/Surveyor/DistanceMatrix.tsv
    9c1c5bd18911679db78341b99829ca9493e57e70 RaySurveyorResults-2/Surveyor/DistanceMatrix.tsv

sebhtml commented 10 years ago

4ad25e62c960c534717282b6bb4e53d7442fa4fa