glennhickey / progressiveCactus

Distribution package for the Progressive Cactus multiple genome aligner. Dependencies are linked as submodules.

Problem with tycoon database?? #108

Open tdlong opened 6 years ago

tdlong commented 6 years ago

I am having a weird problem with Progressive Cactus, although it has normally worked for me in its current configuration. I submit a job (aligning human, mouse, rat, and a Peromyscus mouse), it runs for several days, and then it gets locked in some sort of death loop: once it fails, it just keeps failing. The problems in the log file seem to start here. Does this error message give you any idea what I am doing wrong? It is difficult for me to interpret the error (the node has over 500 GB of memory).

Got message from job at time: 1525843963.25 : Starting reference phase target with index 0 at 1525843963.2 seconds (recursing = 1)
Got message from job at time: 1525843973.33 : Blocking on ktserver <kyoto_tycoon database_dir="/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB_tempSecondaryDatabaseDir_0.695478896174" database_name="Anc0.kch" in_memory="1" port="2078" snapshot="0" /> with killPath /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/gTD1/tmp_6GL2ml6EOK/tmp_9EWsl0wYJh_kill.txt
Got message from job at time: 1525844868.99 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 3964997259434639716 total bases: 83317999 total-ends: 1770 total-caps: 9266 max-end-degree: 74 max-adjacency-length: 2489366 total-blocks: 0 total-groups: 1 total-edges: 1442 total-free-ends: 6 total-attached-ends: 1764 total-chains: 0 total-link groups: 0
Got message from job at time: 1525844868.99 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 3964997259434640271 total bases: 75659437 total-ends: 1788 total-caps: 10268 max-end-degree: 133 max-adjacency-length: 18132749 total-blocks: 0 total-groups: 1 total-edges: 1347 total-free-ends: 24 total-attached-ends: 1764 total-chains: 0 total-link groups: 0
Got message from job at time: 1525844868.99 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 3964997259434640067 total bases: 58824370 total-ends: 2787 total-caps: 14716 max-end-degree: 104 max-adjacency-length: 9797885 total-blocks: 0 total-groups: 1 total-edges: 2253 total-free-ends: 63 total-attached-ends: 2724 total-chains: 0 total-link groups: 0
Got message from job at time: 1525844914.87 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 4947485665643181168 total bases: 83317999 total-ends: 124726 total-caps: 1049126 max-end-degree: 74 max-adjacency-length: 1100541 total-blocks: 60543 total-groups: 57688 total-edges: 64313 total-free-ends: 1876 total-attached-ends: 1764 total-chains: 3474 total-link groups: 55437
Got message from job at time: 1525844981.29 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 430515976878946939 total bases: 75659437 total-ends: 105496 total-caps: 977896 max-end-degree: 133 max-adjacency-length: 18090926 total-blocks: 50881 total-groups: 47607 total-edges: 55100 total-free-ends: 1970 total-attached-ends: 1764 total-chains: 3617 total-link groups: 45036
Got message from job at time: 1525845179.69 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 430515976878946123 total bases: 58824370 total-ends: 127341 total-caps: 1817786 max-end-degree: 104 max-adjacency-length: 9787512 total-blocks: 60028 total-groups: 50800 total-edges: 70779 total-free-ends: 4561 total-attached-ends: 2724 total-chains: 6932 total-link groups: 45031
Got message from job at time: 1525845270.82 : Adding an oversize flower 4947485665643181168 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'>
Got message from job at time: 1525845397.24 : Adding an oversize flower 3964997259434639716 for target class <class
'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'> Got message from job at time: 1525845445.12 : Adding an oversize flower 430515976878946939 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'> Got message from job at time: 1525845511.73 : Adding an oversize flower 430515976878946123 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'> Got message from job at time: 1525845542.07 : Adding an oversize flower 3964997259434640271 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'> Got message from job at time: 1525845620.7 : Adding an oversize flower 3964997259434640067 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'> The job seems to have left a log file, indicating failure: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t0/t1/t1/t1/job Reporting file: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t0/t1/t1/t1/log.txt log.txt: 9 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 4 1 58969007620880249 66 49 30 260 27 65 30 450 30 30 49 30 30 104 30 27 30 49 68 49 30 30 88 39 30 39 30 49 30 49 30 27 49 30 30 77 27 49 60 30 61 30 163 68 85 215 63 88 213 27 590 49 49 75 72 38 24 38 24 69 159 36 53 53 40 62 30 85 24 40 65 33 24 24 40 24 33 24 38 69 42 138 64 40 88 85 315 27 46 30 27 27 30 30 199 49 68 68 55 71 49 30 42 46 84 49 131 30 79 30 55 30 30 27 49 30 27 49 202 30 50 27 30 152 119 40 127 87 30 30 46 72 52 68 66 106 24 40 56 65 24 24 24 51 76 45 87 265 36 36 36 33 1958 67 71 64 24 24 43 30 58 84 27 56 93 27 87 30 106 49 87 49 304 30 110 30 77 49 30 30 27 58 68 49 49 87 39 68 27 64 27 27 147 49 507 30 88 49 93 52 30 77 30 71 239 27 30 30 49 87 30 66 77 27 144 30 29132660089536400 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 4 1 45035996273704294 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 1 9 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 1 ...stuff deleted... 
160 97 27 54 421 54 63 54 27 51 54 39 73 116 71 54 73 30 60 30 30 30 30 51 350 27 39 39 30 39 46 30 30 54 27 39 102 63 73 30 51 92 49 97 27 27 30 69 61 94 55 27 54 54 27 54 209 49 73 94 135 27 73 135 108 73 100 70 54 51 379 175 54 63 27 63 27 66 30 54 65 49 39 54 30 49 103 97 106 51 54 30 30 39 567 30 283 71 73 30 49 116 30 70 30 27 30 51 39 27 144 30 54 30 66 41 39 73 94 54 109 27 115 66 68 197 30 30 44 8022036836287588 127 49 30 27 84 30 27 47 112 27 49 74 103 30 148 48 30 30 30 68 97 156 30 85 97 30 54 185 30 49 137 49 54 73 51 58 54 30 52 82 27 73 52 54 30 92 30 54 49 51 30 98 30 27 254 30 30 30 73 68 92 30 30 27 52 178 140 73 27 30 30 49 49 27 27 101 152 30 49 87 27 58 58 30 66 68 27 27 49 49 30 39 30 27 64 27 27 49 68 27 30 39 86 2 7 30 86 68 30 68 30 30 39 27 46 27 30 49 49 68 49 30 30 30 39 39 30 27 30 98 61 77 49 48 49 63 42 27 30 30 68 30 39 27 49 68 42 168 46 30 30 104 30 27 49 61 49 30 69 30 47 77 27 49 30 49 47 30 49 47 49 87 30 157 67 30 46 27 58 87 52 49 30 27 264 49 100 360 49 49 30 30 30 87 46 30 113 27 116 73 71 54 30 54 51 30 73 49 49 27 122 47 55 97 30 27 30 27 63 70 92 39 119 54 71 54 54 51 46 51' exited with non-zero status 128 log.txt: Exiting the slave because of a failed job on host compute-4-43.local log.txt: Due to failure we are reducing the remaining retry count of job /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t2/t0/t2/t0/job to 0 log.txt: We have set the default memory of the failed job to 34359738368 bytes Job: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t2/t0/t2/t0/job is completely failed The job seems to have left a log file, indicating failure: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t2/t0/t2/t1/job Reporting file: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t2/t0/t2/t1/log.txt log.txt: ---JOBTREE SLAVE OUTPUT LOG--- log.txt: Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.1.255.117 with error: network error log.txt: Uncaught exception log.txt: Traceback (most recent call last): log.txt: File "/data/apps/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 271, in main log.txt: defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth) log.txt: File "/data/apps/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 153, in execute log.txt: self.target.run() log.txt: File "/data/apps/progressiveCactus/submodules/cactus/pipeline/cactus_workflow.py", line 804, in run log.txt: bottomUpPhase=True) log.txt: File "/data/apps/progressiveCactus/submodules/cactus/shared/common.py", line 382, in runCactusAddReferenceCoordinates log.txt: popenPush(command, stdinString=flowerNames) log.txt: File "/data/apps/progressiveCactus/submodules/sonLib/bioio.py", line 224, in popenPush log.txt: raise RuntimeError("Command: %s with stdin string '%s' exited with non-zero status %i" % (command, stdinString, sts)) log.txt: RuntimeError: Command: cactus_addReferenceCoordinates --cactusDisk ' log.txt: log.txt: log.txt: ' --secondaryDisk ' log.txt: <kyoto_tycoon database_dir="/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB_tempSecondaryDatabaseDir_0.695478896174" database_name="Anc0.kch" host="10.1.255.117" in_memory="1" port="2078" sna pshot="0" /> log.txt: ' --logLevel CRITICAL --referenceEventString Anc0 --bottomUpPhase with stdin string '8712 8994814355765721555 79 30 49 63 73 49 54 73 54 73 51 55 30 68 75 54 116 54 92 92 140 54 94 63 51 73 54 92 54 94 30 27 46 27 53 65 30 87 27 27 77 49 
27 113 71 174 12666373951930501 30 49 71 58 30 30 56 116 137 41 73 30 30 27 92 30 97 39 54 30 27 101 116 30 30 54 73 30 54 54 82 30 104 41 30 42 27 30 73 42 30 100 46 93 54 30 150 97 97 54 109 70 94 49 130 49 97 49 39 1 49 30 30 39 49 41 49 49 30 49 30 30 49 30 27 30 65 30 54 39 49 46 73 120 65 49 30 54 27 93 27 63 27 134 47 30 49 39 30 77 27 77 54 92 30 30 92 30 52 49 71 30 46 27 73 64 51 94 190 97 54 54 54 51 97 30 73 30 71 30 73 97 39 97 144 66 46 49 30 30 30 46 212 73 80 54 80 49 30 30 49 71 39 54 93 54 49 54 54 49 30 71 134 30 54 30 49 30 49 67 73 160 54 30 27 73 73 54 73 30 30 2354 49 51 30 87 27 39 27 42 30 30 46 135 104 97 63 30 30 30 51 30 92 116 118 95 54 73 39 68 92 63 196 46 49 30 27 30 183 30 30 30 30 55 27 49 71 39 129 30 30 73 30 30 63 41 27 41 54 54 7740561859526669 92 73 159 116 54 54 30 63 49 30 159 30 49 122 55 27 30 49 30 171 54 178 69 73 127 30 30 30 46 96 49 85 54 54 51 116 30 39 106 30 73 55 30 68 84 93 159 97 41 103 111 27 49 30 99 30 98 73 39 27 27 183 27 240 49 54 52 39 39 135 60 30 30 30 27 73 30 27 30 27 30 65 97 70 54 116 27 194 30 30 39 126 54 73 114 30 52 106 30 51 51 73 54 68 80 54 68 30 116 39 51 63 218 54 27 30 106 94 27 30 30 106 30 132 54 39 30 228 46 131 27 55 30 30 48 49 30 47 51 71 178 97 27 51 142 54 74 154 30 94 73 30 30 30 30 30 49 30 30 130 73 30 670 30 27 54 51 30 65 27 116 30 30 27 49 46 49 85 47 27 30 54 54 54 63 74 92 27 39 30 30 121 54 54 53 73 70 54 73 107 49 147 49 27 30 30 94 49 30 42 30 94

joelarmstrong commented 6 years ago

Hmm, it sounds like the database crashed, or at least became inaccessible. That could be for any number of reasons, but if it's not just some random fluke, it's likely crashed because it ran out of memory. Once the database is gone, it definitely enters the sort of "death loop" you're talking about. The only way out is to stop and restart from the last subproblem, with the --restart flag. That should work if the database crash was caused by some fluke, but if it's a memory issue it would just crash after a few days again.
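For anyone who lands here with the same symptom: a minimal sketch (mine, not part of cactus; assumes Python 3, and that the host/port from the <kyoto_tycoon .../> line in your jobTree log are still the ones in use) for checking whether the ktserver is even reachable before deciding between a plain restart and a memory hunt.

    import socket
    import sys

    # Host and port copied from the <kyoto_tycoon ... port="..."/> line in the
    # jobTree log above; adjust them to match your own run.
    HOST = "10.1.255.117"
    PORT = 2078

    def ktserver_reachable(host, port, timeout=5.0):
        """Return True if something accepts a TCP connection on host:port."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        if ktserver_reachable(HOST, PORT):
            print("ktserver still reachable on %s:%d" % (HOST, PORT))
            sys.exit(0)
        print("nothing listening on %s:%d -- the database server is probably gone" % (HOST, PORT))
        sys.exit(1)

If the port still answers, the "network error" was more likely transient; if nothing is listening, the death loop described above is probably in play.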

Can you share the contents of these files (database logs), if they exist?

/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB/ktout.log

/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB_tempSecondaryDatabaseDir_0.695478896174/ktout.log

tdlong commented 6 years ago

/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB/ktout.log

2018-05-01T09:45:48.483191-08:00: [SYSTEM]: ================ [START]: pid=33941
2018-05-01T09:45:48.483330-08:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
2018-05-01T09:45:48.486521-08:00: [SYSTEM]: starting the server: expr=10.1.255.117:1978
2018-05-01T09:45:48.486666-08:00: [SYSTEM]: server socket opened: expr=10.1.255.117:1978 timeout=200000.0
2018-05-01T09:45:48.486707-08:00: [SYSTEM]: listening server socket started: fd=11

/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB_tempSecondaryDatabaseDir_0.695478896174/ktout.log

2018-05-08T21:32:43.529278-08:00: [SYSTEM]: ================ [START]: pid=21467
2018-05-08T21:32:43.529540-08:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
2018-05-08T21:32:43.560365-08:00: [SYSTEM]: starting the server: expr=10.1.255.117:2078
2018-05-08T21:32:43.560467-08:00: [SYSTEM]: server socket opened: expr=10.1.255.117:2078 timeout=200000.0
2018-05-08T21:32:43.560492-08:00: [SYSTEM]: listening server socket started: fd=11
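A note on those log lines (my reading, not from the thread): path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p is the Kyoto Cabinet/Tycoon tuning string the ktserver was started with, and, as I understand it, msiz=50g sizes the memory-mapped region rather than capping the whole process. A small sketch, assuming the ktout.log format shown above, to pull those settings out so different runs can be compared:

    import re
    import sys

    # Usage: python parse_ktout.py /path/to/ktout.log [more logs ...]
    # Assumes the "opening a database: path=..." line format shown above.

    TUNING_RE = re.compile(r"opening a database: path=(\S+)")

    def tuning_params(ktout_path):
        """Return a dict of the '#key=value' tuning options from a ktout.log."""
        params = {}
        with open(ktout_path) as fh:
            for line in fh:
                m = TUNING_RE.search(line)
                if not m:
                    continue
                for field in m.group(1).split("#"):
                    if "=" in field:
                        key, value = field.split("=", 1)
                        params[key] = value
        return params

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            print(path, tuning_params(path))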

tdlong commented 6 years ago

I was able to get a job that had worked before to run to completion. So perhaps it has something to do with the length of my input sequences (long scaffolds as opposed to short scaffolds), or just with the total run time. I will try a restart.


tdlong commented 6 years ago

I am not sure how to use the --restart flag:

runProgressiveCactus.sh --maxThreads 32 --restart pero.txt PCwork PCwork/pero.hal

...

Usage: runProgressiveCactus.sh [options]

Required Arguments:

File containing newick tree and sequence paths (see documentation or examples for format).
Working directory (which can grow extremely large)
Path of output alignment in .hal format.

progressiveCactus.py: error: no such option: --restart

The job seems to have left a log file, indicating failure: /share/adl/tdlong/peromyscus/Progressive/jobTree/jobs/job

joelarmstrong commented 6 years ago

Oops, sorry about that. I was thinking of the newer toil syntax. progressiveCactus should restart just fine without the --restart option.
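Before re-running, it can be reassuring to confirm that the old jobTree still has state to resume. A rough sketch along those lines, assuming the layout visible in the failure messages above (each outstanding job leaves a file literally named job under jobTree/jobs/); treat the count as a sanity check, not an authoritative status:

    import os
    import sys

    # Usage: python count_jobtree_jobs.py /share/.../PCwork/jobTree
    # Counts files named "job" under jobTree/jobs/, which is where the failure
    # messages above point; a non-zero count suggests there is state to resume.

    def count_job_files(jobtree_dir):
        jobs_dir = os.path.join(jobtree_dir, "jobs")
        n = 0
        for dirpath, dirnames, filenames in os.walk(jobs_dir):
            n += sum(1 for name in filenames if name == "job")
        return n

    if __name__ == "__main__":
        jobtree = sys.argv[1]
        print("%d job file(s) found under %s" % (count_job_files(jobtree), jobtree))

If I remember correctly, jobTree also ships a jobTreeStatus script that reports this more thoroughly.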

tdlong commented 6 years ago

It has been running a few days now from the restart, and it seems to be running a number of lastz jobs in parallel. I will see if it crashes again and let you know either way. If it does crash, it seems difficult to determine from the logs whether it failed at the same "step" or not.

Regarding memory problems: was the step where this likely occurred a perfect-storm type of event, where the right combination of lastz jobs running in parallel overloaded the node? Or was it some other step, post-lastz, where more memory is needed?

Thanks.

Tony


tdlong commented 6 years ago

Yes, it is running out of memory despite being on a node with 500 GB (see below). My computing people offered me a 1.5 TB node, but in my experience, if 500 GB is not enough, 1.5 TB may not be either. This is odd to me: several months ago, on this same node, I ran Progressive Cactus on these 4 species (human, rat, mouse, Peromyscus) plus hamster. The difference is that we are now running on the 4 species, except that Peromyscus NOW has chromosome-length scaffolds, whereas before it had an N50 of 4 Mb. So is there something about the length of the FASTA records that uses more memory? Is there any sort of work-around that people are using? It seems odd that scaffolding a genome suddenly creates a memory problem, although clearly my insight into the inner workings of Progressive Cactus is limited.

Tony

############################

Yes, you are running out of memory. I am running htop and memory usage is slowly increasing.

Also this:

Out of memory: Kill process 33946 (ktserver) score 138 or sacrifice child
Killed process 33946 (ktserver) total-vm:75785848kB, anon-rss:70737612kB, file-rss:120kB
cactus_addRefer[73157]: segfault at 0 ip 0000000000421890 sp 00007fff502a54a8 error 4 in cactus_addReferenceCoordinates[400000+65000]
cactus_addRefer[73428]: segfault at 0 ip 0000000000421890 sp 00007ffcf1c40308 error 4 in cactus_addReferenceCoordinates[400000+65000]
cactus_addRefer[73167]: segfault at 0 ip 0000000000421890 sp 00007ffd9416b698 error 4 in cactus_addReferenceCoordinates[400000+65000]
cactus_addRefer[73161]: segfault at 0 ip 0000000000421890 sp 00007fff0bc2b628 error 4 in cactus_addReferenceCoordinates[400000+65000]
cactus_addRefer[71617]: segfault at 0 ip 0000000000421890 sp 00007ffdf91927d8 error 4 in cactus_addReferenceCoordinates[400000+65000]

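Just to put the kernel's numbers in human units (a straight conversion of the figures above, nothing more):

    # Figures from the "Killed process 33946 (ktserver)" line above; the kernel
    # reports sizes in kB (KiB).
    anon_rss_kb = 70737612
    total_vm_kb = 75785848

    print("anon-rss: %.1f GiB" % (anon_rss_kb / 1024.0 / 1024.0))  # ~67.5 GiB resident
    print("total-vm: %.1f GiB" % (total_vm_kb / 1024.0 / 1024.0))  # ~72.3 GiB virtual

So the in-memory ktserver alone was holding roughly 67 GiB when it was killed, noticeably more than the msiz=50g hint in the ktout.log further up, which fits the out-of-memory reading.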
joelarmstrong commented 6 years ago

Hmm, weird. The N50 of the assemblies shouldn't really have much of an effect on the memory usage. Are all the assemblies soft masked? Unmasked assemblies can cause really high memory usage because of the amount of alignments, though it usually causes problems at an earlier stage.
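A quick, non-authoritative way to double-check the masking question is simply to count lower-case bases (the convention used for soft-masking in the assemblies discussed here). A sketch in plain Python:

    import sys

    # Usage: python masked_fraction.py genome1.fa [genome2.fa ...]
    # Reports the fraction of non-N bases that are lower-case (soft-masked).

    def masked_fraction(fasta_path):
        masked = total = 0
        with open(fasta_path) as fh:
            for line in fh:
                if line.startswith(">"):
                    continue
                for base in line.strip():
                    if base in "Nn":
                        continue
                    total += 1
                    if base.islower():
                        masked += 1
        return masked, total

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            masked, total = masked_fraction(path)
            frac = masked / float(total) if total else 0.0
            print("%s: %.1f%% of non-N bases soft-masked" % (path, 100.0 * frac))

Values in the 30-50% range, like the Proportion-repeat-masked figures in the preprocessing log later in this thread, are roughly what soft-masked mammalian genomes look like; a value near zero would point at an unmasked input.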

tdlong commented 6 years ago

All 4 genomes are soft-masked (with lower case letters).

I think my next steps are to

  1. rerun without the new scaffolded assembly
  2. rerun with the newer assembly as contigs rather than scaffolds (see the splitting sketch below)
  3. rerun with the new scaffold assembly and just rat

These 3 experiments will hopefully allow me to minimally reproduce the problem.
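For experiment 2, one way to get a contig-level FASTA back out of the scaffolded assembly is to split each record at runs of Ns. A rough sketch (not from the thread; the 10-N minimum gap and the _contigN naming are arbitrary choices):

    import re
    import sys

    # Usage: python split_scaffolds.py scaffolds.fa > contigs.fa
    # Splits each scaffold at runs of >= 10 Ns and writes the pieces as contigs.

    GAP = re.compile(r"[Nn]{10,}")

    def read_fasta(handle):
        name, chunks = None, []
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
        if name is not None:
            yield name, "".join(chunks)

    def main():
        with open(sys.argv[1]) as fh:
            for name, seq in read_fasta(fh):
                for i, piece in enumerate(GAP.split(seq)):
                    if piece:  # skip empty pieces from leading/trailing gaps
                        print(">%s_contig%d" % (name, i + 1))
                        for j in range(0, len(piece), 80):
                            print(piece[j:j + 80])

    if __name__ == "__main__":
        main()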


joelarmstrong commented 6 years ago

Yes, that sounds great. Thanks for helping to track this down!

tdlong commented 6 years ago

I am running some experiments now. I am aligning rat to my scaffolded genome, and rat to contigs only.

Right now the program is in the phase where lastz is running on all the cores I give it. When I monitor the job with htop, memory use is slowly increasing over time. This seems odd to me: presumably lastz jobs are starting and finishing, so why would memory just keep climbing? It almost looks like a good old-fashioned leak.

Tony
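On the leak question: one way to separate "many lastz jobs each holding a bit" from "one process growing without bound" is to sample per-process resident memory over time rather than watching htop. A sketch that assumes the third-party psutil package is installed and that matching on process names containing lastz, ktserver or cactus is good enough for this pipeline:

    import time
    import psutil  # third-party; pip install psutil

    # Every 60 s, print the total resident memory of matching processes,
    # plus the single largest offender.

    PATTERNS = ("lastz", "ktserver", "cactus")

    def snapshot():
        rows = []
        for p in psutil.process_iter(["name", "memory_info"]):
            try:
                name = p.info["name"] or ""
                if any(pat in name for pat in PATTERNS):
                    rows.append((p.info["memory_info"].rss, p.pid, name))
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        return rows

    if __name__ == "__main__":
        while True:
            rows = snapshot()
            total_gib = sum(r[0] for r in rows) / 1024.0 ** 3
            if rows:
                big = max(rows)
                print("%s  %d procs  total %.1f GiB  biggest: pid %d %s %.1f GiB"
                      % (time.strftime("%H:%M:%S"), len(rows), total_gib,
                         big[1], big[2], big[0] / 1024.0 ** 3))
            else:
                print("%s  no matching processes" % time.strftime("%H:%M:%S"))
            time.sleep(60)

If the "biggest" process stays small while the total climbs, it is concurrency; if one pid keeps growing across samples, it looks more like a leak.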

tdlong commented 6 years ago

Progressive Cactus continues to crash. It runs for about 24 hours (depending on the number of cores) and then bad things happen; log below. It is so strange, as it used to just work.

I have some scientific computing people helping me now as well. (So if it is a weird system / software interaction perhaps we can figure it out).

Thanks. Tony.

more cactus.log

2018-05-18 10:20:52.571398: Beginning Progressive Cactus Alignment

Got message from job at time: 1526664090.64 : Before running any preprocessing on the assembly: /share/adl/tdlong/peromyscus/Progressive/genomes/RMout/peromyscus_assembly_scaffolds.fasta.masked got following stats (assembly may be listed as temp file if input sequences from a directory): Input-sample: /share/adl/tdlong/peromyscus/Progressive/genomes/RMout/peromyscus_assembly_scaffolds.fasta.masked Total-sequences: 1856 Total-length: 2475164510 Proportion-repeat-masked: 0.362917 ProportionNs: 0.000456 Total-Ns: 1128351 N50: 114273790 Median-sequence-length: 17852 Max-sequence-length: 193658164 Min-sequence-length: 1000

Got message from job at time: 1526664094.11 : Before running any preprocessing on the assembly: /share/adl/tdlong/peromyscus/Progressive/genomes/mouse_all.fasta got following stats (assembly may be listed as temp file if input sequences from a directory): Input-sample: /share/adl/tdlong/peromyscus/Progressive/genomes/mouse_all.fasta Total-sequences: 66 Total-length: 2730871774 Proportion-repeat-masked: 0.467471 ProportionNs: 0.028595 Total-Ns: 78088274 N50: 130694993 Median-sequence-length: 184189 Max-sequence-length: 195471971 Min-sequence-length: 1976

Got message from job at time: 1526664096.55 : Before running any preprocessing on the assembly: /share/adl/tdlong/peromyscus/Progressive/genomes/rn5.fa got following stats (assembly may be listed as temp file if input sequences from a directory): Input-sample: /share/adl/tdlong/peromyscus/Progressive/genomes/rn5.fa Total-sequences: 2739 Total-length: 2909698938 Proportion-repeat-masked: 0.494283 ProportionNs: 0.115766 Total-Ns: 336845215 N50: 154597545 Median-sequence-length: 1018 Max-sequence-length: 290094216 Min-sequence-length: 280

Got message from job at time: 1526664101.15 : Before running any preprocessing on the assembly: /share/adl/tdlong/peromyscus/Progressive/genomes/hg38.fa got following stats (assembly may be listed as temp file if input sequences from a directory): Input-sample: /share/adl/tdlong/peromyscus/Progressive/genomes/hg38.fa Total-sequences: 455 Total-length: 3209286105 Proportion-repeat-masked: 0.544857 ProportionNs: 0.049846 Total-Ns: 159970322 N50: 145138636 Median-sequence-length: 161218 Max-sequence-length: 248956422 Min-sequence-length: 970

Process Process-57:
Traceback (most recent call last):
  File "/data/apps/progressiveCactus/python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/data/apps/progressiveCactus/python/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/data/apps/progressiveCactus/submodules/jobTree/batchSystems/singleMachine.py", line 51, in worker
    slaveMain()
  File "/data/apps/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 371, in main
    truncateFile(tempSlaveLogFile)
  File "/data/apps/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 38, in truncateFile
    if os.path.getsize(fileNameString) > tooBig:
  File "/data/apps/progressiveCactus/python/lib/python2.7/genericpath.py", line 57, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/tmp/tmpibNTFi/slave_log.txt'

Process Process-43:
Traceback (most recent call last):
  File "/data/apps/progressiveCactus/python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/data/apps/progressiveCactus/python/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/data/apps/progressiveCactus/submodules/jobTree/batchSystems/singleMachine.py", line 51, in worker
    slaveMain()
  File "/data/apps/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 371, in main
    truncateFile(tempSlaveLogFile)
  File "/data/apps/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 38, in truncateFile
    if os.path.getsize(fileNameString) > tooBig:
  File "/data/apps/progressiveCactus/python/lib/python2.7/genericpath.py", line 57, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/tmp/tmpe6Be/slave_log.txt'

Process Process-54:
Traceback (most recent call last):
etc….