glennhickey / progressiveCactus

Distribution package for the Prgressive Cactus multiple genome aligner. Dependencies are linked as submodules
Other
80 stars 26 forks source link

kyoto tycoon network error #45

Open cooketho opened 9 years ago

cooketho commented 9 years ago

I'm trying to align 8 vertebrate genomes with progressiveCactus, but I've encountered an error that seems to be related to kyoto tycoon (see log file below). From what I can see from the comments in some of the C code from sonLib, this is a known issue: (https://github.com/benedictpaten/sonLib/blob/db59f3a8ede037b934c78691af78a85a1cd42ee9/C/impl/sonLibKVDatabase_BigRecordFile.c)

Should I try using non-default kyoto tycoon options, e.g. --ktCreateTuning or --ktOpenTuning?

Here is the command I ran: runProgressiveCactus.sh budgie_wga.txt budgie_wga budgie_wga/budgie_wga.hal --batchSystem gridEngine --retryCount 100 --stats --defaultMemory=8000000000 --maxMemory=1500000000000

log.txt: ---JOBTREE SLAVE OUTPUT LOG--- log.txt: Throwing a KT exception with the string network error log.txt: Exception: ST_KV_DATABASE_EXCEPTION: An unknown database error occurred when getting a bulk set of flowers log.txt: caused by: ST_KV_DATABASE_EXCEPTION: stKVDatabase_BulkGetRecords with 762 records failed log.txt: caused by: ST_KV_DATABASE_EXCEPTION: kyoto tycoon get bulk record failed: network error log.txt: Uncaught exception log.txt: Traceback (most recent call last): log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 271, in main log.txt: defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 153, in execute log.txt: self.target.run() log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/cactus/pipeline/cactus_workflow.py", line 777, in run log.txt: makeScaffolds=self.getOptionalPhaseAttrib("makeScaffolds", bool)) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/cactus/shared/common.py", line 363, in runCactusReference log.txt: popenPush(command, stdinString=flowerNames) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/sonLib/bioio.py", line 223, in popenPush log.txt: raise RuntimeError("Command: %s with stdin string '%s' exited with non-zero status %i" % (command, stdinString, sts)) log.txt: RuntimeError: Command: cactus_reference --cactusDisk ' log.txt: log.txt: log.txt: ' --logLevel CRITICAL --matchingAlgorithm blossom5 --referenceEventString Anc6 --permutations 10 --useSimulatedAnnealing --theta 0.0001 --maxWa lkForCalculatingZ 100000 --ignoreUnalignedGaps --wiggle 0.9999 --numberOfNs 10 --minNumberOfSequencesToSupportAdjacency 1 --makeScaffolds with stdin string '762 342498 7516615240792 562949953435316 169 44 79 11428 56 304 422212465038867 4739 384 993 1954 1903 873 122 757 422212465057192 1180 502 130 1275 82 168 1492 819 96 1548112371 909585 1145 967 271 229 107 422212465070038 5178 1125899906826090 333 38 108 1004 437 3537 141 149 138 141 281474976712003 120 38 70 703687441773859 422212465058331 10 1 1255 434 140737488349934 129 96 89 129 109 844424930130969 430 927 95 6578 44 71 26 75 93 273 2720 32 32 1407374883556101 3236962232184007 57 140737488338338 422212465061906 9703 1608 207 844424930129525 57 2955487255449189 127 5814 53 428 151 339 6270 449 969 342 147 1970324836970546 703687441773563 111 985162418478887 5938 103 29 51 38 220 83 1227 3008 217 96 2996 76 82 187 953 562949953406806 127 1335 47 5619 5357 960 691 1125899906827674 190 422212465084933 57 207 7009 985162418462306 255 11810 1125899906833185 817 170 75 1270 136 330 143 111 694 542 252 2948 335 140737488351299 4362862139019509 3236962232164138 63 1872 2489 94 1266637395190001 164 261 79 803 482 1377 4022 194 309 239 69 435 665 115 85 29 220 194 70 703687441774645 337 128 158 454 103 57 7543 304 4259 271 120 32 1548112371906725 413 38 1662 2955487255442892 3118 1216 17362 2135 162 257 69 133 63 192 351 815 152 122 985162418458587 67 90 270 1558 10786 111 249 285 330 3236962232157810 4914 175 127 1967 5210 169 83 4309 127 131 119 543 1862 109 63 1530 57 255 95 363 185 2451 928 335 79 2392537302014886 73 51 32 2115 610 83 422212465071238 96 5961 187 1688849860253845 71 50 10393 41 128 2392537302033302 281474976705175 1304 1615 764 26 32 4081387162297224 50 721 76 452 3205 151 497 162 464 102 69 57 90 103 935 200 1688849860261326 83 2858 47 147 1439 533 97 431 348 112 1407374883550963 447 79 165 2874 1548112371906989 32 114 233 122 2352 2356 2251799813680222 1529 562949953409630 126 962 1232 132 121 129 109 1971 44 26534 1829587348590350 32 13196 985162418487734 3762 633 90 77 64 70 4549 2814749767085040 189 202 11122 70 1407374883543674 164 41 1754 450 17550 985162418470489 82 97 32 5207287069147598 1125899906845034 985162418490878 422212465057324 148 6109 32 1829587348609530 189 113 665 226 422212465075069 422212465059583 98 105 38 44 140737488352036 2980 32 5159 703687441770193 175 415 422212465064393 95 844424930128652 560 2969 1829587348613607 484 99 12002 140737488352142 79 479 422212465077766 140737488352110 51 76 139 83 63 3940649673930933 93 2658 181 79 169 183 319 207 75 544 8175 134 121 96 110 713 73 66 414 79 41 44 201 226 1688849860247698 501 69 7971 5934 38 57 703687441777067 589 56 395 422212465055112 47 175 111 53 641 169 1122 1355 1259 109 615 70 281474976705786 751 2293 281474976719148 135 143 10884 121 236 131 1382 70 1688849860249271 985162418471553 26 135 60 228 422212465080829 294 44 9173 198 77 174 88 610 79 91 72 174 140737488342707 41 57 703687441773809 1125899906845556 26 77 142 132 140737488342340 508 26 125 12547 239 197 630 47 281474976701177 5099 109 351 41 26 83 2761 902 475 422212465068783 60 13547 844424930112792 145 1125899906836802 122 562949953418943 1210 155 148 79 287 95 156 356 197 4175 95 703687441795863 1125899906811540 192 3799912185630906 135 703687441739816 64 116 233 538 1391 32 1210 2516 73 98 204 38 1138 41 720 440 914 368 489 63 53 703687441769906 38 139 103 316 91 129 77 38 57 1497 562949953429511 60 6807 149 240 281474976695992 13342 164 370 1350 841 348 454 151 371 148 703687441755352 3528 65 50 73 51 148 66 38 216 323 2251799813682548 8721 2533274790381721 393 90 63 325 7222 140737488347312 8194 83 140737488352664 111 2899 109 985162418478757 212 317 26733 281474976685026 422212465071028 562949953424924 1829587348607459 54 7881 57 735 47 3863 63 873 376 77 164 1012 844424930122734 107 281474976705344 73 106 1686 1065 587 1024 2992 668 2210 841 422212465081365 226 4343 281474976680113 2790 619 57 265 17562 235 106 8127 167 26 38 243 908 1688849860236787 107 16289 1491 145 1266637395180276 111 95 281474976717553 562949953408570 95 76 887 203 63 140737488356061 70 32 2649 4604 793 168 96 844424930137687 12076 470 1064 776 32 32 159 2674012278720458 108 103 660 95 69 62 145 130 249 90 102 164 193 137 1550 138 117 85 69 1647 659 75 6673 103 112 62 422212465052360 70 8255 70 422212465069094 44 921 38 32 26 38 96 524 799 537 76 240 188 159 2143 208 164 1548112371891101 38 242 10601 133 422212465055237 392 43884 221 927 3377699720493164 164 1125899906851132 562949953417829 9324 202 844424930108945 8927 161 2888 140737488358913 703687441758184 73 38 232 50 2690 475 69 57 703687441782280 107 2251799813681161 654 109' exited with non-zero status 128

cooketho commented 8 years ago

Hi!

I'm still stuck with the "network error" message. Is there something I can do? Perhaps, I could help you reproduce the issue? Please let me know if there's anything I can do to help. For example is there a way to save the KT database to a file? I'm afraid I wouldn't know where to start as far as debugging this one, but it would be nice to be able to start at the step that failed, which is apparently querying the database. Thanks!

glennhickey commented 8 years ago

Hi,

The kyoto tycoon "network errors" are really tough to debug, as they seem to arise from a variety of different problems. The network errors mentioned in the file you linked to in your first message were due to timeouts writing large records to the database (and the workaround is to keep records > 10M on disk). Since your error is coming from a read, I don't think its too related. But you can lower this limit (to 5M?) (simplest way may just be editing default value and recompiling as the option is hidden from the runProgressiveCactus.sh interface it seems)

https://github.com/benedictpaten/sonLib/blob/master/C/impl/sonLibKVDatabaseConf.c#L184

Also: kyoto tycoon keeps log files for each server (1 per ancestral node). maybe there's a clue in one of these?

find | grep ktout.log

If you can pare down your alignment to just two species and reproduce, maybe we'll be able to quickly reproduce on our end..

On Tue, Nov 10, 2015 at 3:13 PM, cooketho notifications@github.com wrote:

Hi!

I'm still stuck with the "network error" message. Is there something I can do? Perhaps, I could help you reproduce the issue? Please let me know if there's anything I can do to help. For example is there a way to save the KT database to a file? I'm afraid I wouldn't know where to start as far as debugging this one, but it would be nice to be able to start at the step that failed, which is apparently querying the database. Thanks!

— Reply to this email directly or view it on GitHub https://github.com/glennhickey/progressiveCactus/issues/45#issuecomment-155552445 .

cooketho commented 8 years ago

Thanks for the advice! I pared it down to two species, and set the values for getXMLMaxKTRecordSize and getXMLMaxKTBulkSetSize to 5M, recompiled, and now I'm getting a slightly different error. Any thoughts?

log.txt: ---JOBTREE SLAVE OUTPUT LOG--- log.txt: Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.210.12.92 with error: network error log.txt: Uncaught exception log.txt: Traceback (most recent call last): log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 271, in main log.txt: defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 153, in execute log.txt: self.target.run() log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/cactus/pipeline/cactus_workflow.py", line 538, in run log.txt: self.runCactusCafInWorkflow(alignmentFile=self.phaseNode.attrib["alignments"]) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/cactus/pipeline/cactus_workflow.py", line 504, in runCactusCafInWorkflow log.txt: realignArguments=self.getOptionalPhaseAttrib("realignArguments")) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/cactus/shared/common.py", line 239, in runCactusCaf log.txt: masterMessages = popenCatch(command, stdinString=flowerNames) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/sonLib/bioio.py", line 211, in popenCatch log.txt: raise RuntimeError("Command: %s with stdin string '%s' exited with non-zero status %i" % (command, stdinString, sts)) log.txt: RuntimeError: Command: cactus_caf --cactusDisk ' log.txt: log.txt: log.txt: ' --logLevel CRITICAL --alignments /srv/gsfs0/projects/bustamante/cooketho_projects/avian/genome/20150830_8_species_alignment/budgie_wga/jobTree/jobs/t1/gTD3/tmp_fois1pErnM/alignments.cigar --annealingRounds '128' --deannealingRounds '2 8' --trim '0 0' --minimumTreeCoverage 0.0 --blockTrim 5 --minimumDegree 2 --minimumIngroupDegree 2 --minimumOutgroupDegree 0 --singleCopyOutgroup --lastzArguments '--step=1 --ambiguous=iupac,100,100 --ydrop=3000 --identity=25.0' --minimumSequenceLengthForBlast 30 --maxAdjacencyComponentSizeRatio 50.0 --minLengthForChromosome 1000000 --proportionOfUnalignedBasesForNewChromosome 0.8 --maximumMedianSequenceLengthBetweenLinkedEnds 1000 --realign --realignArguments '--gapGamma 0.9 --diagonalExpansion 4 --splitMatrixBiggerThanThis 10 --constraintDiagonalTrim 0 --alignAmbiguityCharacters --splitIndelsLongerThanThis 99' with stdin string ' 1 0 ' exited with non-zero status 128

cooketho commented 8 years ago

I tried to align a subset of the chicken genome to a subset of the budgie genome to roughly define the limit of what progressiveCactus is able to do without having the "network error". Each of the two fasta files contains about 16 million lines. I simply used "head" to grab the first 4 million lines of each. The alignment completed successfully with 4M lines. OK great!

However, it failed with "network error" when I tried 6 million lines of each. This is just a hunch, but I would guess that there is a step invovling kyoto tycoon that requires a certain amount of memory, and jobTree is not giving the job enough. Part of the traceback in the error log is:

log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 271, in main log.txt: defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)

I set defaultMemory=8000000000 in runProgressiveCactus.sh, but maybe that's not enough? I guess I could increase that, but my understanding is not every step requires kyoto tycoon, and thus probably doesn't need more than 8G. Can't I instead specify that the jobs involving kyoto tycoon get more memory? Like can I use --ktCreateTuning or --ktOpenTuning to set bounds on the database size, and then have progressiveCactus pass those bounds to jobTree to modify the amount of memory requested per job so it's not just requesting the default all the time? My system has plenty of memory--that's not the problem. It's just that it uses a gridEngine job scheduler, and if I request like 100G for every job, they're going to remain in the queue a long time because I have to share with other users and can't just hog all the memory to myself. Any help/advice would be much appreciated. Thanks!

cooketho commented 8 years ago

This time I reduced the value of this variable: https://github.com/benedictpaten/sonLib/blob/master/C/impl/sonLibKVDatabaseConf.c#L184 to 1000000 and recompiled. I then set the KT tuning options to #bnum=30m#msiz=1000g. This worked with the 6M line genome subsets described in the previous post. When I tried to align the entire genomes the "network error" was gone, but a new error cropped up (see below). Please let me know if you have any suggestions.

log.txt: ---JOBTREE SLAVE OUTPUT LOG--- log.txt: Traceback (most recent call last): log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 271, in main log.txt: defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 153, in execute log.txt: self.target.run() log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/cactus/pipeline/cactus_workflow.py", line 538, in run log.txt: self.runCactusCafInWorkflow(alignmentFile=self.phaseNode.attrib["alignments"]) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/cactus/pipeline/cactus_workflow.py", line 504, in runCactusCafInWorkflow log.txt: realignArguments=self.getOptionalPhaseAttrib("realignArguments")) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/cactus/shared/common.py", line 239, in runCactusCaf log.txt: masterMessages = popenCatch(command, stdinString=flowerNames) log.txt: File "/home/cooketho/bin/progressiveCactus/submodules/sonLib/bioio.py", line 211, in popenCatch log.txt: raise RuntimeError("Command: %s with stdin string '%s' exited with non-zero status %i" % (command, stdinString, sts)) log.txt: RuntimeError: Command: cactus_caf --cactusDisk ' log.txt: log.txt: log.txt: ' --logLevel CRITICAL --alignments /srv/gsfs0/projects/bustamante/cooketho_projects/avian/genome/20150830_8_species_alignment/budgie_wga/jobTree/jobs/t1/gTD3/tmp_JmOEmKvDOu/alignments.cigar --annealingRounds '128' --deannealingRounds '2 8' --trim '0 0' --minimumTreeCoverage 0.0 --blockTrim 5 --minimumDegree 2 --minimumIngroupDegree 2 --minimumOutgroupDegree 0 --singleCopyOutgroup --lastzArguments '--step=1 --ambiguous=iupac,100,100 --ydrop=3000 --identity=25.0' --minimumSequenceLengthForBlast 30 --maxAdjacencyComponentSizeRatio 50.0 --minLengthForChromosome 1000000 --proportionOfUnalignedBasesForNewChromosome 0.8 --maximumMedianSequenceLengthBetweenLinkedEnds 1000 --realign --realignArguments '--gapGamma 0.9 --diagonalExpansion 4 --splitMatrixBiggerThanThis 10 --constraintDiagonalTrim 0 --alignAmbiguityCharacters --splitIndelsLongerThanThis 99' with stdin string ' 1 0 ' exited with non-zero status -11

cooketho commented 8 years ago

Quick update on the latest post. On this line: https://github.com/benedictpaten/sonLib/blob/master/bioio.py#L210 There is a call to subprocess.Popen.wait() that has arguments stdin=subprocess.PIPE, stdout=subprocess.PIPE. The documentation for subprocess has the following warning for Popen.wait(): "This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that." Do you think this is why the popenCatch function is generating a RuntimeError with the big data set, but not the small one?

joannarifkin commented 6 years ago

Hi! I think I'm getting a similar error to this series, and was wondering if cooketho's steps solved the problem or if there's something else I should do. I'm trying to align a bad draft assembly of Rumex rothschildianus to a fairly distant species, Fagopyrum tataricum.

The kt.out log file looks normal: 2018-04-10T14:35:45.421114-05:00: [SYSTEM]: ================ [START]: pid=15358 2018-04-10T14:35:45.421272-05:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p 2018-04-10T14:35:45.421471-05:00: [SYSTEM]: starting the server: expr=142.150.214.91:1979 2018-04-10T14:35:45.421834-05:00: [SYSTEM]: server socket opened: expr=142.150.214.91:1979 timeout=200000.0 2018-04-10T14:35:45.421863-05:00: [SYSTEM]: listening server socket started: fd=13 2018-04-10T15:22:56.935211-05:00: [SYSTEM]: server stopped 2018-04-10T15:22:56.945276-05:00: [SYSTEM]: finishing the server 2018-04-10T15:22:56.955037-05:00: [SYSTEM]: closing the server socket 2018-04-10T15:22:57.455250-05:00: [SYSTEM]: closing a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p 2018-04-10T15:22:57.530174-05:00: [SYSTEM]: ================ [FINISH]: pid=15358

The log.txt in the t1 job file is: Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 142.150.214.91 with error: network error Uncaught exception Traceback (most recent call last): File "/ohta/joanna.rifkin/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 271, in main defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth) File "/ohta/joanna.rifkin/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 153, in execute self.target.run() File "/ohta/joanna.rifkin/progressiveCactus/submodules/cactus/pipeline/cactus_workflow.py", line 590, in run self.runCactusCafInWorkflow(alignmentFile=self.phaseNode.attrib["alignments"]) File "/ohta/joanna.rifkin/progressiveCactus/submodules/cactus/pipeline/cactus_workflow.py", line 556, in runCactusCafInWorkflow maxRecoverableChainLength=self.getOptionalPhaseAttrib("maxRecoverableChainLength", int)) File "/ohta/joanna.rifkin/progressiveCactus/submodules/cactus/shared/common.py", line 292, in runCactusCaf masterMessages = popenCatch(command, stdinString=flowerNames) File "/ohta/joanna.rifkin/progressiveCactus/submodules/sonLib/bioio.py", line 212, in popenCatch raise RuntimeError("Command: %s with stdin string '%s' exited with non-zero status %i" % (command, stdinString, sts)) RuntimeError: Command: cactus_caf --cactusDisk ' ' --logLevel CRITICAL --alignments /ohta/joanna.rifkin/progressiveCactus/JRFiles/4-9RothBuckwheat/jobTree/jobs/t1/gTD3/tmp_lAhWJfRCJM/alignments.cigar --annealingRounds '128' --deannealingRounds '2 8' --trim '0 0' --minimumTreeCoverage 0.0 --blockTrim 5 --minimumDegree 2 --minimumIngroupDegree 1 --minimumOutgroupDegree 0 --alignmentFilter relaxedSingleCopyOutgroup --lastzArguments '--step=1 --ambiguous=iupac,100,100 --ydrop=3000 --identity=25.0' --minimumSequenceLengthForBlast 30 --maxAdjacencyComponentSizeRatio 50.0 --minLengthForChromosome 1000000 --proportionOfUnalignedBasesForNewChromosome 0.8 --maximumMedianSequenceLengthBetweenLinkedEnds 1000 --realign --realignArguments '--gapGamma 0.0 --matchGamma 0.9 --diagonalExpansion 4 --splitMatrixBiggerThanThis 10 --constraintDiagonalTrim 0 --alignAmbiguityCharacters --splitIndelsLongerThanThis 99' --phylogenyNumTrees 30 --phylogenyRootingMethod 'bestRecon' --phylogenyScoringMethod 'reconCost' --phylogenyBreakpointScalingFactor 1.0 --phylogenySkipSingleCopyBlocks --phylogenyMaxBaseDistance 100 --phylogenyMaxBlockDistance 50 --phylogenyTreeBuildingMethod guidedNeighborJoining,splitDecomposition --phylogenyCostPerDupPerBase 0.00 --phylogenyCostPerLossPerBase 0.02 --referenceEventHeader 'Anc0' --phylogenyDoSplitsWithSupportHigherThanThisAllAtOnce 0.44 --numTreeBuildingThreads 2 --phylogeny --minimumBlockDegreeToCheckSupport 10 --minimumBlockHomologySupport 0.05 --removeRecoverableChains unequalNumberOfIngroupCopies --minimumNumberOfSpecies 1 --phylogenyHomologyUnitType 'chain' --phylogenyDistanceCorrectionMethod 'jukesCantor' --maxRecoverableChainsIterations 5 --maxRecoverableChainLength 500000 with stdin string ' 1 0 ' exited with non-zero status 128 Exiting the slave because of a failed job on host esc3055m-big-wright.eeb.utoronto.ca

And the cactus.log claims that the alignment finished (which it clearly didn't).

Thoughts?