lonelyjoeparker / qmul-genome-convergence-pipeline

API and binaries for phylogenomic analyses, particularly comparison of input trees/alignments (CONTEXT) and detecting genomic convergence

Failure in PAML methods in cluster implementation #1

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
*uk.ac.qmul.sbcs.evolution.convergence.runners.SimpleCongruenceTestRunnerAAWithBinariesArgsPruningSimulation
*uk.ac.qmul.sbcs.evolution.convergence.runners.VerySimpleCongruenceAnalysisAAWithBinariesPruning

*When running either of the above main classes on the analysis cluster, runs 
seem to fail unpredictably.

*Instead of the script completing and the SSLS.out files being harvested, it fails:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 39, Size: 39
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at uk.ac.qmul.sbcs.evolution.convergence.util.stats.DataSeries.getValueAtPercentile(DataSeries.java:111)
        at uk.ac.qmul.sbcs.evolution.convergence.analyses.VerySimpleCongruenceAnalysisAAWithBinariesPruning.go(VerySimpleCongruenceAnalysisAAWithBinariesPruning.java:210)
        at uk.ac.qmul.sbcs.evolution.convergence.runners.VerySimpleCongruenceTestRunnerAAwithBinariesArgsPruning.main(VerySimpleCongruenceTestRunnerAAwithBinariesArgsPruning.java:47)
cp: cannot stat `/mnt/lustre_0/data/bioinfo/parker_comparative_gen//qENSG00000196323_ZBTB44_000_22t.gb.fasta/*SSLS.out': No such file or directory
runJava.sh.e7421111 (END) 

*This is the percentile bin initialisation call 
(VerySimpleCongruenceAnalysisAAWithBinariesPruning.java:210):
    treeOnePercentiles[i] = treeOnelnL.getValueAtPercentile(intervals[i]);

*This is based on the lnf output file:
AAAAAAAAAAAAAAAAAAA,-3.0793245
AAAAAAAATAAAAAAAAAA,-9.235496
AAAAAAATAAAAAAAAAAA,-8.1014805
AAAAATAAAAAAAAAAAAA,-9.234972
CCCCCCCCCCCCCCCCCCC,-3.4246657
DDDDDDDDDDDDDDDDDDD,-3.2277539
EEEEEEEDEEEEEEEEEED,-13.326104
EEEEEEEEEEEEEEEEEEE,-2.9686382
FFFFFFFFFFFFFFFFFFF,-3.1835272
GGGGGGGGGGGGGGGGGGG,-3.3200476
GGGSGGGGGGGGGGGGGGG,-9.7324
GSGGGSSGSGGGGGGGVGS,-25.752645
HHHHHHHHHHHHHHHHHHH,-3.640338
IIIIIIIIIIIIIIIIIII,-3.0975497
IIIIIIIIIIIIIIIIIIV,-8.252373
IIIIIIIIIIVIIIIIIII,-8.94412
IIIIIIILILIIILIIIII,-14.711445
IIIIILIIIIIIIIIIIII,-8.920112
IIITIIIIIIIIIIIIIII,-10.033751
KKKKKKKKKKKKKKKKKKK,-3.0620947
LLLLLLLLLLLLLLLLLLL,-2.8253393
MMMMMMMMMMMMMMMMMMM,-3.5090034
NNNNNNNNNNNNNNNNNNN,-3.0301375
NNNNNNNSNDNNNNNNNNN,-13.5817795
NNNNNNNSSNNNNNNNNNN,-11.741971
PPPPPPPPPPPPPPPPPPP,-3.439486
PPPPPPPQPPPPPPPPPPP,-10.001085
QQQQQQQQQQQQQQQQQQQ,-3.5316145
RRRRRRRRRRRRRRRRRRR,-3.1233888
SSSGGSSSSSSSGSSSSSS,-10.51453
SSSSSSSNSSSSSSSSSSS,-6.800735
SSSSSSSPSSSSSSSSSSS,-8.206925
SSSSSSSSSSSSSSSSSSS,-2.203274
TTTTTTTTTTTTTTTTTTT,-2.8269684
VVVVVIILVVVVVVVVVVV,-18.612541
VVVVVVVVAVVVVVVVMVV,-16.448462
VVVVVVVVVVVVVVVVVVV,-2.6731076
WWWWWWWWWWWWWWWWWWW,-5.362063
YYYYYYYYYYYYYYYYYYY,-4.0220547

-1
201
39
39 39

*A successful run reports no error when reading its lnf file:

trying to read site patterns' lnL from /mnt/lustre_0/data/bioinfo/parker_comparative_gen/qENSG00000198646_NCOA6_001_22t.gb.fasta/lnf
trying to read site patterns' lnL from /mnt/lustre_0/data/bioinfo/parker_comparative_gen/qENSG00000198646_NCOA6_001_22t.gb.fasta/lnf
BasicFileReader read 78 lines.
AAAAAAAAAAAAAAAAAAAAAA,-3.8907454
AAAAAAAAAAASAAAAAAAAAA,-7.9757957
AAAAAAAAAATAAAAAAAAAAA,-7.859626
AAAAAAAAAATVAAAAAAAAAA,-12.404156
AAAAAPAAAAAAAAAATAAAAA,-15.135917
AATSATAAAASAAAAAASAAAA,-24.02477
DDDDDDDDDDDDDDDDDDDDDD,-4.5745797
DDDDDDDDDDEDDDDDDDDDDD,-11.75444
FFFFFFFFFFFFFFFFFFFFFF,-3.8137336
GGGGGGGGDGGGGGGGGGGGGG,-9.726164
GGGGGGGGEGGGGGGGGGGGGG,-13.648058
GGGGGGGGGGGGGGGGGGGGGG,-2.5597928
GGGGGGGGGGNNGGGGGGGGGG,-8.499877
HHHHHHHHHHHHHHHHHHHHHH,-4.386964
IIIIIIIIIIIIIIIIIIIIII,-5.052157
KKKKKKKKKKKKKKKKKKKKKK,-6.030787
LLLLLLLLLLILLLLLLLLLLL,-9.102385
LLLLLLLLLLLLLLLLLLLLLL,-4.5820546
LLLLLLLLLLMLLLLLLLLLLL,-6.8478928
LMMMMMMMMMMMMMMMMMMMMM,-8.7924385
MIIIIIIIIIIIIIIIIIIIII,-9.548888
MIMMMMIMMMMMMMMMMMMMMM,-10.341057
MMIMMMMMMMMMMMMMMMMMMM,-8.937092
MMMLMMMMMMMMMMMMMMMMMM,-8.7956915
MMMMIMMMMMMMMMMMMMMMMM,-9.643981
MMMMMMMIMMMMMMMMMMMMMM,-10.345242
MMMMMMMMMMMMMLMMMMMMMM,-9.202817
MMMMMMMMMMMMMMMMMMLMMM,-8.789189
MMMMMMMMMMMMMMMMMMMMMM,-2.855346
MMMMMMMMMMMVMMMMMMMMMM,-7.696151
NNNNNNNNNNNNNNNNNNNNNN,-3.1810155
NNNNNNNNNNNSNNNNNNNNNN,-7.0849977
NNNNNNNNSNNNNNNNNNNNNN,-7.4562263
NNNNNNSNNNNNNNNNNNNNNN,-8.380901
NNNSNNNNSNNNNNNNNNNNNN,-12.250305
NNSNNNNNNNNNNNNNNNNNNS,-9.075439
NSNNNNNNNNNNNNNNNNNNNN,-7.975057
NSNNNSNNTSNNNNSNNNSNNS,-32.856766
PPPPPPPPPPAPPPPPPPPSPP,-11.780955
PPPPPPPPPPPPPPPPPPPPPP,-2.5061538
PPPPPPPPPPPPPPPPPPTPPP,-9.603517
PPPPPPPPPPSPPPPPPPPPPP,-6.1660933
PPPPPPPPPPSPSPPPSPPPPP,-12.987471
PPPPPPPPPPTPPPPPPPPPSP,-13.598078
PPPPPPPPPTPPPPPPPPPPPP,-10.702509
PPPPPPPSPPPSPPPSPPPPPP,-12.706645
PPPPPSPPPPAPPPPPPPPPPP,-11.781636
PPPPPSPPPPPPPPPPPPPPPP,-7.4907827
PPSPPPPPPPPPPPPPPPPPPP,-7.9130235
PQQQQQQQQQQQQQQQQQQQQQ,-8.417741
QQQQQQQQQQPQQQQQQQQQQQ,-6.272202
QQQQQQQQQQQAQQQQQQQQQQ,-8.45761
QQQQQQQQQQQQQQQQQQQQPQ,-8.833154
QQQQQQQQQQQQQQQQQQQQQQ,-2.5318978
RRRRRRRRRRRRRRRRRRRRRR,-5.035617
SSSSSPSSSSSSSSPSSSSPSS,-15.187174
SSSSSSNSSSSSSSSSNSSSSS,-13.609008
SSSSSSSSSASSSSPSSSAPSS,-19.340054
SSSSSSSSSSASSSSSSSSSSS,-6.8887844
SSSSSSSSSSGSSNSSSSNNSS,-20.51894
SSSSSSSSSSNNGSSSSSSSSS,-14.600571
SSSSSSSSSSNSSSSSSSSSSS,-6.033004
SSSSSSSSSSSSSSSSSSSSSN,-8.552961
SSSSSSSSSSSSSSSSSSSSSS,-3.518865
SSTTTSTTTTSATTTTTTTAST,-26.638514
TTTATTTTTTGSTTTTTTTTTT,-17.706703
TTTTTNTTTTTTTTTTTTTTTT,-8.677601
TTTTTTTTTTPTTTTTTTTTTT,-7.907708
TTTTTTTTTTTTTTTTTTTSTT,-8.0814085
TTTTTTTTTTTTTTTTTTTTTT,-4.1121874
VVMVVVVVVVVVVVVVVVVMVV,-12.708546
VVVVVVVVVVMVVVVVVVVVVV,-6.3951426
VVVVVVVVVVVVVVVVVVVVVV,-3.271133

-1
136
73
73 73

*The failure seems to occur after a successful lnf file read, when trying to get percentile values from the DataSeries object.
*It only seems to happen sequentially in the listed files (see debug.xlsx); applying a bounds check may be sufficient. If this is a computational limit, though, I will need to talk to the cluster owner.
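
As a rough illustration of the kind of check meant here, the sketch below shows one way a percentile lookup can run off the end of a list, and a clamped version that cannot. PercentileSketch and its index formula are hypothetical stand-ins: the real DataSeries.getValueAtPercentile code is not shown in this report and may compute its index differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for DataSeries; NOT the pipeline's actual class. */
final class PercentileSketch {
    private final List<Double> sorted;

    PercentileSketch(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values supplied");
        }
        this.sorted = new ArrayList<>(values);
        Collections.sort(this.sorted); // ascending order
    }

    /** Assumed index formula: percentile 100 maps to index == size, which throws. */
    double uncheckedValueAtPercentile(int percentile) {
        int index = (int) Math.floor(percentile / 100.0 * sorted.size());
        return sorted.get(index);
    }

    /** Same formula, but with the percentile validated and the index clamped. */
    double valueAtPercentile(int percentile) {
        if (percentile < 0 || percentile > 100) {
            throw new IllegalArgumentException("percentile out of range: " + percentile);
        }
        int index = (int) Math.floor(percentile / 100.0 * sorted.size());
        index = Math.min(index, sorted.size() - 1); // never read past the last element
        return sorted.get(index);
    }
}
```

Under this assumed formula, a series of 39 values queried at percentile 100 asks for index 39 of a 39-element list, which is the shape of the exception in the trace above.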

Original issue reported on code.google.com by joeparke...@gmail.com on 28 Nov 2011 at 12:17

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by joeparke...@gmail.com on 28 Nov 2011 at 12:19

GoogleCodeExporter commented 9 years ago
This problem stems from an insufficient number of sites in the analysis; specifically, it occurs where the number of site patterns is significantly (how much?) smaller than the number of bins in the DataSeries object used to hold the data.

For the moment a fix of sorts has been applied, by catching all uninitialised 
bins at the end of the binning process (fall-through) and giving them a lnL of 
0. 
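
A minimal sketch of that fall-through idea, for reference. The class, the binning rule and the use of NaN as an "uninitialised" marker are all assumptions made for illustration; the committed fix in DataSeries may look quite different.

```java
import java.util.Arrays;

/** Hypothetical illustration of the fall-through fix; not the committed code. */
final class BinFallThroughSketch {

    /**
     * Spreads the sorted values over nBins bins (assumed rule: value i goes to
     * bin i * nBins / n). With fewer values than bins, some bins get nothing.
     */
    static double[] bin(double[] values, int nBins) {
        if (nBins < 1) {
            throw new IllegalArgumentException("nBins must be positive");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);

        double[] bins = new double[nBins];
        Arrays.fill(bins, Double.NaN); // NaN marks a bin that has not been initialised

        for (int i = 0; i < sorted.length; i++) {
            int b = (int) ((long) i * nBins / sorted.length);
            bins[b] = sorted[i]; // the last value landing in a bin wins
        }

        // Fall-through: any bin still uninitialised is given an lnL of 0.
        for (int b = 0; b < nBins; b++) {
            if (Double.isNaN(bins[b])) {
                bins[b] = 0.0;
            }
        }
        return bins;
    }
}
```

Note that a filled-in 0 is indistinguishable from a genuine lnL of 0 downstream, which is part of why this is only a stopgap.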

Long-term, however, it is worth doing one or both of the following:
a) Rejecting DataSeries instantiation calls with nPatterns ≤100;
b) Introducing a more flexible DataSeries constructor that assigns bin sizes based on an appropriate algorithm.
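
Both options could be sketched roughly as below. Everything here is hypothetical: BinnedSeriesSketch, its factory methods and the "cap the bin count at the number of patterns" rule are placeholders, since no particular algorithm has been chosen for (b).

```java
import java.util.Arrays;

/** Hypothetical sketch of options (a) and (b); not the real DataSeries API. */
final class BinnedSeriesSketch {
    final double[] sorted;
    final int nBins;

    private BinnedSeriesSketch(double[] values, int nBins) {
        this.sorted = values.clone();
        Arrays.sort(this.sorted);
        this.nBins = nBins;
    }

    /** Option (a): refuse to build a series from 100 or fewer site patterns. */
    static BinnedSeriesSketch strict(double[] values, int nBins) {
        if (values.length <= 100) {
            throw new IllegalArgumentException(
                "need more than 100 site patterns, got " + values.length);
        }
        return new BinnedSeriesSketch(values, nBins);
    }

    /**
     * Option (b): choose the bin count from the data. The rule used here
     * (never more bins than patterns) is only one assumed possibility.
     */
    static BinnedSeriesSketch flexible(double[] values, int maxBins) {
        if (values.length == 0 || maxBins < 1) {
            throw new IllegalArgumentException("need at least one pattern and one bin");
        }
        return new BinnedSeriesSketch(values, Math.min(maxBins, values.length));
    }
}
```

With the 39-pattern lnf file from the original report, strict(...) would reject the input outright, while flexible(..., maxBins) would fall back to 39 bins whenever maxBins is larger than that.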

Original comment by joeparke...@gmail.com on 9 Jan 2012 at 4:52

GoogleCodeExporter commented 9 years ago
Subsequent revisions have corrected this.

Original comment by joeparke...@gmail.com on 30 Oct 2012 at 12:44