statisticalbiotechnology / quandenser

QUANtification by Distillation for ENhanced Signals with Error Regulation
Apache License 2.0

Dinosaur index error #17

Closed andrewjmc closed 4 years ago

andrewjmc commented 4 years ago

Over the weekend the rerun (now with 16 GB of RAM allocated to Dinosaur) hit a snag with an array index error. I'm sure it's not your fault, and I'll rerun in the hope that it doesn't recur. I think there are still about 20-30 links to process.

Best wishes,

Andrew

java -Xmx16G -jar "C:\Program Files\quandenser-v0-02\share\java/Dinosaur-1.1.3.free.jar" --force  --profiling=true --nReport=0  --concurrency=11 --seed=1 --outDir=.\quandenser_output_only_samples_noOX1\percolator/search_and_link_124_to_122.psms.pout_dinosaur --advParams="C:\Program Files\quandenser-v0-02\share\java/advParams_dinosaur_targeted.txt" --mode=target --targets=.\quandenser_output_only_samples_noOX1\percolator/search_and_link_124_to_122.psms.pout.dinosaur_targets.tsv E:\RAW\Lab2\Pilot_2\FL948_MSQ1388_20180605_SM_163.mzML
Dinosaur 1.1.3    built:${maven.build.timestamp}
  mzML file: E:\RAW\Lab2\Pilot_2\FL948_MSQ1388_20180605_SM_163.mzML
    out dir: .\quandenser_output_only_samples_noOX1\percolator/search_and_link_124_to_122.psms.pout_dinosaur
   out name: FL948_MSQ1388_20180605_SM_163

java.lang.ArrayIndexOutOfBoundsException: 1
        at se.lth.immun.TargetFile$$anonfun$parse$1.apply(TargetFile.scala:87)
        at se.lth.immun.TargetFile$$anonfun$parse$1.apply(TargetFile.scala:71)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at se.lth.immun.TargetFile.parse(TargetFile.scala:71)
        at se.lth.immun.TargetFile.read(TargetFile.scala:50)
        at se.lth.immun.TargetFile.read(TargetFile.scala:34)
        at se.lth.immun.Dinosaur$.main(Dinosaur.scala:55)
        at se.lth.immun.Dinosaur.main(Dinosaur.scala)
.                              .
[==============================]all hills, n=664071
hill checkSum = 160619692805339
peaky hills, n=664071
peaky hill checkSum = 160619692805339
  nScans    nHills
         2        0
         3   155019
         4    87664
      5-10   194913
     10-20   141717
     20-50    74336
    50-100     8003
   100-200     2419
   200-500        0
  500-1000        0
 1000-2000        0
 2000-5000        0
5000-10000        0
    >10000        0
writing hill reports...
hill reports written
=== IN TARGETED MODE ===

andrewjmc commented 4 years ago

I have confirmed that rerunning the dinosaur command gets the same error.

I think the problem is that the targets file (search_and_link_124_to_122.psms.pout.dinosaur_targets.tsv) is empty:

mz  charge  mzDiff  rtStart rtEnd   minApexInt  id
0
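
A quick sanity check on the targets file before launching Dinosaur could catch this case early. The exception index of 1 in the stack trace would be consistent with a row that has only one field, such as the lone "0" line above. The sketch below is only an assumption about what a well-formed file looks like (every row has as many tab-separated fields as the header); it is not Dinosaur's actual parsing logic, and `check_targets_file` is a hypothetical helper name.

```python
import csv


def check_targets_file(path):
    """Return a list of problems found in a tab-separated targets file.

    Assumption (not taken from Dinosaur's source): every data row should
    have the same number of tab-separated fields as the header row.
    """
    problems = []
    header = None
    n_data_rows = 0
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if header is None:
                header = row  # first non-blank line is the header
                continue
            n_data_rows += 1
            if len(row) != len(header):
                problems.append(
                    f"line {line_no}: expected {len(header)} fields, "
                    f"found {len(row)}"
                )
    if header is None:
        problems.append("file is empty")
    elif n_data_rows == 0:
        problems.append("header only, no target rows")
    return problems
```

An empty list means the file passed this (deliberately simple) check; anything else is worth looking at before spending hours on the Dinosaur run.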

In the run-up to the dinosaur run, everything seemed OK:

Matching features 124->122 (462/482)
Features:
ppmDiff rtDiff precMz rTime queryIsPlaceHolder targetIsPlaceHolder charge1 charge2 charge3 charge4plus
Percolator version 3.02.1, Build Date Jan 16 2020 07:33:00
Copyright (c) 2006-9 University of Washington. All rights reserved.
Written by Lukas Käll (lukall@u.washington.edu) in the
Department of Genome Sciences at the University of Washington.
Issued command:
percolator --only-psms --post-processing-tdc input_file_placeholder --trainFDR 0.02 --testFDR 0.02 --results-psms .\quandenser_output_only_samples_noOX1\percolator/link_124_to_122.psms --decoy-results-psms .\quandenser_output_only_samples_noOX1\percolator/link_124_to_122.psms.decoys
Started Sat Jan 25 01:32:13 2020
Hyperparameters: selectionFdr=0.02, Cpos=0, Cneg=0, maxNiter=10
FeatureNames::getNumFeatures(): 10
Train/test set contains 1560337 positives and 1083859 negatives, size ratio=1.43961 and pi0=1
Selecting Cpos by cross-validation.
Selecting Cneg by cross-validation.
Found 113958 test set positives with q<0.02 in initial direction
Reading in data and feature calculation took 10.76 cpu seconds or 11 seconds wall clock time.
---Training with Cpos selected by cross validation, Cneg selected by cross validation, initial_fdr=0.02, fdr=0.02
Iteration 1:    Estimated 127448 PSMs with q<0.02
Iteration 2:    Estimated 137562 PSMs with q<0.02
Iteration 3:    Estimated 145186 PSMs with q<0.02
Iteration 4:    Estimated 151166 PSMs with q<0.02
Iteration 5:    Estimated 155651 PSMs with q<0.02
Iteration 6:    Estimated 158503 PSMs with q<0.02
Iteration 7:    Estimated 160781 PSMs with q<0.02
Iteration 8:    Estimated 162627 PSMs with q<0.02
Iteration 9:    Estimated 164150 PSMs with q<0.02
Iteration 10:   Estimated 165361 PSMs with q<0.02
Learned normalized SVM weights for the 3 cross-validation splits:
 Split1  Split2  Split3 FeatureName
-25.6049        -25.3290        -24.9957        ppmDiff
-2.1010 -2.0142 -2.0114 rtDiff
-0.0281 -0.0343 -0.0294 precMz
-0.0914 -0.0870 -0.0877 rTime
-0.7556 -0.7668 -0.7499 queryIsPlaceHolder
 0.0105  0.0081  0.0081 targetIsPlaceHolder
 0.1344  0.1286  0.1287 charge1
-0.0431 -0.0448 -0.0434 charge2
-0.0487 -0.0462 -0.0485 charge3
-0.0162 -0.0102 -0.0095 charge4plus
-32.1126        -31.6627        -31.2687        m0
Found 165368 test set PSMs with q<0.02.
Selected best-scoring PSM per scan+expMass (target-decoy competition): 1115252 target PSMs and 706656 decoy PSMs.
Calculating q values.
Final list yields 149751 target PSMs with q<0.02.
Calculating posterior error probabilities (PEPs).
Processing took 332.5 cpu seconds or 333 seconds wall clock time.
Links before 0
Links after 164860

The PSMs file (link_124_to_122.psms) is 61 MB. The decoys file (link_124_to_122.psms.decoys) is very small, and I'm unsure why. Decoys files are usually only a little smaller than the PSMs files.

PSMId   score   q-value posterior_error_prob    peptide proteinIds
193678_868.657532_51.7893143_1799528_1  0.136907    0.000793021 0.00403301  A.154146_868.657471_51.7787857_0_1.A
68976_1376.31238_61.6550255_702506.75_1 0.130283    0.00115075  0.00477215  A.68491_1376.31226_61.6410866_1197500.75_1.A
67510_1271.77832_78.6169662_59034.7578_1    0.129991    0.00152964  0.00480767  A.66934_1271.77844_78.6167526_109898.078_1.A
52868_861.40033_55.5789948_1105606.63_4 0.129954    0.00178444  0.00481215  A.106039_861.40033_55.5467949_0_4.A
25607_619.709656_79.7467804_293563.188_3    0.129224    0.00203943  0.00490228  A.140139_619.709656_79.7593689_0_3.A

Advice gratefully appreciated!

Andrew

MatthewThe commented 4 years ago

That's very strange indeed. According to the logs, there should be 706656 lines in the decoy file. Could it be that you're running out of disk space?
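
For anyone wanting to check this kind of mismatch themselves, a rough sketch follows. It pulls the decoy PSM count out of the Percolator log line quoted above and counts the data lines in the decoys file for comparison. Both function names are hypothetical, and the idea that the file holds one header line plus one line per decoy PSM is an assumption.

```python
import re


def expected_decoy_count(log_text):
    """Extract the decoy PSM count from Percolator's log output.

    Looks for the 'N target PSMs and M decoy PSMs' phrase seen in the
    log in this thread; returns None if the phrase is not present.
    """
    match = re.search(r"(\d+) target PSMs and (\d+) decoy PSMs", log_text)
    return int(match.group(2)) if match else None


def count_data_lines(path):
    """Count non-blank lines in a file, excluding one header line.

    Assumption: the file has a single header line followed by one
    line per PSM.
    """
    with open(path, "rb") as handle:
        n_lines = sum(1 for line in handle if line.strip())
    return max(n_lines - 1, 0)
```

If the two numbers differ by a lot, the output was probably truncated while it was being written.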

andrewjmc commented 4 years ago

Great guess. Absolutely right. Should have checked, feeling sheepish!
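
Since the root cause was a full disk, a free-space check before a long run may save a rerun. This is a minimal sketch using Python's standard `shutil.disk_usage`. The helper names and any threshold you pass in are hypothetical, since how much space a quandenser run needs is not stated in this thread.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path, needed_gib):
    """Return True if at least `needed_gib` GiB are free at `path`.

    `needed_gib` is a caller-supplied guess, not a documented requirement.
    """
    return free_gib(path) >= needed_gib
```

Calling `has_room` on the output directory, with a threshold comfortably larger than the files seen so far, would flag the problem before Percolator silently writes truncated output.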