statisticalbiotechnology / quandenser-pipeline

A nextflow/singularity pipeline for quandenser
Apache License 2.0

Warnings in stdout #28

Closed andrewjmc closed 4 years ago

andrewjmc commented 4 years ago

I have got closer than before to successful output but, typically, the HPC randomly restarted my job during the last step, extracting consensus spectra. Nonetheless, I still have the feature groups file.

However, I note some errors in stdout:

Could not open matrix file Quandenser_output/maracluster_extra_features/poisoned.pvalues.dat
Starting MinHeap clustering
Could not read edges from input file.
Reading in p-value tree.
Reading in p-value tree.
WARNING: Empty pvalue tree file Quandenser_output/maracluster_extra_features/overlap.pvalue_tree.tsv
Writing clusterings for 1 thresholds.
Writing clusters to Quandenser_output/maracluster_extra_features/MaRaCluster.clusters_p10.tsv

These are the files in maracluster_extra_features:

350.dat (489518100)
350.dat.pvalue_tree.tsv (41830270)
350.dat.pvalue_vectors.head.dat (248)
350.dat.pvalue_vectors.tail.dat (248)
MaRaCluster.clusters_p10.tsv (220753532)
MaRaCluster.dat_file_list.txt (53)
MaRaCluster.peak_counts.dat (30071665)
MaRaCluster.scan_info.dat (67015056)
overlap.pvalue_tree.tsv (0)
Quandenser.spectrum_to_precursor_map.dat (196049160)

Does this suggest an important problem? I cannot find any other warnings in stdout.

Thanks,

Andrew

andrewjmc commented 4 years ago

The early poisoned clustering job seemed to work fine:

Starting poisoned clustering job: batches 0-3
Clustering 111888 pvalues
Starting MinHeap clustering
  Loading new edges.
It. 0: minRow = 25 34704, minCol = 48 27635, minEl = -54.1359, edgesLeft = 111748
Finished MinHeap clustering
  Elapsed time: 0.116912 cpu seconds or 0 min 1 sec wall time.
Retained 0 pvalues
Finished calculating pvalues.
  Elapsed time: 505.995 cpu seconds or 2 min 28 sec wall time.
  Estimated time remaining: 0 min 0 sec wall time.
Starting p-value clustering.
Estimated 142 p-values in this file
Writing 1 part files
  Elapsed time: 0.000451 cpu seconds or 0 min 0 sec wall time.
Sorting and filtering bin 1/1 (100%)
  Elapsed time: 0.371598 cpu seconds or 0 min 0 sec wall time.
Writing p-value 142/142 (100%)
  Elapsed time: 0.372252 cpu seconds or 0 min 0 sec wall time.
Starting MinHeap clustering
  Creating cluster membership map.
  Updating incomplete edges (0).
  Loading new edges.
  Sorting 142 edges.
  Adding new edges.
  Loaded new edges: new: 142, total: 142/142 (100%).
It. 0: minRow = 27 25122, minCol = 35 22537, minEl = -35.5161
Finished MinHeap clustering
  Elapsed time: 0.515618 cpu seconds or 0 min 0 sec wall time.
Reading in p-value tree.
Reading in p-value tree.
Writing clusterings for 1 thresholds.
Writing clusters to Quandenser_output/maracluster/MaRaCluster.clusters_p10.tsv
clust_size      #clusters       #spectra
1       2387176 2387176
2-3     187856  437121
4-7     79698   400119
8-15    31729   332478
16-31   12639   270232
32-63   4975    213152
64-127  1313    108475
128-255 191     31223
256-511 23      7695
512+    1       770
total   2705601 4188441

Finished writing clusterings.
Running MaRaCluster took: 10625.3 cpu seconds or 540 seconds wall time

The second one looks different:

Starting poisoned clustering job: batches 0-4
Clustering 1601077 pvalues
Starting MinHeap clustering
  Loading new edges.
It. 0: minRow = 38 66599, minCol = 43 52780, minEl = -57.1395, edgesLeft = 1601076
It. 10000: minRow = 3000000000 9273, minCol = 37 25352, minEl = -14.3581, edgesLeft = 761705
Finished MinHeap clustering
  Elapsed time: 3.41774 cpu seconds or 0 min 4 sec wall time.
Retained 0 pvalues
Finished calculating pvalues.
  Elapsed time: 869.539 cpu seconds or 3 min 37 sec wall time.
  Estimated time remaining: 0 min 0 sec wall time.
Starting p-value clustering.
  Elapsed time: 8e-06 cpu seconds or 0 min 0 sec wall time.
  Elapsed time: 0.194495 cpu seconds or 0 min 0 sec wall time.
  Elapsed time: 0.195684 cpu seconds or 0 min 0 sec wall time.

MatthewThe commented 4 years ago

This is perfectly fine: the overlap.pvalues.dat file is only produced if the precursors are split up into multiple input files. As you only have a single precursor range (350.dat), this is the expected behavior, though I will remove the warning message for this case.
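
For anyone who wants to check this on their own output, here is a minimal sketch of the reasoning above. It is not part of quandenser or MaRaCluster. The function name `overlap_warning_is_expected` is hypothetical, and the pattern used to recognise precursor-range files (`<integer>.dat`, as in 350.dat) is an assumption inferred from the file listing in this thread.

```python
import re
from pathlib import Path

# Assumption: precursor-range files are named "<integer>.dat" (e.g. 350.dat).
# This pattern is inferred from the listing in this thread, not from the tool's docs.
RANGE_FILE = re.compile(r"^\d+\.dat$")


def overlap_warning_is_expected(feature_dir):
    """Return True if an empty or missing overlap tree is consistent
    with the directory holding at most one precursor-range file."""
    feature_dir = Path(feature_dir)
    # Collect files such as 350.dat, ignoring 350.dat.pvalue_tree.tsv and friends.
    range_files = [p.name for p in feature_dir.iterdir() if RANGE_FILE.match(p.name)]
    overlap = feature_dir / "overlap.pvalue_tree.tsv"
    overlap_empty = (not overlap.exists()) or overlap.stat().st_size == 0
    return len(range_files) <= 1 and overlap_empty
```

Called as `overlap_warning_is_expected("Quandenser_output/maracluster_extra_features")`, it should return True for the directory listed above, which has a single range file (350.dat) and a zero-byte overlap.pvalue_tree.tsv. A False result only means the layout differs from the single-range case described here and deserves a closer look.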

andrewjmc commented 4 years ago

Great, thanks!