statisticalbiotechnology / quandenser

QUANtification by Distillation for ENhanced Signals with Error Regulation
Apache License 2.0

Missing output files #18

Closed andrewjmc closed 4 years ago

andrewjmc commented 4 years ago

After several reruns (due to running out of memory and disk space) while processing 241 RAW files, quandenser has completed. However, I seem to be missing critical output.

The maracluster_extra_features directory contains only:

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       29/01/2020     23:03      568824800 199.dat
-a----       29/01/2020     23:03      507717100 701.dat
-a----       29/01/2020     23:03      438722200 956.dat
-a----       29/01/2020     23:03            225 MaRaCluster.dat_file_list.txt
-a----       29/01/2020     21:48       34422937 MaRaCluster.peak_counts.dat
-a----       29/01/2020     23:03      150719296 MaRaCluster.scan_info.dat

I am missing the clusters_p10.tsv file, among others. There is also crucially no Quandenser.feature_groups.tsv in the output root.

I cannot see that any errors occurred during processing (see the tail of the log, attached). Notably, all the dinosaur files, the first round of clustering, and all but two of the percolator-dinosaur runs were already on disk.

Any help would be gratefully appreciated.

log3.txt

MatthewThe commented 4 years ago

Is that the last part of the log file? If so, it seems that the clustering run has not finished yet, which also explains why neither a clusters_p10.tsv nor a Quandenser.feature_groups.tsv is produced. It could be that we ran out of memory again, especially if all MS1 features are still loaded in memory at the same time.

andrewjmc commented 4 years ago

Yes, that's the last part. Strange that it didn't give an error as it usually does. I'll give it another go later; I'm using the machine for another purpose at the moment.

andrewjmc commented 4 years ago

Interestingly, on a rerun it crashed at exactly the same point (though it seemed to be running much faster). I wasn't watching right up to the end, but memory usage was high and seemed stable. Does this process use more memory as spectra are added? Would you expect an error-free crash if the quandenser binary ran out of memory?

Successfully inserted spectrum 5300001/5688248 (93%)
  Elapsed time: 7613.59 cpu seconds or 126 min 53 sec wall time.
  Estimated time remaining: 9 min 17 sec wall time.

It feels like I'm nearly there, but there are still some barriers to overcome.

Thanks for your help,

Andrew

MatthewThe commented 4 years ago

The process does use more memory as spectra are added, but for 5688248 spectra this should be below 1GB in total. I would expect the OS to raise an error if it runs out of memory, but I'm not 100% sure what happens under Windows.
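(As a rough check on that figure: 1GB spread over 5,688,248 spectra is under ~190 bytes per spectrum, so this assumes only a small fixed record is kept per inserted spectrum.)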

If I'm not mistaken, you should be able to run maracluster (installers are here) separately from quandenser once the .dat files have been created (which is probably also the reason it ran much faster the second time around):

maracluster batch --splitMassChargeStates --batch .\quandenser_files_only_samples_noOX1.txt --clusterThresholds -10.0 --pvalThreshold -10.0 --output-folder .\quandenser_output_only_samples_noOX1\maracluster_extra_features

If that runs through, you should be able to run quandenser again, and it should then pick up the results from this maracluster run.

If that also fails at the same point, please share the maracluster_extra_features with me and I'll have a look.

andrewjmc commented 4 years ago

Some progress! The maracluster run worked, and Quandenser picked up the output.

However, a different error now occurs during a later step:

Finished writing clusterings.
Running MaRaCluster took: 150.928 cpu seconds or 151 seconds wall time
Serialized spectrum to precursor map
...
Processing pair 234/482: 174->202 (2494632 links)
Processing pair 235/482: 177->202 (160302 links)
Processing pair 236/482: 179->202 (76416 links)
Processing pair 237/482: 187->202 (158450 links)
Processing pair 238/482: 194->202 (85212 links)
Processing pair 239/482: 198->202 (59251 links)
Processing pair 240/482: 213->202 (248460 links)
Processing pair 241/482: 228->202 (73042 links)
Processing pair 242/482: 202->66 (5269417 links)
Processing pair 243/482: 202->86 (5269417 links)
Processing pair 244/482: 202->94 (5269417 links)
Processing pair 245/482: 202->102 (5269417 links)
Processing pair 246/482: 202->170 (5269417 links)
Processing pair 247/482: 202->174 (5269417 links)
Processing pair 248/482: 202->177 (5269417 links)
Processing pair 249/482: 202->179 (5269417 links)
Processing pair 250/482: 202->187 (5269417 links)
Processing pair 251/482: 202->194 (5269417 links)
Processing pair 252/482: 202->198 (5269417 links)
Processing pair 253/482: 202->213 (5269417 links)
Exception caught: bad allocation

There is plenty of space on the drive.

MatthewThe commented 4 years ago

That message usually indicates that there isn't enough memory. I'll have a look at whether there is any way to either use less memory in this step or free up objects that are no longer needed.
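For context, "bad allocation" is the message that std::bad_alloc::what() returns in MSVC builds when a C++ memory allocation fails, which is why no separate OS error appears. A minimal standalone illustration (not quandenser code):

#include <iostream>
#include <new>
#include <vector>

int main() {
    try {
        // Request ~1 PiB on a 64-bit system; the allocation inside the
        // vector constructor fails and operator new throws std::bad_alloc.
        std::vector<char> huge(static_cast<std::size_t>(1) << 50);
    } catch (const std::bad_alloc& e) {
        // In MSVC builds, e.what() is "bad allocation" -- the exact text in the log.
        std::cerr << "Exception caught: " << e.what() << '\n';
    }
    return 0;
}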

andrewjmc commented 4 years ago

Thanks!

MatthewThe commented 4 years ago

There were no obvious places to optimize memory, but I'll have to think a bit more about whether there's a way to restructure the code to bring down memory consumption.

I also did some calculations on the expected memory consumption, and my estimate comes to about 230GB of RAM needed to finish the entire process. Given that you got through all the alignments, I'm guessing you are already using virtual memory (i.e. a swap file). One thing to try would be to temporarily increase this limit, if you have the permissions to do so.

andrewjmc commented 4 years ago

OK. In doing this I spotted that the paging file is on spinning metal rather than the SSD. I will move it and increase it from 200,000 MB to 250,000 MB and see how this helps. Together with the 64 GB of RAM, this should suffice.

Thanks.

andrewjmc commented 4 years ago

OK, so with the extra paging on the SSD, I've got further... but

Processing pair 358/482: 168->159 (5302630 links)
Processing pair 359/482: 180->176 (5352035 links)
Processing pair 360/482: 180->181 (5352035 links)
Processing pair 361/482: 185->81 (5613411 links)
Processing pair 362/482: 185->97 (5613411 links)
Processing pair 363/482: 185->104 (5613411 links)
Processing pair 364/482: 185->108 (5613411 links)
Processing pair 365/482: 185->148 (5613411 links)

...progress seems to have ground to a halt, and I'm not convinced there's been any movement in 24 hours. Memory usage is at 63 GB RAM + 301/354 GB committed (I assume that means 238 GB in the paging file). There is space on the drive.

Quandenser is only using 0.1-0.3% of the CPU (12 cores), which seems very low. I cannot see any errors.

What do you think?

MatthewThe commented 4 years ago

I'm guessing that the paging has slowed things down too much; the ~70GB std::map instance probably does not work well when paged to disk... We should actually be able to write this map to disk ourselves and only load parts of it into memory at a time. I'll see if I can implement this in the next few days.
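To sketch the idea (hypothetical names and a fixed-size record for illustration; the actual quandenser code will differ): write the map's entries to a binary file in their sorted order, free the in-memory map, then stream the file back in fixed-size chunks so only one chunk is resident at a time.

#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// Hypothetical fixed-size record standing in for one map entry.
struct Entry {
    std::uint32_t key;
    std::uint32_t value;
};

// Dump the (already sorted) map entries to a binary file so the map can be freed.
void spillToDisk(const std::map<std::uint32_t, std::uint32_t>& m, const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return;
    for (const auto& kv : m) {
        Entry e{kv.first, kv.second};
        std::fwrite(&e, sizeof(Entry), 1, f);
    }
    std::fclose(f);
}

// Stream the file back, holding only chunkSize entries in memory at a time.
template <typename Visitor>
void forEachEntry(const char* path, std::size_t chunkSize, Visitor visit) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return;
    std::vector<Entry> buf(chunkSize);
    std::size_t n;
    while ((n = std::fread(buf.data(), sizeof(Entry), chunkSize, f)) > 0)
        for (std::size_t i = 0; i < n; ++i) visit(buf[i]);
    std::fclose(f);
}

int main() {
    std::map<std::uint32_t, std::uint32_t> m{{1, 10}, {2, 20}, {3, 30}};
    spillToDisk(m, "map.dat");
    m.clear();  // in the real case, this is where the ~70GB of map nodes gets released
    forEachEntry("map.dat", 2, [](const Entry& e) {
        std::printf("%u -> %u\n", e.key, e.value);
    });
    return 0;
}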

andrewjmc commented 4 years ago

Since it's an SSD, I had hoped the performance hit wouldn't be so severe. Thanks for your help.

MatthewThe commented 4 years ago

I created an updated version in which you should be able to reduce the memory usage by adding the new flag --use-tmp-files: quandenser-v0-02-win64-msvc-x64.zip

Normally the SSD virtual memory performance hit is indeed limited, but I suspect that accessing the gigantic map object resulted in memory thrashing. With the updated version this should no longer be the case, even without the --use-tmp-files flag. I'm working on writing some other parts of the process to disk as well, which should hopefully reduce the total memory to within 64GB.
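For intuition on the thrashing (a generic illustration, not quandenser's actual data structures): every std::map node is a separate heap allocation, so once the map is paged out, even a simple lookup chases pointers across pages scattered through the swap file, while the same data kept sorted in one contiguous vector can be binary-searched with only a handful of page-ins.

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

int main() {
    // Node-based: each entry lives in its own heap allocation, so a lookup
    // chases pointers across scattered pages -- painful once paged out.
    std::map<std::uint64_t, std::uint64_t> nodeBased;
    for (std::uint64_t i = 0; i < 1000; ++i) nodeBased[i] = 2 * i;
    auto it = nodeBased.find(500);

    // Contiguous alternative: sorted pairs in one flat allocation; a binary
    // search touches only O(log n) pages, and scans are purely sequential.
    std::vector<std::pair<std::uint64_t, std::uint64_t>> flat(nodeBased.begin(),
                                                              nodeBased.end());
    auto pos = std::lower_bound(flat.begin(), flat.end(),
                                std::pair<std::uint64_t, std::uint64_t>{500, 0});
    (void)it;
    (void)pos;
    return 0;
}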

andrewjmc commented 4 years ago

Rerunning now (from the same set of files). I can see the temporary files have been created (only around a GB so far). Quandenser is now at pair 242/482 of serializing spectrum-to-precursor maps. RAM usage is 63.1/63.9 GB with 249/354 GB committed.

Quandenser is using at most 0.7% CPU, so it still seems to be limited by throughput elsewhere. I can see the hard disk is running at what Windows reports as 100% of throughput (~100 MB/s write).

I now realise that the E:\ drive is not the SSD, as I had believed. The SSD is only 250 GB and has only just over 100 GB free (Windows is on it), so it looks like there wouldn't be space to put the pagefile there.

MatthewThe commented 4 years ago

I added functionality to write even more intermediate files to disk; this should hopefully keep total memory below ~80GB: quandenser-v0-02-win64-msvc-x64.zip

However, the 0.7% maximum CPU could suggest that there is a different problem. Is there still progress?

andrewjmc commented 4 years ago

Thanks. Over 8 hours it's progressed to 409/482, so that's just over 20 pairs an hour. Quandenser CPU usage is sitting at ~1.5%. Links per pair have risen from ~100k to ~650k, and I suspect they will increase a little more, so I imagine the last 73 might take another 8-12 hours.

andrewjmc commented 4 years ago

If it gets through this by the end of the day I'll leave it running; otherwise I'll restart with the new version.

andrewjmc commented 4 years ago

Phew! Now writing feature groups. I'm expecting 5.5 million, I think (the number of placeholders). Best-case scenario of 10,000 per hour = 23 days. Eek!
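(5,500,000 at 10,000 per hour is 550 hours, i.e. about 23 days.)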

RAM used is only 32 GB now, but still 270 GB paged.

If I cancel and rerun with the newest version, will it redo the 482 pair processing steps, or have these been written to disk?

Thanks!

MatthewThe commented 4 years ago

Unfortunately, it will have to redo the clustering of the 482 pairs; what is written to disk is just auxiliary information.

The number of feature groups will indeed be around 5.5 million, so if the speed of processing the feature groups does not increase we might have to come up with a different solution.

andrewjmc commented 4 years ago

Hi Matthew,

Running with the new version, leaving all the files from previous runs in place, has failed. I tried twice, each time ending with:

...
Processing pair 480/482: 121->127 (5541168 links)
Processing pair 481/482: 127->143 (5538564 links)
Processing pair 482/482: 143->119 (5538964 links)
Exception caught: bad allocation

It was the same point both times. I was observing the machine the first time round and there appeared to be plenty of RAM and swap space free.

Is this because I need to remove some of the intermediate files from previous runs with the earlier versions?

Thanks

MatthewThe commented 4 years ago

The intermediate files should not be a problem. Some things that may help me debug this:

  1. Are there still files in the Quandenser_output/tmp/featureToGroupMaps/ folder or is it empty?
  2. What is the combined size of the Quandenser_output/tmp/matchFeatures/features.<x>.dat files?
  3. Is there a file called Quandenser_output/maracluster_extra_features/Quandenser.spectrum_to_precursor_map.dat and if so, what size is it?

Is getting to this point at least going faster than previously?

andrewjmc commented 4 years ago

Yes, it is definitely going faster. It failed overnight, so less than 12 hours.

  1. Quandenser_output/tmp/featureToGroupMaps/ is empty
  2. Total size of Quandenser_output/tmp/matchFeatures/features.<x>.dat is 62 GB (848 KB to 214 MB each)
  3. There is no Quandenser_output/maracluster_extra_features/Quandenser.spectrum_to_precursor_map.dat

The drive has 370 GB free, if that helps.

Thanks again

MatthewThe commented 4 years ago

Thanks for the info, I'll take a look at what could be going wrong.

andrewjmc commented 4 years ago

Thanks

MatthewThe commented 4 years ago

I'm still not quite sure why it ran out of memory, but I did some optimization of the memory allocation and added some logging messages: quandenser-v0-02-win64-msvc-x64.zip

If this does not work, I think I might be able to drastically reduce the memory footprint by not loading in the "placeholder features", but this will require some more significant restructuring of the code.

andrewjmc commented 4 years ago

Thanks, rerun started immediately.

andrewjmc commented 4 years ago

It's progressed well without error and is now writing feature groups. This step remains slow: 5.5 million to write, and it's progressing by 20,000 in 38 minutes, suggesting it will take a week to complete. As long as it works, I'll manage this!

andrewjmc commented 4 years ago

OK, it finished this morning, so quicker than a week! I now have a 27 GB feature group file, 625 MB of MaRaCluster clusters, and 8.5 GB of consensus spectra to enjoy!

I received errors during consensus spectra generation:

Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_0_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_1_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_2_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_3_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_4_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_5_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_6_tmpfile.ms2: Permission denied

I can see the files and delete them, so this isn't a problem. However, I note that the tmpfiles go to part 47, but consensus spectra were only made up to part 16. Is this right?

Merging spectra in bin 48/48
  Processing consensus spectra.
  Batch 1/7
  Batch 2/7
  Batch 3/7
  Batch 4/7
  Batch 5/7
  Batch 6/7
  Batch 7/7
  Merging spectra
Warning: Can't remove...
Creating merged MSData file
Writing file .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part16.ms2
Finished writing file .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part16.ms2
Finished merging clusters
Running Quandenser took: 332495 cpu seconds or 332491 seconds wall time

Thanks for your help getting this through. I think the analysis will keep me quite busy now!

MatthewThe commented 4 years ago

Excellent! It is indeed expected that the tmpfiles go to higher part numbers, so nothing to worry about there.

andrewjmc commented 4 years ago

Superb! Thanks