andrewjmc closed this issue 4 years ago
Is that the last part of the log file? If so, it seems that the clustering run did not finish yet, which also explains why neither a `clusters_p10.tsv` nor a `Quandenser.feature_groups.tsv` is produced. It could be that we ran out of memory again, especially if all MS1 features are still loaded in memory at the same time.
Yes, that's the last part. Strange that it didn't give an error as it usually does. I'll give it another go later, but I'm using the machine for another purpose at the moment.
Interestingly, on a rerun it crashed at exactly the same point (though it seemed to be running much faster). I wasn't watching right up to the end, but memory usage was high and seemed stable. Does this process use more memory as spectra are added? Would you expect an error-free crash if the quandenser binary ran out of memory?
Successfully inserted spectrum 5300001/5688248 (93%)
Elapsed time: 7613.59 cpu seconds or 126 min 53 sec wall time.
Estimated time remaining: 9 min 17 sec wall time.
It feels like I'm nearly there, but still some barriers to overcome.
Thanks for your help,
Andrew
The process does use more memory as spectra are added, but for 5688248 spectra this should be below 1 GB in total. I would expect the OS to raise an error if it runs out of memory, but I'm not 100% sure what happens under Windows.
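A rough sanity check of that sub-1 GB figure can be sketched as follows. The bytes-per-spectrum value is an assumption chosen for illustration, not Quandenser's actual in-memory layout:

```python
# Back-of-envelope check that a per-spectrum map for ~5.7M spectra
# stays below 1 GB. BYTES_PER_ENTRY is an assumed figure (a few
# ints/floats plus container overhead), not Quandenser's real layout.
NUM_SPECTRA = 5_688_248
BYTES_PER_ENTRY = 150  # assumption for illustration

total_gb = NUM_SPECTRA * BYTES_PER_ENTRY / 1024**3
print(f"estimated map size: {total_gb:.2f} GB")  # well under 1 GB
```

Even with generous per-entry overhead, this stays far below the tens of gigabytes observed later, which is why the growth during spectrum insertion is not the suspect here.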
If I'm not mistaken, you should be able to run maracluster (installers are here) separately from quandenser, once the `.dat` files have been created (which is probably also the reason that it ran much faster the second time around):

```
maracluster batch --splitMassChargeStates --batch .\quandenser_files_only_samples_noOX1.txt --clusterThresholds -10.0 --pvalThreshold -10.0 --output-folder .\quandenser_output_only_samples_noOX1\maracluster_extra_features
```

If that runs through, you should be able to run quandenser again, which should then pick up the results from this maracluster run.
If that also fails at the same point, please share the `maracluster_extra_features` with me and I'll have a look.
Some progress! The maracluster run worked, and Quandenser picked up the output; however, a different error now occurs during a later step:
Finished writing clusterings.
Running MaRaCluster took: 150.928 cpu seconds or 151 seconds wall time
Serialized spectrum to precursor map
...
Processing pair 234/482: 174->202 (2494632 links)
Processing pair 235/482: 177->202 (160302 links)
Processing pair 236/482: 179->202 (76416 links)
Processing pair 237/482: 187->202 (158450 links)
Processing pair 238/482: 194->202 (85212 links)
Processing pair 239/482: 198->202 (59251 links)
Processing pair 240/482: 213->202 (248460 links)
Processing pair 241/482: 228->202 (73042 links)
Processing pair 242/482: 202->66 (5269417 links)
Processing pair 243/482: 202->86 (5269417 links)
Processing pair 244/482: 202->94 (5269417 links)
Processing pair 245/482: 202->102 (5269417 links)
Processing pair 246/482: 202->170 (5269417 links)
Processing pair 247/482: 202->174 (5269417 links)
Processing pair 248/482: 202->177 (5269417 links)
Processing pair 249/482: 202->179 (5269417 links)
Processing pair 250/482: 202->187 (5269417 links)
Processing pair 251/482: 202->194 (5269417 links)
Processing pair 252/482: 202->198 (5269417 links)
Processing pair 253/482: 202->213 (5269417 links)
Exception caught: bad allocation
There is plenty of space on the drive.
That message usually indicates that there isn't enough memory. I'll have a look if there is any way to either use less memory in this step or to free up objects that are no longer needed.
Thanks!
There were no obvious places to optimize memory, but I have to think a bit more about whether there's a way to restructure the code to bring down memory consumption.
I also did some calculations on the expected memory consumption, and my estimate comes to about 230 GB of RAM necessary to finish the entire process. Given that you could get through all the alignments, I'm guessing you are already using virtual memory (i.e. a swap file). One thing to try would be to temporarily increase this limit, if you have the permissions to do so.
OK. In doing this I spotted that the paging file is on spinning metal rather than the SSD. I will shift it over and increase it from 200,000 MB to 250,000 MB and see how this helps. With the 64 GB of RAM, this should suffice.
Thanks.
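A quick check, using the figures from this exchange, that the enlarged paging file plus physical RAM covers the ~230 GB estimate:

```python
# Commit-limit check: 64 GB RAM plus a 250,000 MB paging file,
# against the ~230 GB estimate from the previous comment.
ram_gb = 64
pagefile_gb = 250_000 / 1024  # new paging file size in GB
required_gb = 230

total_commit_gb = ram_gb + pagefile_gb
print(f"{total_commit_gb:.0f} GB available, enough: {total_commit_gb >= required_gb}")
```

So the new commit limit (~308 GB) leaves roughly 78 GB of headroom over the estimate.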
OK, so with the extra paging on the SSD, I've got further... but
Processing pair 358/482: 168->159 (5302630 links)
Processing pair 359/482: 180->176 (5352035 links)
Processing pair 360/482: 180->181 (5352035 links)
Processing pair 361/482: 185->81 (5613411 links)
Processing pair 362/482: 185->97 (5613411 links)
Processing pair 363/482: 185->104 (5613411 links)
Processing pair 364/482: 185->108 (5613411 links)
Processing pair 365/482: 185->148 (5613411 links)
...progress seems to have ground to a halt, and I'm not convinced there's been any movement in 24 hours. Memory usage is at 63 GB RAM + 301/354 GB committed (I assume that means 238 GB in the paging file). There is space on the drive.
Quandenser is only using 0.1-0.3% of the CPU (12 cores), which seems very low. I cannot see any errors.
What do you think?
I'm guessing that the paging has slowed things down too much; the std::map instance of ~70 GB probably does not work so well on disk... We should actually be able to write this map to disk and only load parts of it into memory at a time. I'll see if I can implement this in the next few days.
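The idea described here, keeping the map on disk and loading only part of it at a time, can be sketched as a bucketed key-value store. This is a hypothetical toy illustration, not Quandenser's actual implementation:

```python
# Minimal sketch of a disk-backed map: keys are hashed into buckets,
# each bucket lives in its own file, and only one bucket is held in
# memory at a time. All names here are hypothetical.
import os
import pickle
import tempfile


class DiskBackedMap:
    """Toy key-value map keeping a single bucket in memory."""

    def __init__(self, num_buckets=16, dirname=None):
        self.num_buckets = num_buckets
        self.dir = dirname or tempfile.mkdtemp()
        self._bucket_id = None  # index of the bucket currently loaded
        self._bucket = {}

    def _path(self, b):
        return os.path.join(self.dir, f"bucket_{b}.pkl")

    def _flush(self):
        # Write the currently loaded bucket back to its file.
        if self._bucket_id is not None:
            with open(self._path(self._bucket_id), "wb") as f:
                pickle.dump(self._bucket, f)

    def _load(self, b):
        # Swap in bucket b, flushing the old one first.
        if self._bucket_id != b:
            self._flush()
            path = self._path(b)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    self._bucket = pickle.load(f)
            else:
                self._bucket = {}
            self._bucket_id = b

    def __setitem__(self, key, value):
        self._load(hash(key) % self.num_buckets)
        self._bucket[key] = value

    def __getitem__(self, key):
        self._load(hash(key) % self.num_buckets)
        return self._bucket[key]
```

For purely random access this would thrash one bucket file per lookup, so a real implementation would sort or batch accesses by bucket; the point is only that peak memory drops to one bucket's worth instead of the whole ~70 GB map.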
Being an SSD I hoped the performance hit wouldn't be so severe. Thanks for your help.
I created an updated version in which you should be able to reduce the memory usage by adding the new flag `--use-tmp-files`:
quandenser-v0-02-win64-msvc-x64.zip
Normally the SSD virtual memory performance hit is indeed limited, but I suspect that accessing the gigantic map object resulted in memory thrashing. With the updated version this should no longer be the case, even without the `--use-tmp-files` flag. I'm working on writing some other parts of the process to disk as well, which should hopefully reduce the total memory to within 64 GB.
Rerunning now (from the same set of files). I can see the temporary files have been created (only around a GB so far). Quandenser is now at pair 242/482 of serializing spectrum-to-precursor maps. RAM usage is 63.1/63.9 GB with 249/354 GB committed.
Quandenser is using at most 0.7% CPU, so it still seems to be limited by throughput elsewhere. I can see the hard disk is running at what Windows reports as 100% of throughput (~100 MB/s write).
I now realise that the E:\ drive is not the SSD, as I believed. The SSD is only 250 GB and has only just over 100 GB free (Windows is on it). It looks like there wouldn't be space to put the pagefile there.
I added functionality to write even more intermediate files to disk; this should hopefully keep total memory below ~80 GB: quandenser-v0-02-win64-msvc-x64.zip
However, the 0.7% maximum CPU could suggest that there is a different problem. Is there still progress?
Thanks. Over 8 hours it's progressed to 409/482, so that's just over 20 pairs an hour. Quandenser CPU usage is sitting at ~1.5%. Links per pair have risen from ~100k to ~650k, and I suspect they will increase a little more, so I imagine the last 73 might take another 8-12 hours.
If it gets through this by the end of the day I'll leave it running, otherwise I'll start with the new version.
Phew! Now writing feature groups. I'm expecting 5.5 million, I think (the number of placeholders). Best case scenario: 10,000 per hour = 23 days. Eek!
RAM used is only 32 GB now, but still 270 GB paged.
If I cancel and rerun with the newest version, will it redo the 482 pair processing steps, or have these been written to disk?
Thanks!
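The projection above can be written out explicitly, using the figures from this comment:

```python
# Projection of the feature-group writing step: ~5.5 million groups
# at an assumed rate of 10,000 per hour (figures from the thread).
total_groups = 5_500_000
rate_per_hour = 10_000

hours = total_groups / rate_per_hour
print(f"{hours:.0f} hours = {hours / 24:.0f} days")  # 550 hours = 23 days
```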
Unfortunately it will have to redo the clustering of the 482 pairs; what is written to disk is just auxiliary information.
The number of feature groups will indeed be around 5.5 million, so if the speed of processing the feature groups does not increase we might have to come up with a different solution.
Hi Matthew,
Running with the new version, leaving all of the files from previous runs in place, has failed. I tried twice, each time ending with:
...
Processing pair 480/482: 121->127 (5541168 links)
Processing pair 481/482: 127->143 (5538564 links)
Processing pair 482/482: 143->119 (5538964 links)
Exception caught: bad allocation
It was the same point both times. I was observing the machine the first time round and there appeared to be plenty of RAM and swap space free.
Is this because I need to remove some of the intermediate files from previous runs with the former versions?
Thanks
The intermediate files should not be a problem. Some things that may help me debug this:

- Is there anything in the `Quandenser_output/tmp/featureToGroupMaps/` folder, or is it empty?
- How large are the `Quandenser_output/tmp/matchFeatures/features.<x>.dat` files?
- Does `Quandenser_output/maracluster_extra_features/Quandenser.spectrum_to_precursor_map.dat` exist, and if so, what size is it?
- Is getting to this point at least going faster than previously?
Yes, it is definitely going faster. It failed overnight, so less than 12 hours.
- `Quandenser_output/tmp/featureToGroupMaps/` is empty
- The `Quandenser_output/tmp/matchFeatures/features.<x>.dat` files come to 62 GB (848 KB to 214 MB each)
- `Quandenser_output/maracluster_extra_features/Quandenser.spectrum_to_precursor_map.dat`

The drive has 370 GB free, if this helps.
Thanks again
Thanks for the info, I'll take a look at what could be going wrong.
Thanks
I'm still not quite sure why it ran out of memory, but I did some optimization of the memory allocation and added some logging messages: quandenser-v0-02-win64-msvc-x64.zip
If this does not work, I think I might be able to drastically reduce the memory footprint by not loading in the "placeholder features", but this will require some more significant restructuring of the code.
Thanks, rerun started immediately.
It's progressed well without error and is now writing feature groups. This step remains slow: 5.5 million to write, and it's progressing by 20,000 every 38 minutes, suggesting it will take a week to complete. As long as it works, I'll manage this!
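The revised projection with the observed rate works out like this (figures from this comment):

```python
# Revised projection: 20,000 feature groups written in 38 minutes,
# ~5.5 million groups in total.
groups_done, minutes = 20_000, 38
total_groups = 5_500_000

days = total_groups / (groups_done / minutes) / 60 / 24
print(f"about {days:.1f} days")
```

So roughly a week, which matches the estimate above and is about three times faster than the earlier worst-case 10,000-per-hour figure.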
OK, it finished this morning, so quicker than a week! I now have a 27 GB feature group file, 625 MB of MaRaCluster clusters and 8.5 GB of consensus spectra to enjoy!
I received errors during consensus spectra generation:
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_0_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_1_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_2_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_3_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_4_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_5_tmpfile.ms2: Permission denied
Warning: Can't remove .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part47_6_tmpfile.ms2: Permission denied
I can see the files and delete them, so this isn't a problem. However, I note that the tmpfiles go to part 47, but consensus spectra were only made up to part 16. Is this right?
Merging spectra in bin 48/48
Processing consensus spectra.
Batch 1/7
Batch 2/7
Batch 3/7
Batch 4/7
Batch 5/7
Batch 6/7
Batch 7/7
Merging spectra
Warning: Can't remove...
Creating merged MSData file
Writing file .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part16.ms2
Finished writing file .\quandenser_output_only_samples_noOX1\consensus_spectra/MaRaCluster.consensus.part16.ms2
Finished merging clusters
Running Quandenser took: 332495 cpu seconds or 332491 seconds wall time
Thanks for your help getting this through. I think the analysis will keep me quite busy now!
Excellent! It is indeed expected that the tmpfiles go to higher part numbers, so nothing to worry about there.
Superb! Thanks
After several reruns due to running out of memory and disk space, processing 241 RAW files, quandenser has completed. However, I seem to be missing critical output.
The `maracluster_extra_features` directory contains only:

I am missing the `clusters_p10.tsv` file, among others. There is also, crucially, no `Quandenser.feature_groups.tsv` in the output root. I cannot see that any errors occurred during processing (see tail of log attached). Notably, all the dinosaur files, the first round of clustering and all but two percolator-dinosaur runs were already on disk.
Help gratefully appreciated.
log3.txt