kirxkirx / vast

Variability Search Toolkit (VaST)
http://scan.sai.msu.ru/vast/
GNU General Public License v3.0

Out of memory issue with sysrem2 #9

Closed: mrosseel closed this issue 4 years ago

mrosseel commented 4 years ago

util/sysrem2 is killed while processing 20k+ images. The server has 16 GB of memory.

Any tips or ideas on how to improve this besides buying more RAM?

VAST_problems

mrosseel commented 4 years ago

I've seen the commits, thx for looking into this.

From our side: there was 16 GB of RAM and 1 GB of swap; the swap has now been increased to 20 GB and VaST has been started. This will take about 5 days, so we'll see :)

kirxkirx commented 4 years ago

Thanks for the report! The use of multithreading with OpenMP may aggravate the effect of "insufficient" memory. The quick fix might be to manually limit the number of threads at runtime: OMP_NUM_THREADS=1 util/sysrem2. I know this is not a real solution, because we do want multithreading when processing lots of data.

The proper solution might be to modify the algorithm so that it does not try to process all the measurements and/or all the stars at the same time. Maybe splitting the set of lightcurves into chunks of no more than a few thousand objects and running SysRem (and maybe the following variability search too?) on them independently would lead to an acceptable result? In practice this can be achieved by randomly splitting the lightcurve files between two or more directories, then loading and processing the data one directory at a time. (Again, a dirty workaround rather than a proper solution.)
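Something along these lines should do for the split (just a rough sketch; I'm assuming here that the lightcurves are the usual out*.dat files in the vast work directory, adjust as needed):

    # quick fix: run single-threaded to reduce the memory footprint
    OMP_NUM_THREADS=1 util/sysrem2

    # dirty workaround: split the lightcurves into two halves
    # that can be processed one at a time
    mkdir half1 half2
    i=0
    for f in out*.dat ; do
       i=$((i+1))
       if [ $((i % 2)) -eq 0 ] ; then
          mv "$f" half1/
       else
          mv "$f" half2/
       fi
    done
    # move the files from half1/ back, run util/sysrem2 on them, save the results,
    # then do the same with half2/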

I actually have trouble reproducing the out-of-memory situation for util/sysrem2, as I don't have a dataset at hand that is that big. If this is OK with you, could you please send me your lightcurves? For example, save them using util/save.sh, tar+gzip the resulting directory, and upload it to http://scan.sai.msu.ru/upload/ or share it with me via any other means.
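Something like this (the directory name below is just an example; check util/save.sh for the exact usage):

    util/save.sh saved_lightcurves
    tar -czf saved_lightcurves.tar.gz saved_lightcurves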

kirxkirx commented 4 years ago

Sorry, this took me ages, but I believe the util/sysrem2 out-of-memory issue should be solved with the latest commit. I've reviewed the code's memory usage and found large memory allocations that were used only to compute statistics (like the median correction value) rather than for the actual computations. In addition, if the number of stars is larger than 2 * SYSREM_N_STARS_IN_PROCESSING_BLOCK, the code now splits the input dataset into blocks of SYSREM_N_STARS_IN_PROCESSING_BLOCK lightcurves that are processed independently. This may take longer to compute (the number of iterations needed for the algorithm to converge seems to remain about the same as before, and now these iterations need to be repeated for each block), but it avoids loading all the lightcurves into memory at the same time.

With the current setting of SYSREM_N_STARS_IN_PROCESSING_BLOCK 6000 in src/vast_limits.h, the test dataset you sent me can be processed in less than 24 hours on my 8 GB RAM laptop. (I left it running overnight; at the time of writing it is still computing, but it is more than halfway done and definitely hasn't crashed.) If I set SYSREM_N_STARS_IN_PROCESSING_BLOCK to 1000 on a smaller test dataset, I start to see slight differences in the results compared to the everything-in-one-block computation, so it probably makes sense to keep SYSREM_N_STARS_IN_PROCESSING_BLOCK high.
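For reference, this is how one would check and change that setting before recompiling (assuming the standard make rebuild):

    # check the current value
    grep SYSREM_N_STARS_IN_PROCESSING_BLOCK src/vast_limits.h
    # edit the value in src/vast_limits.h if needed, then rebuild
    make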

kirxkirx commented 4 years ago

Yes, the util/sysrem2 run on the test data completed successfully.

mrosseel commented 4 years ago

Sounds good, thx for the effort! We're still using the released version but will switch to master.