kirxkirx / vast

Variability Search Toolkit (VaST)
http://scan.sai.msu.ru/vast/
GNU General Public License v3.0

Make cleanup stages of processing multi-threaded #8

Open mrosseel opened 4 years ago

mrosseel commented 4 years ago

Removing images and everything that comes after it is not multi-threaded and takes a very long time for 35k images.

kirxkirx commented 4 years ago

Yes, the processing speed for sets of >10k images can and should be improved. Not writing the multiple log files associated with each image is one way to save a lot of time on input/output: the working directory would then contain "a few times more than 10k" fewer files. (These image logs are mostly useful at the debugging stage rather than for mass processing.)
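For illustration only, here is a minimal C sketch of the idea of skipping per-image log files during mass processing. The flag name, the log file naming, and the function are assumptions made for this sketch, not the actual VaST code.

```c
/* Hypothetical sketch (not VaST code): guard per-image log writing behind a
 * flag so that mass processing of >10k images does not create tens of
 * thousands of extra files in the working directory. */
#include <stdio.h>

static int write_image_logs = 0; /* 0 = skip per-image logs during mass processing */

static void log_image_result(const char *image_name, int n_detected_stars) {
    if (!write_image_logs)
        return; /* save the open/write/close I/O for every image */
    char logname[1024];
    snprintf(logname, sizeof(logname), "%s.log", image_name);
    FILE *fp = fopen(logname, "w");
    if (fp == NULL)
        return;
    fprintf(fp, "%s: %d stars detected\n", image_name, n_detected_stars);
    fclose(fp);
}

int main(void) {
    log_image_result("image0001.fit", 1234); /* no file written while the flag is off */
    return 0;
}
```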

Can you please specify what exactly you mean by the "removing images" stage? What is VaST writing to the terminal at that point?

mrosseel commented 4 years ago

Fewer files would certainly help!

Also, it would help to make multi-threaded what is currently single-threaded, such as:

kirxkirx commented 4 years ago

I'm sorry, so far I have been unable to get a speed-up from parallelizing these steps. Basically, all of them are limited by the need to read every lightcurve (out*.dat) file and, depending on its content, either delete the file or replace it with a modified version. The procedure seems to be limited by the disk I/O speed rather than by CPU usage. If I try to read the files in parallel, I get about the same execution time but use all processor cores instead of just one, so there is no speed-up. Maybe the result would be different on a system with very fast disk I/O, but I need to see that before introducing changes to the code.
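To make that experiment concrete, below is a minimal, hypothetical OpenMP sketch of reading the out*.dat lightcurve files in parallel and deleting the ones that are too short. The out*.dat pattern comes from the discussion above; the OpenMP loop, the "fewer than 2 points" cut, and everything else are assumptions for this sketch, not the actual VaST code. Because each iteration is dominated by fopen/fgets/remove, spreading it over many cores mostly queues more requests on the same disk, which matches the observation above.

```c
/* Hypothetical sketch (not VaST code): scan out*.dat lightcurve files in
 * parallel with OpenMP and delete the ones with fewer than 2 points.
 * Build: gcc -O2 -fopenmp parallel_cleanup.c -o parallel_cleanup */
#include <stdio.h>
#include <glob.h>

int main(void) {
    glob_t g;
    if (glob("out*.dat", 0, NULL, &g) != 0) {
        fprintf(stderr, "No out*.dat lightcurve files found\n");
        return 1;
    }

    /* Each iteration is dominated by disk I/O (open, read, delete),
     * so the parallel loop gives little or no wall-clock speed-up. */
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)g.gl_pathc; i++) {
        FILE *fp = fopen(g.gl_pathv[i], "r");
        if (fp == NULL)
            continue;

        long n_points = 0;
        char line[4096];
        while (fgets(line, sizeof(line), fp) != NULL)
            n_points++; /* one line per observation: "JD mag err ..." */
        fclose(fp);

        if (n_points < 2)
            remove(g.gl_pathv[i]); /* drop lightcurves that are too short */
    }

    globfree(&g);
    return 0;
}
```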

In the meantime, I've made some minor improvements to the lightcurve reading routine (an option not to parse the whole VaST-format lightcurve string when we are interested only in the first three columns "JD mag err"), which provides a small but measurable speed-up on large sets of lightcurves.
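As a rough illustration of that optimization, the sketch below reads only the leading "JD mag err" columns of a lightcurve line with sscanf and ignores the rest of the string. The function name and the example line are made up for this sketch and are not the actual VaST routine.

```c
/* Hypothetical sketch of the "parse only the first three columns" idea.
 * A VaST-format lightcurve line starts with "JD mag err" followed by more
 * columns; when only those three values are needed, sscanf can stop after
 * them instead of tokenizing the whole string. */
#include <stdio.h>

/* Returns 1 on success, 0 if the line does not start with three numbers. */
static int read_jd_mag_err(const char *line, double *jd, double *mag, double *err) {
    return sscanf(line, "%lf %lf %lf", jd, mag, err) == 3;
}

int main(void) {
    /* Example line with extra trailing columns that are deliberately ignored. */
    const char *example = "2455555.12345 12.345 0.012 345.6 789.0 image0001.fit";
    double jd, mag, err;
    if (read_jd_mag_err(example, &jd, &mag, &err))
        printf("JD=%.5f mag=%.3f err=%.3f\n", jd, mag, err);
    return 0;
}
```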

mrosseel commented 4 years ago

Thanks for looking into this! Am I correct in assuming that these files are only read once for each filter? (number of points, outliers, fewer than 2 points)

As discussed above, did you also look at limiting the log files? That would help with speed, but directories also get very slow with that many files :). I will leave this issue open in case you want to add something, but feel free to close it if you feel it is done.