spacetelescope / hstcal

Calibration for HST/WFC3, HST/ACS, and HST/STIS
BSD 3-Clause "New" or "Revised" License

Notes for speedup of cte code #20

Open sosey opened 7 years ago

sosey commented 7 years ago

DMS would like the code to run faster, but they don't want to use the parallel processing inside the code; they would rather use the coarse-grained parallel processing in HTCondor. So these are notes on speeding up the algorithm on a single processor.

For later reference, from the performance profiling I did a couple months ago:

[image: calwf3_speed_shortlist]

Also need to go back and do a few different types of runs to validate these numbers; the current eye-bleeding figure:

[image: output]

Making this an issue now, but we may expand to a project if ACS and WFC3 want to work on it together. The earliest any updates from this work would go into DMS is summer 2017.

pllim commented 7 years ago

c/c @jamienoss

jamienoss commented 7 years ago

@pllim thanks, was watching this already :) @sosey have you tried letting the optimizers do the grunt work? You could bump the compiler optimization level up from -O2 to -O3. I would then look at the vectorization report and go through any loops that it was having difficulty vectorizing. I am not sure yet how memory intensive the code is, or can be when multiple processes are being batch processed on the same node, but it might also be an idea to look into hardwiring the CPU affinity and turning off hyperthreading.

Has there been any thought for trying this on a gpu?

pllim commented 7 years ago

Re: GPU. Someone tried that many years ago, but it was too hardware dependent to be distributed to the general public (or at least that was my understanding of it). Maybe @stsci-hack remembers more details.

sosey commented 7 years ago

@jamienoss yes on the optimizations, have done all that, and other things as well. GPUs are NOT an option. Speeding the code up is secondary right now to getting it to work on ACS. I would not waste time investigating optimizations until you familiarize yourself with the ACS code tree and get the basic changes implemented; this may take more time than you realize. I'll be away for 3 weeks, so wait until I get back from vacation and we can discuss.

sosey commented 7 years ago

@jamienoss vectorizing this code is difficult because of lexical data dependence in the loops. We are also building with an older GCC with OpenMP < 4 and no SIMD support.
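
To illustrate the kind of data dependence mentioned above, here is a minimal sketch in C. It is not taken from the hstcal source: the function names `scale_pixels` and `trail_pixels`, and the parameters `gain` and `frac`, are hypothetical and chosen only for illustration. The first loop has independent iterations, which a compiler can usually vectorize at -O3. The second reads the value written in the previous iteration, so the iterations must run in order and the vectorizer will generally reject the loop. Recent GCC versions can report such loops with `-O3 -fopt-info-vec-missed`; an older GCC may need a different reporting flag.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i],
 * so the compiler is free to process several elements at once. */
void scale_pixels(const float *in, float *out, size_t n, float gain)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = gain * in[i];
}

/* Loop-carried dependence (hypothetical example): col[i] needs the
 * already-updated col[i - 1], so iteration i cannot start before
 * iteration i - 1 has finished. */
void trail_pixels(float *col, size_t n, float frac)
{
    assert(frac >= 0.0f && frac <= 1.0f);
    for (size_t i = 1; i < n; ++i)
        col[i] += frac * col[i - 1];
}
```

Calling `trail_pixels` on a column `{8, 0, 0, 0}` with `frac = 0.5f` leaves `{8, 4, 2, 1}`: each element is built from its updated predecessor, which is why the loop cannot simply be split into independent SIMD lanes.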

mdlpstsci commented 1 year ago

The ACS and WFC3 CTE code has diverged, and the much newer WFC3 code has already achieved a faster execution time as implemented by Jay. Further improvement in execution time would be beneficial, but it is a lower priority than implementing the algorithms needed to improve the science data.