Speed up the simulation

lucabaldini commented 11 months ago

At this point hxsim runs at something between 200 and 1000 events per seconds depending of the machine and on the particular setup. This implies roughly O(1 hour) for 1 M events, which is not outrageous, but still kind of annoying. It would be nice to speed things up by at least a factor of a few. A few lines of action might be:

check if we're doing something wrong with the generation of the noise on the matrix---it looks like adding the noise per se slows things down by a factor of ~2;
check if I/O is a bottleneck at all;
check what's the smallest sensible padding that we can use, in order to reduce the event size;
check if there's something that we can vectorize without complicating the logic flow too much.

I am sure there are others.

lucabaldini commented 11 months ago

I think I understand the issue with the noise---we are adding the noise in a idiotic way: we are generating a gaussian random variate for each pixel in the matrix, that is, 100 k random numbers for each event, while it would make much more sense to add the noise only within the ROI at the very end. (We need to think carefully about what we want to do for the noise at the trigger level, but that can be done in a later stage, as it' s much less critical for the modeling of the performance.)

lucabaldini commented 11 months ago

Tentative fix for the noise: https://github.com/lucabaldini/hexsample/commit/7488adf60277423bfde67db687101484c4e3e423

lucabaldini commented 11 months ago

Reference plots for a 10k event simulation with the main branch (0.3.1) with nominal settigs clusize_old pha_old

lucabaldini commented 11 months ago

And same thing after the change to the noise simulation clusize_new pha_new

lucabaldini commented 11 months ago

And the corresponding timing: old

10000it [00:23, 421.00it/s]

vs new

10000it [00:08, 1192.19it/s]

It looks like we gain about a factor of 3!

lucabaldini commented 11 months ago

I took a cursory look at the importance of the I/O, and if we disable the writing of the output file altogether the net improvement in the simulation speed is about 10%---so there's probably not all that much to be gained in there.

lucabaldini commented 11 months ago

And here is a slightly more detailed breakout of the simulation flow after the fix to the noise generation (this is just for the explicit event loop in Python after the photon list generation):

propagation of the charge to the readout: 60 us/event
digitization: 700 us/event
I/O to file: 60 us/event

We should definitely look into the digitization to see if there is room for improvement there.

lucabaldini commented 11 months ago

Breaking out the contributions in the digitization:

hexagonal sampling: 250 us/event
trigger: 400 us/event
calculation of the ROI: 50 us/event

(All numbers are +/- 10%, but you get an idea of the orders of magnitude.)

lucabaldini commented 11 months ago

Provisional conclusions are:

we definitely want to move the generation of the noise in the ROI, at the very end of the simulation process;
we want to look into HexagonalReadout.trigger() and HexagonalReadout.sample() methods (in this order) to see whether there are obvious ways to speed up things;
we leave everything else as it is, for the time being.

At this point, with the noise tweak, we are running at 1200 Hz on my laptop. Reaching 5 kHz would be awesome, but really anything in between is probably fine at this stage.

lucabaldini commented 11 months ago

And when we push the noise change into production we want to open an issue to remind ourselves that in the future we might want to do something with the noise at the trigger level.

lucabaldini commented 11 months ago

Here are some more precise tests of the four basic steps going into the sampling and triggering:

>>> world_to_pixel(): 0.425 s / 10000 = 42.5 us per event
>>> create_hist(): 1.374 s / 10000 = 137.4 us per event
>>> Signal shape: (352, 304)
>>> sum_miniclusters(): 4.205 s / 10000 = 420.5 us per event
>>> Trigger shape: (176, 152)
>>> zero_suppress(): 0.190 s / 10000 = 19.0 us per event

I think it's pretty obvious that the bottleneck is dealing with relatively large sparse arrays (i.e., whose vast majority is made of zeros.)

lucabaldini commented 11 months ago

And I now have a full implementation where I carefully avoid large sparse arrays and is significantly faster than the original: https://github.com/lucabaldini/hexsample/pull/22/commits/9048d1ce45d6f66ea92ab8635a9808fbe9b91a15

(The new implementation is significantly less clear from the coding standpoint but the speedup is probably worth the effort.)

lucabaldini commented 11 months ago

I have refactored the creation of the initial 2-dimensional histogram https://github.com/lucabaldini/hexsample/commit/76eb313af660d98a8d82335e65614bff862824a8 which gives another significant push in speed.

lucabaldini commented 11 months ago

At this point we are running at around 4.5 kHz on my laptop, that is, a factor of ~10 faster than at the beginning.

A deterministic profiling yields

         14481935 function calls (14059257 primitive calls) in 14.617 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1504/1    0.022    0.000   14.666   14.666 {built-in method builtins.exec}
        1    0.000    0.000   14.666   14.666 hxsim.py:1(<module>)
        1    0.115    0.115   13.607   13.607 hxsim.py:53(hxsim)
    50000    0.085    0.000    5.958    0.000 digi.py:365(read)
    50000    0.430    0.000    4.151    0.000 digi.py:232(sample)
    50000    0.099    0.000    3.646    0.000 fileio.py:264(add_row)
    50000    0.131    0.000    3.349    0.000 vlarray.py:493(append)
    50000    0.113    0.000    3.148    0.000 mc.py:65(propagate)
   100003    3.038    0.000    3.038    0.000 {method 'normal' of 'numpy.random.mtrand.RandomState' objects}
    50000    0.263    0.000    2.913    0.000 hexagon.py:299(world_to_pixel)
    50000    2.504    0.000    2.504    0.000 {method '_append' of 'tables.hdf5extension.VLArray' objects}
    50000    0.868    0.000    1.724    0.000 hexagon.py:252(_axial_round)
    50000    0.375    0.000    1.560    0.000 digi.py:288(trigger)
950648/550389    0.462    0.000    1.549    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       95    0.003    0.000    1.392    0.015 __init__.py:1(<module>)
   1099/9    0.003    0.000    1.059    0.118 <frozen importlib._bootstrap>:1022(_find_and_load)
   1095/9    0.002    0.000    1.059    0.118 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
   1044/9    0.002    0.000    1.058    0.118 <frozen importlib._bootstrap>:664(_load_unlocked)
    895/8    0.001    0.000    1.058    0.132 <frozen importlib._bootstrap_external>:877(exec_module)
  1563/10    0.001    0.000    1.057    0.106 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
        1    0.000    0.000    0.970    0.970 fileio.py:1(<module>)
        1    0.000    0.000    0.947    0.947 mc.py:1(<module>)
  732/100    0.001    0.000    0.908    0.009 {built-in method builtins.__import__}
 1763/440    0.002    0.000    0.886    0.002 <frozen importlib._bootstrap>:1053(_handle_fromlist)
   750161    0.875    0.000    0.875    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   200001    0.076    0.000    0.746    0.000 <__array_function__ internals>:177(round_)
        1    0.000    0.000    0.644    0.644 source.py:1(<module>)
        1    0.000    0.000    0.642    0.642 plot.py:1(<module>)
   200001    0.065    0.000    0.603    0.000 fromnumeric.py:3764(round_)
        1    0.000    0.000    0.579    0.579 pyplot.py:1(<module>)
2939/2766    0.027    0.000    0.579    0.000 {built-in method builtins.__build_class__}
    50000    0.531    0.000    0.560    0.000 hexagon.py:228(_float_axial)
   200001    0.081    0.000    0.538    0.000 <__array_function__ internals>:177(around)
        1    0.000    0.000    0.483    0.483 mc.py:92(__init__)
        1    0.000    0.000    0.477    0.477 sensor.py:250(rvs_absz)
        1    0.000    0.000    0.477    0.477 sensor.py:240(rvs_absorption_depth)
        1    0.000    0.000    0.473    0.473 sensor.py:92(photoelectric_attenuation_length)
        1    0.000    0.000    0.473    0.473 sensor.py:59(_attenuation_length)
        1    0.000    0.000    0.473    0.473 materials.py:65(material_mu)
        1    0.000    0.000    0.472    0.472 xray.py:290(mu_elam)
        1    0.000    0.000    0.472    0.472 xraydb.py:665(mu_elam)
        1    0.001    0.001    0.472    0.472 xraydb.py:616(cross_section_elam)
      101    0.000    0.000    0.472    0.005 artist.py:129(_update_set_signature_and_docstring)
        1    0.002    0.002    0.471    0.471 utils.py:110(elam_spline)
      100    0.000    0.000    0.469    0.005 artist.py:105(__init_subclass__)
        1    0.102    0.102    0.461    0.461 utils.py:130(<listcomp>)
   300000    0.060    0.000    0.458    0.000 {method 'min' of 'numpy.ndarray' objects}
   400005    0.136    0.000    0.438    0.000 fromnumeric.py:51(_wrapfunc)
   300000    0.052    0.000    0.397    0.000 _methods.py:42(_amin)
    50000    0.134    0.000    0.388    0.000 vlarray.py:441(_getnobjects)
   200001    0.071    0.000    0.386    0.000 fromnumeric.py:3257(around)
    50000    0.240    0.000    0.367    0.000 hexagon.py:276(_axial_to_offset)
   100000    0.028    0.000    0.359    0.000 <__array_function__ internals>:177(flatnonzero)
   200000    0.042    0.000    0.325    0.000 {method 'max' of 'numpy.ndarray' objects}
   200001    0.053    0.000    0.318    0.000 {method 'sum' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.316    0.316 figure.py:1(<module>)
      114    0.001    0.000    0.308    0.003 artist.py:1743(kwdoc)
   100000    0.043    0.000    0.305    0.000 numeric.py:625(flatnonzero)
        1    0.000    0.000    0.303    0.303 sensor.py:1(<module>)
   200002    0.064    0.000    0.291    0.000 <__array_function__ internals>:177(nonzero)
   200000    0.032    0.000    0.282    0.000 _methods.py:38(_amax)
   200001    0.036    0.000    0.265    0.000 _methods.py:46(_sum)
    50000    0.025    0.000    0.262    0.000 <__array_function__ internals>:177(bincount)
    33103    0.015    0.000    0.258    0.000 inspect.py:725(getdoc)
    50000    0.037    0.000    0.252    0.000 utils.py:107(convert_to_np_atom2)
      216    0.016    0.000    0.242    0.001 artist.py:1435(get_setters)
    32048    0.148    0.000    0.240    0.000 inspect.py:744(cleandoc)
    50000    0.044    0.000    0.239    0.000 digi.py:190(sum_miniclusters)
        1    0.000    0.000    0.230    0.230 colorbar.py:1(<module>)
    27386    0.010    0.000    0.225    0.000 artist.py:1454(is_alias)
    50001    0.023    0.000    0.223    0.000 <__array_function__ internals>:177(sum)
      114    0.003    0.000    0.218    0.002 artist.py:1504(pprint_setters)
    50000    0.029    0.000    0.215    0.000 utils.py:78(convert_to_np_atom)
   200001    0.209    0.000    0.209    0.000 {method 'round' of 'numpy.ndarray' objects}
    50001    0.042    0.000    0.196    0.000 std.py:1161(__iter__)
    50001    0.042    0.000    0.176    0.000 fromnumeric.py:2160(sum)
      216    0.000    0.000    0.170    0.001 artist.py:1340(__init__)
      216    0.008    0.000    0.170    0.001 artist.py:1360(get_aliases)
   200002    0.043    0.000    0.167    0.000 fromnumeric.py:1866(nonzero)
    50000    0.022    0.000    0.158    0.000 flavor.py:183(array_of_flavor)
        1    0.000    0.000    0.158    0.158 _stats_py.py:1(<module>)
        1    0.000    0.000    0.156    0.156 _subplots.py:1(<module>)
        1    0.000    0.000    0.146    0.146 _axes.py:1(<module>)
9800/9258    0.023    0.000    0.143    0.000 inspect.py:2375(_signature_from_callable)
    50001    0.066    0.000    0.141    0.000 mc.py:106(__next__)
     9106    0.003    0.000    0.141    0.000 inspect.py:3252(signature)
   100004    0.027    0.000    0.138    0.000 <__array_function__ internals>:177(ravel)
     9106    0.003    0.000    0.138    0.000 inspect.py:2998(from_callable)
    50001    0.052    0.000    0.132    0.000 numeric.py:289(full)
    50160    0.055    0.000    0.129    0.000 fromnumeric.py:69(_wrapreduction)
        1    0.000    0.000    0.120    0.120 contour.py:1(<module>)
    50000    0.116    0.000    0.116    0.000 digi.py:200(zero_suppress)
        1    0.000    0.000    0.114    0.114 distributions.py:1(<module>)
    50001    0.094    0.000    0.110    0.000 hexagon.py:193(_parity_offset)
   100032    0.110    0.000    0.110    0.000 {method 'astype' of 'numpy.ndarray' objects}
        2    0.000    0.000    0.100    0.050 collections.py:1(<module>

lucabaldini commented 11 months ago

The table is not trivial to parse, as nested functions are included in top-level ones, but at a first glance I would say that the cross a 10% threshold on the total wall time are

50000    0.430    0.000    4.151    0.000 digi.py:232(sample)
50000    0.099    0.000    3.646    0.000 fileio.py:264(add_row)
50000    0.113    0.000    3.148    0.000 mc.py:65(propagate)
50000    0.263    0.000    2.913    0.000 hexagon.py:299(world_to_pixel) -> called from within digi.py (sample)
50000    0.375    0.000    1.560    0.000 digi.py:288(trigger)

The mc call is fully vectorized and I don't think there's anything we can really do there. It looks like the last two things we might want to look into are the I/O and the world_to_pixel() conversion.

lucabaldini commented 11 months ago

At a first glance it is not obvious how to squeeze anything out of the world_to_pixel() function.

As expected, all of the I/O time goes into the variable-length side of things. Commenting the single line

self.pha_array.append(digi_event.pha.flatten())

at this point would give us roughly a 50% increase. Looking into the pytables docs...

lucabaldini commented 11 months ago

Nothing obvious from the docs. At this point I would say that for this first optimization we stop here and:

cleanup the code;
update the docs;
fix the unit tests;
verify that we get the same answer (event-wise) as before.

lucabaldini commented 11 months ago

I believe we are ready to merge this one, pending updating the release notes.

lucabaldini commented 11 months ago

Shipped in 0.4.0

lucabaldini / hexsample

Speed up the simulation #12