momentoscope / hextof-processor

Code for preprocessing data from the HEXTOF instrument at FLASH, DESY in Hamburg (DE)
https://hextof-processor.readthedocs.io/en/latest/
GNU General Public License v3.0

Boost Histogram #73

Closed. zain-sohail closed this issue 3 years ago.

zain-sohail commented 3 years ago

I've pushed a new branch implementing this rather slow method to give a weighted axis rather than the mean of the bin edges. I also changed the numpy histogram method to boost_histogram's. Overall it's slower because of the finer binning necessary for weighting. It would be helpful if I could get suggestions on how to improve this implementation.

PS: I also pushed the updated setup.py etc., which had not been merged into master before.
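For concreteness, the weighting idea can be sketched as a standalone numpy example (a hypothetical illustration, not the branch code; the axis range and the 100x refinement factor are made up):

```python
import numpy as np

# Hypothetical sketch of the weighted-axis idea: bin 100x finer than
# requested, then collapse each group of fine bins into one coarse bin
# whose centre is the count-weighted mean of the fine-bin centres
# instead of the midpoint of the coarse edges.
rng = np.random.default_rng(0)
data = rng.normal(700.0, 5.0, 10_000)

n_coarse, factor = 20, 100
coarse_edges = np.linspace(690.0, 710.0, n_coarse + 1)
fine_edges = np.linspace(690.0, 710.0, n_coarse * factor + 1)

fine_counts, _ = np.histogram(data, bins=fine_edges)
fine_centres = 0.5 * (fine_edges[:-1] + fine_edges[1:])

grouped_counts = fine_counts.reshape(n_coarse, factor)
grouped_centres = fine_centres.reshape(n_coarse, factor)

coarse_counts = grouped_counts.sum(axis=1)
midpoints = 0.5 * (coarse_edges[:-1] + coarse_edges[1:])
with np.errstate(invalid="ignore", divide="ignore"):
    weighted = (grouped_counts * grouped_centres).sum(axis=1) / coarse_counts
# empty coarse bins fall back to the plain edge midpoint
weighted = np.where(coarse_counts > 0, weighted, midpoints)
```

The refinement factor makes the memory cost explicit: each axis grows by that factor, which compounds quickly in multi-dimensional binning.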

steinnymir commented 3 years ago

Although the idea is nice, this implementation is probably going to crash memory, since multiplying the size of each bin by 100 will result in enormous arrays. We are already often hitting memory errors when binning in 4D. I wouldn't do this, as the advantages are not so large while the computation is barely feasible. Unless we find a histogram method which returns the bin positions directly (or write our own! 😮 )

Boost Histograms sound nice, but I would not break away from the current implementation; rather, leave an option to choose which method to use. For example:

def computeBinnedData(..., method='numpy'):
    ...
    if method == 'numpy':
        histogram = np.histogramdd
    elif method == 'boost':
        histogram = bt.numpy.histogramdd  # boost_histogram's numpy-compatible dd binning
    res, edges = histogram(vals[:, colsToBin], bins=self.binRangeList)

Then we can also easily compare the performance of one versus the other!
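A self-contained sketch of that switch (the function name and the guarded-import fallback are illustrative, not existing processor code; boost_histogram exposes a numpy-compatible `histogramdd` in its `numpy` submodule):

```python
import numpy as np

# Illustrative backend switch: pick the histogramming function by name.
# boost_histogram.numpy.histogramdd mirrors numpy's signature, so the
# call site is identical for both backends. The import is guarded so
# the snippet still runs where boost_histogram is not installed.
def histogramdd_by_method(sample, bins, method="numpy"):
    if method == "boost":
        try:
            import boost_histogram as bh
            return bh.numpy.histogramdd(sample, bins=bins)
        except ImportError:
            pass  # fall back to numpy below
    return np.histogramdd(sample, bins=bins)

rng = np.random.default_rng(1)
sample = rng.random((1_000, 2))
edges = [np.linspace(0.0, 1.0, 11)] * 2
counts, out_edges = histogramdd_by_method(sample, edges, method="numpy")
```

Because both backends share numpy's signature, swapping them needs no change at the call site, which keeps any timing comparison fair.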

zain-sohail commented 3 years ago

Although the idea is nice, this implementation is probably going to crash memory, since multiplying the size of each bin by 100 will result in enormous arrays. We are already often hitting memory errors when binning in 4D. I wouldn't do this, as the advantages are not so large while the computation is barely feasible. Unless we find a histogram method which returns the bin positions directly (or write our own! 😮 )

Indeed. This is how I've done the axis weighting before, but not in 4D. I couldn't find or think of another way to weight the axis on binned data, and I don't think any library implements that either. An alternative is to compute the optimal bin width using the Freedman-Diaconis estimator, Sturges' rule, etc., or to use variable bin widths and compute the histogram with those.

Boost Histograms sound nice, but I would not break away from the current implementation; rather, leave an option to choose which method to use. For example:

def computeBinnedData(..., method='numpy'):
    ...
    if method == 'numpy':
        histogram = np.histogramdd
    elif method == 'boost':
        histogram = bt.numpy.histogramdd  # boost_histogram's numpy-compatible dd binning
    res, edges = histogram(vals[:, colsToBin], bins=self.binRangeList)

Then we can also easily compare the performance of one versus the other!

That makes sense. I'll implement it that way then, and also try to test the performance.

zain-sohail commented 3 years ago

I've pushed my implementation. The boost histogramdd method is significantly faster on a test dataset (timing plot attached).

However, there is no significant difference when I use computeBinnedData with or without boost. I am not sure, but perhaps it is due to the data partitioning we do in that method. Boost has a method for threading the data by itself; perhaps that's worth looking into?

steinnymir commented 3 years ago

Yes, I am afraid the manual parallelization we force in computeBinnedData reserves all cores and doesn't leave room for more parallelization. Basically, we iterate over the partitions and give N at a time to dask to compute, with N defined in processor.N_CORES.

You can check whether this changes by setting N_CORES to something smaller and seeing how the times change. Actually, a systematic evaluation of n_cores used vs. performance, with the two binning methods, would be amazing! It would also be interesting to see how this changes when binning 1D arrays vs. 3-4D arrays, since it might be significantly different.

zain-sohail commented 3 years ago

Yes, I am afraid the manual parallelization we force in computeBinnedData reserves all cores and doesn't leave room for more parallelization. Basically, we iterate over the partitions and give N at a time to dask to compute, with N defined in processor.N_CORES.

You can check whether this changes by setting N_CORES to something smaller and seeing how the times change. Actually, a systematic evaluation of n_cores used vs. performance, with the two binning methods, would be amazing! It would also be interesting to see how this changes when binning 1D arrays vs. 3-4D arrays, since it might be significantly different.

I tried this for different dimensions with different N_CORES definition. However, no performance gain is found with this implementation. I found this on boost.org

Filling a histogram with growing axes in a multi-threaded environment is safe, but has poor performance since the histogram must be locked on each fill. The locks are required because an axis could grow each time, which changes the number of cells and cell addressing for all other threads. Even without growing axes, there is only a performance gain of filling a thread-safe histogram in parallel if the histogram is either very large or when significant time is spent in preparing the value to fill. For small histograms, threads frequently access the same cell, whose state has to be synchronized between the threads. This is slow even with atomic counters, since different threads are usually executed on different cores and the synchronization causes cache misses that eat up the performance gained by doing some calculations in parallel.

For parallelization, they suggest a method where each thread has its own copy of the histogram; each copy is filled independently and then added together on the main thread.

I will try to implement it the way boost suggests and then compare the numpy method against theirs.
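The strategy the boost docs describe can be sketched with plain numpy and threads (a generic illustration, not the processor code; the chunk count and binning are arbitrary):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Sketch of the recommended fill strategy: every worker fills its own
# private histogram over one chunk of the data, and the partial
# histograms are summed at the end (no shared state, no locking).
rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 100_000)
edges = np.linspace(-4.0, 4.0, 81)

def fill_chunk(chunk):
    counts, _ = np.histogram(chunk, bins=edges)
    return counts

chunks = np.array_split(data, 8)
with ThreadPoolExecutor(max_workers=8) as pool:
    partials = list(pool.map(fill_chunk, chunks))
total = np.sum(partials, axis=0)
```

Summing after filling is exact, since histogram counts are additive over disjoint chunks of the data.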

steinnymir commented 3 years ago

That is how our method works. Here we split the dask partitions between the different cores and have each create its own copy of the target histogram. This is then later summed up on a single core here.

zain-sohail commented 3 years ago

That is how our method works. Here we split the dask partitions between the different cores and have each create its own copy of the target histogram. This is then later summed up on a single core here.

I've reimplemented the boost method using our framework. From my tests, boost and numpy don't seem to differ significantly in timing. This could be due to an implementation problem, because on their website boost reports performance an order of magnitude faster than numpy. There is also a small difference in the resulting histograms (see the timing_tests notebook). A python notebook is also included for debugging purposes.

Boost has different axis types. The Regular axis takes the number of steps plus min and max, whereas the Variable axis takes the bin edges. I thought it better to implement the Variable axis, since it can make use of our preexisting methods.
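For reference, the two axis flavours can be sketched like this (the import is guarded so the snippet degrades gracefully where boost_histogram is missing; the edge values are illustrative, not taken from the processor):

```python
# Illustrative sketch of the two boost_histogram axis types; the axis
# ranges below are made up, not taken from the HEXTOF processor.
try:
    import boost_histogram as bh

    # Regular: number of bins plus (min, max)
    regular_axis = bh.axis.Regular(100, 690.0, 710.0)
    # Variable: explicit bin edges, matching pre-computed bin lists
    variable_axis = bh.axis.Variable([690.0, 695.0, 700.0, 710.0])

    hist = bh.Histogram(variable_axis)
    hist.fill([691.0, 696.0, 705.0])
except ImportError:
    bh = None  # boost_histogram not available; nothing to demonstrate
```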

An interesting thing I noted (for the example data): moving to a multicore setup does speed things up, but from 5 to 15 cores there is barely any difference in timing. Perhaps this should be tested on a bigger dataset. (Attached plot 1D_lib_comparison: x axis is N_CORES, y axis is time.)

steinnymir commented 3 years ago

Interesting result! It seems indeed that the biggest change comes from using any parallelization at all (as expected...), but that increasing the number of cores does not help; it even seems to get slower!

I tried to test this myself, but I did not get the boost binning running; somehow checking out your branch gave me a strangely broken version... Could you show me an example of how you perform binning with your implementation of boost? My idea was to test the performance for 1, 2, 3 and 4D binnings of 100 pts per axis, for 1, 2, 4, 8, ... 64, 96 cores, as maxwell gives us 96 for one jupyter node. In code:

import time
import numpy as np

dldTime = ('dldTime', 690, 710, 0.24)
delayStage = ('dldMicrobunchId', 0, 100, 1)
dldPosX = ('dldPosX', 450, 950, 5)
dldPosY = ('dldPosY', 450, 950, 5)
bin_combinations = [[dldTime],
                    [dldTime, delayStage],
                    [dldTime, delayStage, dldPosX],
                    [dldTime, delayStage, dldPosX, dldPosY]]

def bin(prc, bins, method):
    """this function should be corrected to work with the numpy and boost methods"""
    prc.resetBins()
    for axis in bins:
        prc.addBinning(*axis)
    return prc.computeBinnedData(method=method)

n_cpus = [1, 2, 4, 8, 16, 32, 64, 96]
results = {'numpy': {'1D': [], '2D': [], '3D': [], '4D': []},
           'boost': {'1D': [], '2D': [], '3D': [], '4D': []}}
for n in n_cpus:
    prc.N_CORES = n  # assumed: the processor's worker pool size follows N_CORES
    for method in 'numpy', 'boost':
        for i, bins in enumerate(bin_combinations):
            t0 = time.time()
            try:
                _ = bin(prc, bins, method)
                dt = time.time() - t0
            except Exception:
                dt = np.nan
            results[method][f'{i+1}D'].append(dt)

This should work with almost any run from any beamtime, and one could also check the difference between long and short counts (number of partitions...). It could also be tested with data in memory (prc.readData()) or with parquet files (prc.readDataframes()).

Anyway, if you just get me a working function to stick in there (the bin function in the code above), I can also do this...

zain-sohail commented 3 years ago

Interesting result! It seems indeed that the biggest change comes from using any parallelization at all (as expected...), but that increasing the number of cores does not help; it even seems to get slower!

Yes, indeed that is odd. But I tested with 1, 5, 10 and 15 cores; perhaps it needs to be 2^n.

I tried to test this myself, but I did not get the boost binning running; somehow checking out your branch gave me a strangely broken version... Could you show me an example of how you perform binning with your implementation of boost? My idea was to test the performance for 1, 2, 3 and 4D binnings of 100 pts per axis, for 1, 2, 4, 8, ... 64, 96 cores, as maxwell gives us 96 for one jupyter node. In code:

There is a notebook called timing_tests. It's a rather rudimentary way of checking, built upon the example 1 notebook. I think the breakage occurs because I pushed the updated setup.cfg, setup.py etc. It works for me on maxwell but not locally, so perhaps there is a problem there. But if you just use the new DldProcessor file with your current setup, it should work.

import time
import numpy as np

dldTime = ('dldTime', 690, 710, 0.24)
delayStage = ('dldMicrobunchId', 0, 100, 1)
dldPosX = ('dldPosX', 450, 950, 5)
dldPosY = ('dldPosY', 450, 950, 5)
bin_combinations = [[dldTime],
                    [dldTime, delayStage],
                    [dldTime, delayStage, dldPosX],
                    [dldTime, delayStage, dldPosX, dldPosY]]

def bin(prc, bins, method):
    """this function should be corrected to work with the numpy and boost methods"""
    prc.resetBins()
    for axis in bins:
        prc.addBinning(*axis)
    return prc.computeBinnedData(method=method)

n_cpus = [1, 2, 4, 8, 16, 32, 64, 96]
results = {'numpy': {'1D': [], '2D': [], '3D': [], '4D': []},
           'boost': {'1D': [], '2D': [], '3D': [], '4D': []}}
for n in n_cpus:
    prc.N_CORES = n  # assumed: the processor's worker pool size follows N_CORES
    for method in 'numpy', 'boost':
        for i, bins in enumerate(bin_combinations):
            t0 = time.time()
            try:
                _ = bin(prc, bins, method)
                dt = time.time() - t0
            except Exception:
                dt = np.nan
            results[method][f'{i+1}D'].append(dt)

This should work with almost any run from any beamtime, and one could also check the difference between long and short counts (number of partitions...). It could also be tested with data in memory (prc.readData()) or with parquet files (prc.readDataframes()).

I will then try to use this to test the method.

Anyway, if you just get me a working function to stick in there (the bin function in the code above), I can also do this...

Please use the current implementation of DldProcessor.py; it has all the methods. One has to use method='boost' in both addBinning and computeBinnedData, as boost uses a different binning approach (as mentioned in the previous comment).

steinnymir commented 3 years ago

A quick look at the performance with an increasing number of cores shows very disappointing behaviour (see attached plot). Clearly there is no gain in using more than 16 cores in the 1, 2 and 3D binnings. The 4D case probably behaves very differently because of the large amount of memory it requires. In any case, this shows we need more improvement on the binning side! 😉 Even boost implemented this way doesn't seem to make it better...

Oh... bad plots with no axes: x is number of cores, y is binning time in seconds.

zain-sohail commented 3 years ago

Do you think that with a bigger dataset the results would favor a higher core count?

I am not sure where the boost implementation loses time, so I will look into it and try to fix the issue.

steinnymir commented 3 years ago

I am not sure what to expect, but I think measuring only the time it takes to run prc.computeBinnedData might be misleading. We should analyse the individual pieces in there and see where the slow parts are and where there is room for improvement. It might also be that too much dask overhead is limiting boost's performance...
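One low-effort way to break that timing apart (a generic sketch, not existing processor code) is to profile a single call and inspect the cumulative times of its internals:

```python
import cProfile
import io
import pstats

# Generic helper: profile one call and print the ten most expensive
# internal steps by cumulative time.
def profile_call(fn, *args, **kwargs):
    prof = cProfile.Profile()
    prof.enable()
    result = fn(*args, **kwargs)
    prof.disable()
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
    print(buf.getvalue())
    return result

# usage on the processor would be e.g.:
#   profile_call(prc.computeBinnedData, method='boost')
value = profile_call(sorted, range(1000, 0, -1))
```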

zain-sohail commented 3 years ago

Boost did not help with our use case, but perhaps in MPES (without the dask overhead) it offers a greater improvement.