Slow Sorting (PR264) - Githubissues

nichubbard commented 8 years ago

We've noticed sorting large runs on the PR264 branch can become very slow (100/s, slower than "live"). It was proposed that this may be because of unnecessary ADC loops in MMMSort. Profiling it however lays the blame on multiTDCSort, as well as std::vector operations but they should be optimised out (I dont know if this is a limitation of profiling, or an issue with the code, I shall look into it). Any changes or proposed techniques for speeding up code welcome.

profile.txt

nichubbard commented 8 years ago

I added O2 to the Makefile which has changed the profile, removing the std::vector stuff. And I think making it a bit faster, however the top functions are as follows:

 26.57      4.06     4.06   885721     0.00     0.02  main_event(EVENT_HEADER*, void*)                                                                                                                         
  1  20.91      7.26     3.20  1771442     0.00     0.00  multiTDC::multiTDCSort(int, int*, float*)                                                                                                                
  2  11.58      9.03     1.77   885721     0.00     0.00  ZeroTTreeVariables()                                                                                                                                     
  3   6.22      9.98     0.95   885721     0.00     0.00  RawData::Init(RawData*)

main_event is probably due to the actual Tree filling, and so not too shocking. I am not sure why ZeroTTreeVariables is so slow. multiTDC sort still appears, then RawData init.

So I think looking into this multiTDCsort may be a good priority for later. profile.txt

nichubbard commented 8 years ago

There seems to be a memory leak somewhere, which is likely the reason it starts quick then grows painfully slow. No idea where it is yet

nichubbard commented 8 years ago

Update: We found a memory leak in adc.c, and it is fixed in 'master' now. This solves the memory leak problem, and the sort doesn't go quite as slow. It's still not super fast and it may be worth looking into other things soon.

padsley commented 8 years ago

For reference:

Without _RAWDATA, _SILICONDATA and _MMM:

/experiment-data/PR254/run03040.mid.gz:284428 events, 18.81s

Using _SILICONDATA and _MMM preprocessor directive but sans _RAWDATA, and using PR254 run 3040:

Time on arthur is /experiment-data/PR254/run03040.mid.gz:284428 events, 100.54s

With _RAWDATA as well:

/experiment-data/PR254/run03040.mid.gz:284428 events, 148.57s

With _RAWDATA and not using _SILICONDATA and __MMM:

/experiment-data/PR254/run03040.mid.gz:284428 events, 65.93s

So... the problem originates in the sorting of the silicon data. This (I assume MMM.c) should be the focus of any efforts on optimisation.

It may also be worth seeing what effect removing TestEvent() from the sort code has, for example.

nichubbard commented 8 years ago

So I have been thinking "this sort is slow" for a while still, and decided to relook into it. The profiles I get at York on npp3 indicate nearly all the time is spent calling deflate() from libz.so, which is the system compression library. I don't think there's much that can be done, but I do find it bizarre that compression seems to be so slow.

padsley commented 8 years ago

In that case, we might want to pronounce death on this issue. At least for now. It might just have to be the case that people analysing data have to bite the bullet and know that it will take hours to sort all of the data. Or they could decompress the files beforehand but that'll take a lot of disk space (and not help anyone sorting directly from /experiment/data).

gauss345 commented 8 years ago

@nichubbard could you maybe send the profile output to me, if you don't mind ?

nichubbard commented 8 years ago

Attached, it's output from callgrind callgrind.out.27249.zip

gauss345 commented 8 years ago

tree_graph

@nichubbard Hi... sorry only replying now... been little busy... I took a look at your profile output, I will generate my own and compare, I checkout PR264 branch , and will run it on some data, if you could maybe give me some run numbers. It's been awhile since I had to produce a picture for a profile :), but attached is a graphic display image of the call graph of your profile output from callgrind. As you can see from the call graph, most of the time is spend in main_event in TTree Branch FIll. I will look at more detail soonest... but have to finish some stuff, is this also what you found? btw, TBranch Writebasket calls deflate. While TTree:Fill is called the exact number of times as function main_event, TBranch::Fill is called 100 times more per call . ROOT files are compressed files, hence all the calls to libz.

padsley / k600analyser

Slow Sorting (PR264) #137