I added -O2 to the Makefile, which has changed the profile and removed the std::vector entries. I think it is also a bit faster. The top functions are now as follows:
 % time  cumulative s  self s    calls  self ms/call  total ms/call  name
  26.57          4.06    4.06   885721          0.00           0.02  main_event(EVENT_HEADER*, void*)
  20.91          7.26    3.20  1771442          0.00           0.00  multiTDC::multiTDCSort(int, int*, float*)
  11.58          9.03    1.77   885721          0.00           0.00  ZeroTTreeVariables()
   6.22          9.98    0.95   885721          0.00           0.00  RawData::Init(RawData*)
main_event is probably dominated by the actual TTree filling, so that is not too shocking. I am not sure why ZeroTTreeVariables is so slow. multiTDCSort still appears, followed by RawData::Init.
So I think looking into multiTDCSort may be a good priority for later. profile.txt
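On ZeroTTreeVariables: I haven't checked what it actually does, but if it resets a lot of large fixed-size branch arrays element by element on every event, that alone would explain the time. A minimal sketch of the cheap way to do that kind of reset (the array names here are made up, not the real branch variables):

    #include <cstring>

    // Hypothetical stand-ins for whatever branch buffers ZeroTTreeVariables resets.
    static float adcValue[32][4096];
    static int   tdcHits;

    void ZeroTTreeVariablesSketch()
    {
        // One memset over the whole contiguous block instead of nested per-element loops;
        // all-zero bytes is 0.0f for IEEE floats, so this is safe for float buffers too.
        std::memset(adcValue, 0, sizeof(adcValue));
        tdcHits = 0;
    }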
There seems to be a memory leak somewhere, which is likely the reason it starts quickly and then grows painfully slow. No idea where it is yet.
Update: we found a memory leak in adc.c, and it is fixed on 'master' now. That solves the memory problem, and the sort no longer slows down as badly, but it is still not especially fast, so it may be worth looking into other things soon.
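For anyone hitting the same symptom elsewhere: the usual shape of this kind of bug is a per-event allocation that is never released, so memory grows with the event count and everything slows down once the machine starts paging. Purely illustrative (the actual adc.c leak and fix may have looked different):

    // Illustrative only - not the real adc.c code.
    void ProcessADCEvent(int nWords)
    {
        int* buffer = new int[nWords];   // allocated for every event
        // ... unpack the ADC words into buffer ...
        delete[] buffer;                 // the kind of release that was missing:
                                         // without it, memory grows by nWords*sizeof(int) per event
    }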
For reference:
Without _RAWDATA, _SILICONDATA, and _MMM:
/experiment-data/PR254/run03040.mid.gz:284428 events, 18.81s

With the _SILICONDATA and _MMM preprocessor directives but without _RAWDATA, using PR254 run 3040 (time on arthur):
/experiment-data/PR254/run03040.mid.gz:284428 events, 100.54s

With _RAWDATA as well:
/experiment-data/PR254/run03040.mid.gz:284428 events, 148.57s

With _RAWDATA but without _SILICONDATA and _MMM:
/experiment-data/PR254/run03040.mid.gz:284428 events, 65.93s
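Taking differences between those runs (the same 284428 events each time, and assuming the stages add roughly linearly): the silicon/MMM sort accounts for about 100.54 − 18.81 ≈ 82 s, while _RAWDATA adds about 148.57 − 100.54 ≈ 48 s, consistent with the 65.93 − 18.81 ≈ 47 s seen with _RAWDATA alone.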
So... the problem originates in the sorting of the silicon data. This (I assume MMM.c) should be the focus of any efforts on optimisation.
It may also be worth seeing what effect removing TestEvent() from the sort code has, for example.
I have been thinking "this sort is slow" for a while now, so I decided to look into it again.
The profiles I get at York on npp3 indicate that nearly all the time is spent calling deflate() from libz.so, the system compression library. I don't think there's much that can be done, but I do find it bizarre that compression seems to be so slow.
In that case, we might want to pronounce this issue dead, at least for now. People analysing data may just have to bite the bullet and accept that it will take hours to sort all of the data. Or they could decompress the files beforehand, but that takes a lot of disk space (and doesn't help anyone sorting directly from /experiment/data).
@nichubbard could you maybe send the profile output to me, if you don't mind?
Attached; it's the output from callgrind: callgrind.out.27249.zip
@nichubbard Hi... sorry for only replying now, I've been a little busy. I took a look at your profile output; I will generate my own and compare. I checked out the PR264 branch and will run it on some data, if you could maybe give me some run numbers. It's been a while since I had to produce a picture for a profile :), but attached is a graphical rendering of the call graph from your callgrind output. As you can see from the call graph, most of the time is spent in main_event, in the TTree/TBranch fill. I will look at it in more detail soonest, but I have to finish some other stuff first; is this also what you found? By the way, TBranch::WriteBasket calls deflate. While TTree::Fill is called exactly as many times as main_event, TBranch::Fill is called about 100 times for every one of those calls (presumably once per branch). ROOT files are compressed files, hence all the calls to libz.
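If the deflate() time from writing the output ROOT file really is the wall, one cheap experiment (just a sketch; I'm assuming the analyser creates its output TFile/TTree in one place, and the file and branch names below are made up) would be to weaken or disable the output compression and use larger baskets, trading disk space for sort speed:

    #include "TFile.h"
    #include "TTree.h"

    void MakeOutputFileSketch()
    {
        // 4th TFile constructor argument is the compression setting:
        // 1 = fastest zlib level, 0 = no compression at all (much larger files, no deflate() calls).
        TFile* fout = new TFile("output.root", "RECREATE", "", 1);

        TTree* tree = new TTree("DATA", "sorted data");
        static float tdcTime[128];                         // stand-in for a real branch buffer
        tree->Branch("tdcTime", tdcTime, "tdcTime[128]/F");

        // Larger baskets mean fewer, larger deflate() calls per branch:
        tree->SetBasketSize("*", 256000);

        // ... branches get filled per event elsewhere ...
        fout->Write();                                     // normally done once at the end of the sort
    }

Whether larger ROOT files are an acceptable trade for the analysis is a separate question, of course.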
We've noticed that sorting large runs on the PR264 branch can become very slow (100/s, slower than "live"). It was proposed that this may be because of unnecessary ADC loops in MMMSort. Profiling, however, lays the blame on multiTDCSort, as well as on std::vector operations that should be optimised out (I don't know whether this is a limitation of the profiling or an issue with the code; I shall look into it). Any changes or proposed techniques for speeding up the code are welcome.
profile.txt