trackreco / mkFit

Vectorized, Parallelized Tracking
https://trackreco.github.io/
Apache License 2.0

segfaults and slowness for some build options #346

Open · srlantz opened this issue 3 years ago

srlantz commented 3 years ago

In V3.1.1+pr344, for num-thr > 16, the following build methods generally segfault after a non-deterministic number of events: CE, STD, BH. In other words, essentially all of the earlier, non-MIMI build methods are currently not functioning correctly.

Also, for num-thr <= 16, even though segfaults are mostly (though not always) avoided, the original build methods CE, STD, and BH are slower than --build-mimi by 1-2 orders of magnitude. With --num-thr 16, for example, --build-mimi --num-iters-cmssw 1 has an event loop time of 4.2 sec (or 3.2 sec with icc AVX-512), but --build-ce runs in 147 sec (or 342 sec with icc AVX-512), and --build-bh runs in 107 sec.

Finally, --build-mimi --num-iters-cmssw 1 (just initialStep), which is the closest currently working equivalent of --build-ce from (e.g.) stable-2020, does not segfault and runs at a speed more consistent with past code versions, but it takes about 70% longer than before. This is not an entirely apples-to-apples comparison, because the new input files include more fields and are much bigger (among other factors).

What I did to build, after setting up the shell with source xeon_scripts/init-env.sh:

make -j 32 AVX2:=1 CXX:=g++

What I did to start a typical test run (always omitting --remove-dup):

./mkFit/mkFit --input-file /data2/slava77/samples/2021/11834.0_TTbar_14TeV+2021/AVE_50_BX01_25ns/memoryFile.fv5.default.210729-d1dc487.bin --cmssw-n2seeds --num-events 640 --num-thr 16 --num-thr-ev 16 --build-XXX

Tests were mostly conducted on lnx7188, but similar behavior was observed on phi3.

Addendum 8/30/21: when compiling with icc 2021.3 and AVX2, the V3.1.1+pr344 version with --build-mimi --num-iters-cmssw 1 is similarly 90% slower than the stable-2020 version with --build-ce. Times on lnx7188 for a PU50 test with 6400 events, using the input file appropriate to each version, were 13.57 sec. vs. 7.13 sec. For comparison, the times for the same two mkFit versions when compiled with gcc 8.3.1 were 18.65 vs. 10.74 sec. (All tests were run twice in succession to ensure the input file was cached in RAM.)

mmasciov commented 3 years ago

Thanks, @srlantz!

A few comments (from the MIC meeting, August 27):

mmasciov commented 3 years ago

As for the difference between icc and gcc: it was noticed that their performance diverged, most likely due to vectorization: http://uaf-10.t2.ucsd.edu/~mmasciov/MkFit_devs/MTV-PR338_ttbarPU50_gcc-vs-AVX2-vs-AVX2NoVec/plots_building_initialStep/hitsLayers.pdf This was already the case as of PR #340.

At the time of https://github.com/trackreco/mkFit/releases/tag/V3.0.0-0%2Bpr315, the difference looked larger: http://uaf-10.t2.ucsd.edu/~mmasciov/MkFit_multiIteration_forPR/plots_500evts_lessIters_all/plots_building_initialStep/hitsLayers.pdf

As the difference after PR #340 is concentrated in the transition regions, the suspicion is that it is related to the hit selection windows.

As of PR #344, a floor is also used in the transition regions. I have not tested the difference after that PR within CMSSW.

From standalone validation in devel (PR #345), no difference is observed when comparing hit multiplicities across different compilers (reference [black] = make -j 32 AVX2:=1; other [red] = "CXX=g++ VEC_GCC='-msse3'"): http://uaf-10.t2.ucsd.edu/~mmasciov/MkFit_devs/SVA_devel_iccAVX2-vs-CXXgpp_VEC_GCC-msse3/quality/?match=nHits*fit*

dan131riley commented 3 years ago

I started looking at segfaults, and the first one I stumbled on is at https://github.com/trackreco/mkFit/blob/cb7089b8565f021561ef4e918139d5f1ff57b971/Geoms/CMS-2017.cc#L263-L266 where

(gdb) p hot
$1 = {
  index = -9,
  layer = 4
}

which according to our docs means

Idx == -9 : Reference track contained a hit that has since been filtered out via CCC before written to memory file. -9 hits are NOT stored in layerHits_.

I'm assuming this only affects standalone mkFit reading memory files, so maybe it isn't a priority, but are there suggestions for an appropriate fix? Should we trim off the -9 hits at the end of the track?
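
For concreteness, a minimal sketch of what that trimming could look like, assuming only the two fields visible in the gdb dump above (the struct and helper below are illustrative stand-ins, not the actual mkFit interface):

#include <vector>

// Mirrors the two fields visible in the gdb dump; the real mkFit type may differ.
struct HitOnTrack {
  int index;  // -9 means the hit was filtered out via CCC and is not stored in layerHits_
  int layer;
};

// Drop trailing hits whose index is -9 so downstream code never tries to look them up.
// Hypothetical helper for illustration only.
inline void trimFilteredHits(std::vector<HitOnTrack>& hots) {
  while (!hots.empty() && hots.back().index == -9)
    hots.pop_back();
}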

srlantz commented 3 years ago

It seems the real problem here is that we now have two types of input files: those containing various types of seeds, suitable for multiple iterations, and those containing seeds suitable just for initialStep. The older build methods, when given a file of the new, expanded type, end up doing pointless calculations on seeds that only make sense for iterations beyond initialStep. The result is "garbage in, garbage out".

That being the case, perhaps there should be a quick check of the input file prior to building, to ensure that its contents are compatible with the chosen build option. I don't think the solution is to always assume that the input is of the multiple-iteration type, because (at least for CE) this case is already covered by --build-mimi --num-iters-cmssw 1.
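
As a rough sketch of what such a pre-build check could look like (Seed, its algo field, and the function name are stand-ins, not the actual mkFit types or interface):

#include <cstdio>
#include <cstdlib>
#include <set>
#include <vector>

struct Seed { int algo; };  // stand-in for the per-seed iteration/algorithm ID

// Refuse to run a single-iteration build method on a multi-iteration seed file.
// Illustrative only; a real check would live next to the file-reading code.
void checkSeedFileCompatibility(const std::vector<Seed>& seeds, bool singleIterationBuild) {
  std::set<int> algos;
  for (const auto& s : seeds) algos.insert(s.algo);
  if (singleIterationBuild && algos.size() > 1) {
    std::fprintf(stderr, "Input file contains seeds for %zu iterations; "
                         "use --build-mimi or a single-iteration input file.\n",
                 algos.size());
    std::exit(EXIT_FAILURE);
  }
}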

That there are two types of files is evident in buildtestMPlex.cc: runBuildingTestPlexCloneEngine (as well as BestHit and Standard) has a builder.PrepareSeeds() call, whereas runBtpCe_MultiIter lacks one and instead has a loop that checks the algo of each seed before pushing it onto the list of seeds for the current iteration. As it stands, this entails running through the entire list of seeds in ev.seedTracks_ on every iteration, which doesn't seem terribly efficient, especially considering that even a single trip through the loop doesn't scale when using multiple threads per event. A better approach would be (as mentioned in the code comments) to partition the seeds and store the starting and ending indices of each iteration's sublist, as sketched below.
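
For illustration, one possible shape for that partitioning, done once per event (Seed and its algo field are stand-ins for the real seed-track class; this is a sketch, not the project's code):

#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct Seed { int algo; /* ... other seed payload ... */ };

// Sort the seeds once by iteration algorithm and record a [begin, end) index
// range per algorithm, so each iteration can walk only its own sublist instead
// of rescanning the whole seed collection.
std::map<int, std::pair<std::size_t, std::size_t>> partitionSeedsByAlgo(std::vector<Seed>& seeds) {
  std::stable_sort(seeds.begin(), seeds.end(),
                   [](const Seed& a, const Seed& b) { return a.algo < b.algo; });
  std::map<int, std::pair<std::size_t, std::size_t>> ranges;
  std::size_t begin = 0;
  for (std::size_t i = 1; i <= seeds.size(); ++i) {
    if (i == seeds.size() || seeds[i].algo != seeds[begin].algo) {
      ranges[seeds[begin].algo] = {begin, i};
      begin = i;
    }
  }
  return ranges;
}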

srlantz commented 3 years ago

I used gdb to determine that gcc's -ffast-math causes mkFit to crash in propagateHelixToZMPlex, specifically in the (inlined) call to applyMaterialEffects, where the optimizer apparently generates an out-of-bounds reference when accessing hitsXi.ConstAt(n,0,0) (also inlined, from Matriplex.h).

The crash can be sidestepped by preceding propagateHelixToZMPlex with #pragma GCC optimize ("no-fast-math"), or by giving the function one of these attributes:

void __attribute__ ((optimize("no-fast-math"))) propagateHelixToZMPlex(const MPlex LS &inErr,…
void __attribute__ ((optimize("no-inline"))) propagateHelixToZMPlex(const MPlex LS &inErr,…

Leaving aside the question of why the optimizer screws up the inlining, I ran performance tests in which all the rest of the code was compiled with -ffast-math to see if that option, with and without -funroll-all-loops, would run any faster and possibly use vectorized trig functions from libmvec. Unfortunately, the observed speedup was only a few percent. Furthermore, searching for the symbol names beginning with ZGV (true of all vectorized trig functions in libmvec) turned up nothing.

By contrast, the same types of searches on icc-compiled code (again applying either objdump or strings to the object files) turned up plenty of references to svml, i.e., the vectorized trig functions from Intel SVML. Moreover, the icc-compiled code ran about 30% faster in my quick tests with 6400 events (TTbar PU50). The icc version was 2021.3, the latest on lnx7188.

I did the above performance tests on lnx7188 with gcc 8, 9, and 10 in both CentOS 7 and 8 (the latter in a Singularity container). The best marginal performance gain came with gcc 10 in a CentOS 8 container obtained from cvmfs.