trackreco / mkFit

Vectorized, Parallelized Tracking
https://trackreco.github.io/
Apache License 2.0
17 stars 15 forks source link

Fixing up of SlurpIn #168

Open kmcdermo opened 5 years ago

kmcdermo commented 5 years ago

As reported in various threads over the last weeks, we have seen crashes within the simtrack validation during standard building. The crash would occur with many threads and MEIF, falling down at some "fixed" number of events.

The cause has been tracked down to an abuse of SlurpIn within the BackwardFit leading to undefined behavior. Namely, when computing offsets to read in the tracks to fit (which come from vector of vectors), there is the possibility the offsets become too large. Computing offsets from different allocations brings SlurpIn crashing down.

We have been burned by this before, so this is a call to fix this up for good. So we have a short term and long term plan.

Short term

Since we know this crash is triggered by large amounts of memory in play, we can simply avoid triggering the crash by limiting the number of events processed. Given we have a few open PRs (and more to come), the recommendation for now is to simply change the number of events to process in the validation from 500 to 100.

diff --git a/val_scripts/validation-cmssw-benchmarks.sh b/val_scripts/validation-cmssw-benchmarks.sh
index 595f388..42b2c97 100755
--- a/val_scripts/validation-cmssw-benchmarks.sh
+++ b/val_scripts/validation-cmssw-benchmarks.sh
@@ -17,7 +17,7 @@ source xeon_scripts/init-env.sh
 dir=/data2/slava77/samples/2017/pass-c93773a/initialStep
 subdir=PU70HS/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU
 file=memoryFile.fv3.clean.writeAll.CCC1620.recT.082418-25daeda.bin
-nevents=500
+nevents=100

 ## Common executable setup
 maxth=64

Long Term

@osschar has volunteered to give this a try by designing some interfaces for some helper packer for the Matriplexes to avoid total rewrites of I/O tracks/hits to matriplexes and related routines.

@srlantz and @dan131riley have also agreed to think on this.

srlantz commented 5 years ago

Here is at least an outline of the specific problem, as well as the proposed fix. The goal of SlurpIn is to vgather data from various tracks or hits into a Matriplex. The vgather intrinsic function takes a single base address and a vector of offsets to read (say) 16 floats from 16 different addresses. You can then advance the base pointer and vgather the next 16 floats, then the next 16, etc., storing them sequentially. The resulting data array will then be ordered correctly for a Matriplex. However, these direct vgathers only work for byte offsets <2GB. (In the current code, it's a problem even if intrinsics are not used, because the byte offsets are stored as a vector of ints.)

Proposed fix: rather than directly vgathering data from objects stored at arbitrary memory addresses, do this in two stages. First, copy strips of memory from the desired objects sequentially (such copy operations are vectorizable), and store them into a single flat array big enough to hold (say) 16 such strips. Second, do vgathers as before, but now from the flat array; this flat array will always be limited in size, so there is no risk of having excessively large offsets from the base address. This method was in fact implemented in mtorture a long time ago (in a more limited way), and the favorable performance results from it were presented at CHEP2015. The very simple CopyInContig/Plexify code still exists in the plexify branch of mtorture.

For mkFit, it is likely that some "PackerHelper" class needs to be developed to generalize the process of assembling the strips into the flat array, so that Matriplex itself does not need to worry about whether the strips of data come from Track objects or Hit objects or whatever. Some sort of Plexify method still exists in Matriplex(Sym).h, and this may be able to perform the final step of turning the flat array into a Matriplex (using intrinsics or not, depending on ifdef's).

osschar commented 5 years ago

I started a branch with this: https://github.com/osschar/mictest/tree/packer

And the only commit: https://github.com/osschar/mictest/commit/e4e7af3cdb5a6720ce7227f65415d3d5b31cda41

Please take a look, esp. the Cornellians ... then I have to redo this for Hits ... and Steve needs to provide code for the XyzzPackerPlexify versions.

I plan to template those to Matriplex::value_type ... and use that as the type of m_base and sizeof(T) as scale for vgather. But I need to change this in Matriplex, too, and have AVX2 changes pending there.

osschar commented 5 years ago

I have updated the packer branch with a more general implementation of the SlurpIn packer.

Note that I only use this in MkFinder::BkFitInputTracks() ... as a demo how it would look. If I start changing this everywhere I'll get conflicts when merging with the AVX2 branch.

The best way to view the whole thing is now: https://github.com/cerati/mictest/compare/devel...osschar:packer?expand=1

osschar commented 5 years ago

OK, new version, with changes to all usages of SlurpIn in MkFinder/Fitter. I had to move to a new branch packer-1 as I did rebase onto devel after AVX2 merge.

Compare with devel: https://github.com/cerati/mictest/compare/devel...osschar:packer-1?expand=1

The branch: https://github.com/osschar/mictest/tree/packer-1

dan131riley commented 5 years ago

Hi @osschar

This mostly looks pretty good, I think. Two specific comments and one more general one.

In MkFitter.cc line 209, mhp.Reset(); should be mhp.ResetBase(), since when hi in the outer loop changes, layerHits[hi][hidx] is getting from a different array.

AddInputAt() filling with zeroes means it will copy whatever is at the base address--is that the intended behavior?

On the general design, the one thing I don't really like is having AddInput implicitly set m_base if it is zero. I'd rather see the base address set in the constructor, and get rid of ResetBase(). Then the constructors would look like

   MatriplexPackerSlurpIn(const D* d) :
      m_base (d),
      m_pos  (0)
   {}

   MatriplexErrParPackerSlurpIn(const T& t) :
      MatriplexPackerSlurpIn<D>(t.errArray()),
      m_off_param(t.posArray() - this->m_base)
   {}

and usage (using the aforementioned line 209) would look like

  for (int hi = 0; hi < Nhits; ++hi)
  {
    MatriplexHitPacker mhp(layerHits[hi][0]);

    for (int i = beg; i < end; ++i)
    {
      const int   hidx = tracks[i].getHitIdx(hi);
      const Hit  &hit  = layerHits[hi][hidx];

      HoTArr[hi](i - beg, 0, 0) = tracks[i].getHitOnTrack(hi);
      mhp.AddInput(hit);
    }

    mhp.Pack(msErr[hi], msPar[hi]);
  }
osschar commented 5 years ago

OK, I pushed rebased branch to cerati/packer

Note that I also somewhat fixed the backward fit by checking the offset between track vectors and splicing them up. I say somewhat as I would expect the ptr-diff of 0x7fffffff or at least 0x1fffffff would work ... instead things start to work when diff is limited to 0x1fffff. Go figure. With this change I can now run, with root validation, over all 5000 event in the hih statistics sample:

./mkFit --cmssw-n2seeds --num-thr 64 --num-thr-ev 32 --input-file /data2/slava77/samples/2017/pass-c93773a/initialStep/PU70HS/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU/memoryFile.fv3.clean.writeAll.CCC1620.recT.082418-25daeda.bin --sim-val --try-to-save-sim-info --backward-fit --backward-fit-pca --build-std
srlantz commented 5 years ago

@osschar I've been staring at your new classes in MatriplexPackers.h for quite a while and I think I now have a pretty clear idea of what I want to do in order to implement my alternate packing method. But I have a question about it as well... I'll start by sketching what you've done so far, and then tell you what I'm thinking of adding...

That's a long-winded intro, but anyway, here's my question for you. Where and how do I define the temp array? The problem is, its size will depend partly on the size of T, which can change. But I don't want to keep creating and destroying such an array every time I go to Pack a Matriplex of a given size. I think it would be better to put some kind of singleton in MatriplexPackers.h (I guess as part of my new base class template), making it large enough to hold the fArray member of any Matriplex I am going to Pack at any point in the code. Would this be a good approach? Or do you think it's too hacky?

osschar commented 5 years ago

Yes, you got it right :)

In constructor we pass in T& as the "base" argument. You can use this to extract the size of the target matriplexes (something like base.errMatrix().GetNElements()) ...that's for ErrPar. We can also add a second parameter for the "payload size" to the constructor that gets ignored in the SlurpIn implementation (as this information is then implicit in Matriplexes passed into Pack).

srlantz commented 5 years ago

OK, I'll go ahead and code it up and we should be able to converge from there :-)