star-bnl / star-sw

Core software for STAR experiment

SL23f crash in StFstRawHitMaker #631

Closed: genevb closed this issue 5 months ago

genevb commented 5 months ago

This doesn't crash for all DAQ files, but it does for some. Here is a crashing chain to try:

starver SL23f
root4star -b -q -l 'bfc.C(10,"DbV20231127 pp2022a StiCA fst ftt fstRawHit fstMuRawHit BEmcChkStat -hitfilt","/star/data03/daq/2021/352/22352002/st_fwd_22352002_raw_1500011.daq")'

I ran in the debugger and got this....

StFstRawHitMaker:WARN  - StFstRawHitMaker::Make() - No raw ADC dataset found from simu data! 
StFstRawHitMaker:WARN  - StFstRawHitMaker::Make() - No fstCollection found in simu dataset! 
 StFstRawHitMaker:INFO  -  Trying to read ALLdata
*** Error in `/afs/rhic.bnl.gov/star/packages/SL23f/.sl73_gcc485/bin/root4star': malloc(): memory corruption: 0x123aff00 ***

....followed by a long backtrace. I'm not sure why it mentions /tmp/smirnovd in here:

(gdb) where
#0  0xf7fdb425 in __kernel_vsyscall ()
#1  0xf4f1e1f7 in raise () from /lib/libc.so.6
#2  0xf4f1fa33 in abort () from /lib/libc.so.6
#3  0xf4f5d5e5 in __libc_message () from /lib/libc.so.6
#4  0xf4f66a03 in _int_malloc () from /lib/libc.so.6
#5  0xf4f6818a in malloc () from /lib/libc.so.6
#6  0xf5113b27 in operator new(unsigned int) () from /lib/libstdc++.so.6
#7  0xf7dd48dd in TStorage::ObjectAlloc (sz=52)
    at /tmp/smirnovd/spack-stage/spack-stage-root-5.34.38-fta7antlmbz65avo4vw6tf7xsbtghfc4/spack-src/core/base/src/TStorage.cxx:325
#8  0x0808df27 in TObject::operator new (sz=52)
    at /cvmfs/star.sdcc.bnl.gov/star-spack/spack/opt/spack/linux-rhel7-x86/gcc-4.8.5/root-5.34.38-fta7antlmbz65avo4vw6tf7xsbtghfc4/include/TObject.h:156
#9  0xeaca52af in StFstRawHitCollection::getRawHit (this=0xf050794, elecId=76) at .sl73_gcc485/obj/StRoot/StFstUtil/StFstRawHitCollection.cxx:120
#10 0xe927ce6c in StFstRawHitMaker::FillRawHitCollectionFromAPVData (this=0xf050460, dataFlag=2 '\002', ntimebin=9, counterAdcPerRgroupPerEvent=0xfffca198,
    sumAdcPerRgroupPerEvent=0xfffca1f8, apvElecId=0, signalUnCorrected=..., signalCorrected=..., seedFlag=..., idTruth=...)
    at .sl73_gcc485/obj/StRoot/StFstRawHitMaker/StFstRawHitMaker.cxx:565

Curiously, gdb can't seem to find the source code unless it is present locally. With the source present locally, I find...

#9  0xeaca52af in StFstRawHitCollection::getRawHit (this=0xf050534, elecId=75) at .sl73_gcc485/obj/StRoot/StFstUtil/StFstRawHitCollection.cxx:120
120       rawHitPtr = new StFstRawHit();

I'll try running in valgrind, but if someone else knows immediately what's wrong, please chime in.

-Gene

genevb commented 5 months ago

I should also note that I'm running in 32-bit mode, and I get the crash whether or not the build is optimized.

I've put a valgrind report here: ~genevb/public/ValgrindReport_StFstRawHitMaker_CrashSL23f.txt

There are some invalid reads just before the crash (search for FATAL in the above report), at StFstRawHitMaker.cxx:551,552,553, and 313.

genevb commented 5 months ago

Run numbers marked BAD (crashing) and good (not crashing) are listed here: ~genevb/public/BADgoodRuns_StFstRawHitMaker_CrashSL23f.txt. There is no overlap between the two sets, and the clear distinction from looking at a bunch of these in the RunLog Browser is that they crash IF AND ONLY IF fst was IN the run.

jdbrice commented 5 months ago

Hi Gene, thanks for this info. I will work on this and also get the FST group on it. Btw, we are working on QA.

genevb commented 5 months ago

I'm not sure this is any additional help, but seeing nothing else reported here, I re-ran valgrind with --leak-check=full to see if that shed any further light. I don't see any other notices in the additional output about StFstRawHitMaker (they're all about FstmGeom and FstmConfig in FSTMGEO). Anyway, here is that output: ~genevb/public/ValgrindReportFull_StFstRawHitMaker_CrashSL23f.txt

techuan-huang commented 5 months ago

Hi Gene and Daniel, thanks for finding this issue and for the information. I have made a pull request to fix this. The crash is due to an inconsistent number of time bins between the data and the code. Changing the corresponding constant fixes the issue.
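
For readers unfamiliar with this class of bug, here is a minimal C++ sketch of how a time-bin mismatch can corrupt memory. It is not the actual STAR code or the change in the pull request: `kMaxTimeBins`, `RawHitSketch`, and `fillHit` are hypothetical names, the capacity of 3 is an arbitrary illustrative value, and it assumes the constant in the code was smaller than the number of time bins in the data (the backtrace above shows `ntimebin=9`).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical compile-time capacity. The real constant is not shown in this
// thread; 3 is chosen only to make the mismatch easy to see.
constexpr std::size_t kMaxTimeBins = 3;

// Hypothetical stand-in for a raw hit that stores one ADC value per time bin.
struct RawHitSketch {
    std::array<int, kMaxTimeBins> adc{};
};

// Copies ADC samples into the hit. Returns false, leaving the hit untouched,
// when the data carries more time bins than the hit can hold.
bool fillHit(RawHitSketch& hit, const std::vector<int>& samples) {
    if (samples.size() > kMaxTimeBins) {
        // Without this check the loop below would write past the end of adc[].
        // That can silently damage heap bookkeeping, so the crash may only
        // show up later, inside an unrelated malloc() or operator new.
        return false;
    }
    for (std::size_t t = 0; t < samples.size(); ++t) {
        hit.adc[t] = samples[t];
    }
    return true;
}
```

With these assumed values, `fillHit` accepts a three-sample vector and rejects a nine-sample one. Making the constant agree with the data removes the mismatch at its source; a bounds check like the one sketched here is only an extra guard, and whether the actual fix includes one is not stated in this thread.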

plexoos commented 5 months ago

resolved by #634