star-bnl / star-sw

Core software for STAR experiment
26 stars 62 forks source link

StTpcHitMaker::maxHits array over-written #690

Open genevb opened 1 month ago

genevb commented 1 month ago

We have memory corruption when running the TPC offline cluster finder on the 130th event (eventID=2663503) of this file: /star/data03/daq/2024/175/25175022/st_physics_adc_25175022_raw_2300009.daq

The issue is reproducible in DEV by executing:

root4star -b -q -l 'bfc.C(130,130,"pp2024a,TpxRaw,TpxClu","star/data03/daq/2024/175/25175022/st_physics_adc_25175022_raw_2300009.daq")'

The issue shows up in the log file beginning with this line:

StTpcHitMaker:ERROR - Too many hits (0) in one sector (22). Skipping event.

This message is written because of something that happens at line 1137 of StTpcHitMaker... https://github.com/star-bnl/star-sw/blob/9811528b557897def0e070a5a3ca9279fea46f46/StRoot/StTpcHitMaker/StTpcHitMaker.cxx#L1137 ...which is...

    IDTs[tb] = 65535;

...where we write into the IDTs array past its end. It is sized for 512 unsigned shorts, but in this event, we are seeing tb surpass 512 when reading through the DAQ data for TPX sector 22. The value of tb is assigned just a few lines earlier from DAQ data:

for (;iword != DaqDta()->end();++iword) {
  daq_adc_tb &daqadc = (*(daq_adc_tb *)*iword);
  Int_t tb   = daqadc.tb;

And in StTpcHitMaker.h, the relevant data members are...

#ifdef __TFG__VERSION__
  Int_t    IDTs[512];
#else
  UShort_t IDTs[512];
#endif
  UShort_t fId; // current cluster Id
  Int_t    maxHits[24];

So as tb reaches 512, the assignment first overwrites fId, and at 513, it begins to overwrite maxHits, turning it into negative values, which then trigger the above message of "Too many hits".

I ran the job in valgrind, and found another data member in the DAQ TPC code introduced last year ("DAQ5k") that is evaluated before it is initialized (see PR #688 for the previous one)... https://github.com/star-bnl/star-sw/blob/9811528b557897def0e070a5a3ca9279fea46f46/StRoot/RTS/src/DAQ_TPC23/tpc23_base.h#L167 ...and https://github.com/star-bnl/star-sw/blob/9811528b557897def0e070a5a3ca9279fea46f46/StRoot/RTS/src/DAQ_TPC23/tpc23_base.cxx#L86 ...however, testing that fix locally did not resolve the "Too many hits" issue of this ticket. And nothing else in valgrind is obvious to me as a cause.

I think it remains an open question whether the tb value greater than 511 is an issue within the raw data, or a consequence of some other memory issue within the DAQ5k code. My impression is that we do not want to simply increase the size of the IDTs array to resolve this, as tb probably should never exceed 511. I will keep trying to understand, but I welcome anyone else taking a look at this too.

I assigned several names of people to this issue, with the intent that they be at the very least observers, if not helpers in resolving the issue.

-Gene

p.s. Credit to the Offline QA folks (Wenyun Bo and Lanny Ray) for spotting the impacts of this and reporting it.

genevb commented 1 month ago

A scan through FastOffline logs reveals that the issue was seen in 11 files over the past 2 weeks (older log files have been erased):

st_physics_adc_25169006_raw_0900009 st_physics_adc_25174042_raw_1000008 st_physics_adc_25174042_raw_2300008 st_physics_adc_25175022_raw_0300009 st_physics_adc_25175022_raw_2300009 st_physics_adc_25175027_raw_2100011 st_physics_adc_25175070_raw_1400006 st_physics_adc_25175073_raw_0500010 st_physics_adc_25175073_raw_1600009 st_physics_adc_25176012_raw_1600009 st_physics_adc_25176014_raw_1500009

DEV has not been modified since June 19th (day 171), which doesn't match with either the single occurrence on day 169, nor the increased rate of occurrences beginning on day 174.

-Gene

genevb commented 4 weeks ago

I see 3 more files to add to the list today to bring my total to 14.

For 8 of the files, I grabbed the DAQ files and processed to generate the following table...

File ; event-in-file ; sector ; row ; largest TB of the DAQ TPC pixel (ADC) data ...for where TB exceeds 511 (event,sector,row all start at 1, not 0) (NB: inner sectors normally end at TB 420, and outer sectors at TB 414)

st_physics_adc_25175022_raw_2300009 130 22 20 559 st_physics_adc_25175073_raw_0500010 248 7 37 544 st_physics_adc_25175073_raw_1600009 242 7 40 574 st_physics_adc_25176012_raw_1600009 9 9 27 524 st_physics_adc_25176014_raw_1500009 37 14 11 523 st_physics_adc_25176048_raw_0000008 1390 24 29 548 st_physics_adc_25177010_raw_0800009 45 3 1 541 st_physics_adc_25177010_raw_1100008 97 3 2 523

(NB: these are all iTPC rows <= 40)

There is a pair in run 25175073, sector 7, that may be from the same FEE as each other? I'm not sure as they are in rows 37 and 40.

There is another pair in run 25177010, sector 3, from padrows 1 and 2, perhaps also the same FEE as each other?

Otherwise, there isn't an obvious pattern to the sector and row.

But these pairings provide a second bit of non-definitive evidence that this is something in the TPC pixel data. The other such evidence being 100% reproducibility (we saw reproducibility below 20% when dealing with an uninitialized variable in the software a few weeks ago).

-Gene

fvidebaek commented 4 weeks ago

Important to understand it is reproducible. Would be useful to understand -- to see if it comes from similar FEE need to see the pad# too. -- are this largest timebin isolated i.e previous tb much earlier or consecutive. I am thinking bit error. -- we should involve Tonko to understand if his code expects to have 0<=tb<=511, and if there is any special handling for outside rage ?

On 2024-06-25 15:09, Gene Van Buren wrote:

I see 3 more files to add to the list today to bring my total to 14.

For 8 of the files, I grabbed the DAQ files and processed to generate the following table...

File ; event-in-file ; sector ; row ; largest TB of the DAQ TPC pixel (ADC) data ...for where TB exceeds 511 (event,sector,row all start at 1, not 0) (NB: inner sectors normally end at TB 420, and outer sectors at TB 414)

st_physics_adc_25175022_raw_2300009 130 22 20 559 st_physics_adc_25175073_raw_0500010 248 7 37 544 st_physics_adc_25175073_raw_1600009 242 7 40 574 st_physics_adc_25176012_raw_1600009 9 9 27 524 st_physics_adc_25176014_raw_1500009 37 14 11 523 st_physics_adc_25176048_raw_0000008 1390 24 29 548 st_physics_adc_25177010_raw_0800009 45 3 1 541 st_physics_adc_25177010_raw_1100008 97 3 2 523

(NB: these are all iTPC rows <= 40)

There is a pair in run 25175073, sector 7, that may be from the same FEE as each other? I'm not sure as they are in rows 37 and 40.

There is another pair in run 25177010, sector 3, from padrows 1 and 2, perhaps also the same FEE as each other?

Otherwise, there isn't an obvious pattern to the sector and row.

But these pairings provide a second bit of non-definitive evidence that this is something in the TPC pixel data. The other such evidence being 100% reproducibility (we saw reproducibility below 20% when dealing with an uninitialized variable in the software a few weeks ago).

-Gene

-- Reply to this email directly, view it on GitHub [1], or unsubscribe [2]. You are receiving this because you were assigned.Message ID: @.***>

Links:

[1] https://github.com/star-bnl/star-sw/issues/690#issuecomment-2189776323 [2] https://github.com/notifications/unsubscribe-auth/AUIALBBAV3QETGV4NGKCE5DZJG55PAVCNFSM6AAAAABJ25XXWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZG43TMMZSGM

-- Flemming Videbaek senior scientist, emeritus videbaek @ bnl.gov Brookhaven National Lab Physics Department Bldg 510D Upton, NY 11973

phone: 631-344-4106 cell : 631-681-1596

tonko-lj commented 4 weeks ago

Should be fixed. It's committed to CVS.

What happens is that an occasional ASIC channel gives nonsense. I normally check for it but I guess I didn't protect it well enough when running in Offline.

On Tue, Jun 25, 2024 at 9:37 PM fvidebaek @.***> wrote:

Important to understand it is reproducible. Would be useful to understand -- to see if it comes from similar FEE need to see the pad# too. -- are this largest timebin isolated i.e previous tb much earlier or consecutive. I am thinking bit error. -- we should involve Tonko to understand if his code expects to have 0<=tb<=511, and if there is any special handling for outside rage ?

On 2024-06-25 15:09, Gene Van Buren wrote:

I see 3 more files to add to the list today to bring my total to 14.

For 8 of the files, I grabbed the DAQ files and processed to generate the following table...

File ; event-in-file ; sector ; row ; largest TB of the DAQ TPC pixel (ADC) data ...for where TB exceeds 511 (event,sector,row all start at 1, not 0) (NB: inner sectors normally end at TB 420, and outer sectors at TB 414)

st_physics_adc_25175022_raw_2300009 130 22 20 559 st_physics_adc_25175073_raw_0500010 248 7 37 544 st_physics_adc_25175073_raw_1600009 242 7 40 574 st_physics_adc_25176012_raw_1600009 9 9 27 524 st_physics_adc_25176014_raw_1500009 37 14 11 523 st_physics_adc_25176048_raw_0000008 1390 24 29 548 st_physics_adc_25177010_raw_0800009 45 3 1 541 st_physics_adc_25177010_raw_1100008 97 3 2 523

(NB: these are all iTPC rows <= 40)

There is a pair in run 25175073, sector 7, that may be from the same FEE as each other? I'm not sure as they are in rows 37 and 40.

There is another pair in run 25177010, sector 3, from padrows 1 and 2, perhaps also the same FEE as each other?

Otherwise, there isn't an obvious pattern to the sector and row.

But these pairings provide a second bit of non-definitive evidence that this is something in the TPC pixel data. The other such evidence being 100% reproducibility (we saw reproducibility below 20% when dealing with an uninitialized variable in the software a few weeks ago).

-Gene

-- Reply to this email directly, view it on GitHub [1], or unsubscribe [2]. You are receiving this because you were assigned.Message ID: @.***>

Links:

[1] https://github.com/star-bnl/star-sw/issues/690#issuecomment-2189776323 [2]

https://github.com/notifications/unsubscribe-auth/AUIALBBAV3QETGV4NGKCE5DZJG55PAVCNFSM6AAAAABJ25XXWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZG43TMMZSGM

-- Flemming Videbaek senior scientist, emeritus videbaek @ bnl.gov Brookhaven National Lab Physics Department Bldg 510D Upton, NY 11973

phone: 631-344-4106 cell : 631-681-1596

— Reply to this email directly, view it on GitHub https://github.com/star-bnl/star-sw/issues/690#issuecomment-2189829103, or unsubscribe https://github.com/notifications/unsubscribe-auth/ATFNJ4ES3M7WLERSFTLFWH3ZJHBILAVCNFSM6AAAAABJ25XXWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZHAZDSMJQGM . You are receiving this because you were assigned.Message ID: @.***>

genevb commented 3 weeks ago

Jeff has included Tonko's fixes for this issue in PR #688 , so I expect this issue will be solved when that PR is merged.

plexoos commented 3 weeks ago

Are you suggesting merging #688 now?

Is Akio fine with reverting his previous modification (https://github.com/star-bnl/star-sw/pull/688/files#r1651407708)?

Is the only change needed to fix this issue the one in StRoot/RTS/src/DAQ_TPC23/itpc23.cxx in this commit (https://github.com/star-bnl/star-sw/pull/688/commits/314f90a7a11ac53f959482a8ff923b344c4b4d9d)?

genevb commented 3 weeks ago

Are you suggesting merging #688 now?

No....I think Akio's concern is appropriate.

Is the only change needed to fix this issue the one in StRoot/RTS/src/DAQ_TPC23/itpc23.cxx in this commit

I think we should move forward with all of the DAQ_TPC23 changes in that PR.