star-bnl / star-sw

Core software for STAR experiment
26 stars 63 forks source link

StEemcRaw::headersAreSick() when running reconstruction #495

Open veprbl opened 1 year ago

veprbl commented 1 year ago

Describe the bug

A symptom of an issue is visible in the logs:

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=43)<=0,  crID=101, skip it

To Reproduce

See #452

Expected behavior

No such messages appear

plexoos commented 1 year ago

Thanks for the report @veprbl

I've compared the corresponding logs from a few test jobs and it appears to me that the warnings you report can be seen in both ROOT5 and ROOT6 jobs. So, it does not look like something specific to ROOT6. The tests jobs I looked at were: 90, 102, 119, and some earlier chains don't show them at all.

Actually, the logs look almost identical for ROOT5 and ROOT6 which is a very good news given the fact that there is lots of numerical information dumped in the files such as number of tracks, vertex positions, etc.

Hopefully, @kkauder, the owner of StEmc* packages, can shed some light on the nature of these warnings.

kkauder commented 1 year ago

Sorry :-/ I was assigned as owner by mistake. I was briefly software coordinator for the BEMC back in 2016 but never had much to do with this code, and certainly not the EEMC

veprbl commented 1 year ago

Indeed, I see messages for both ROOTs using a relatively new testing dataset:

https://github.com/star-bnl/star-sw/blob/28b6c4fe0bf459c3f7f40a7752f0ba825617fff2/tests/joblist.json#L1100-L1108

ROOT5 (test 123)

https://github.com/star-bnl/star-sw/actions/runs/4119348287/jobs/7114791395#step:5:1767

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=6)<=0,  crID=64, skip it

ROOT6 (test 123)

https://github.com/star-bnl/star-sw/actions/runs/4119348287/jobs/7114796210#step:5:1997

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=6)<=0,  crID=64, skip it
gvisser commented 1 year ago

hi guys, This looks like a usual "crate header error" for ESMD. Data from "crates" showing this error may be bad in such runs. It sounds like such data is automatically skipped? That's probably a good thing. You could go back to STAR shift log for the runs in question and see if any issues were remarked on. Hope this comment helps some.

klendathu2k commented 1 year ago

This looks like a usual "crate header error" for ESMD. Data from "crates" showing this error may be bad in such runs. It sounds like such data is automatically skipped?

That is my recollection of what that warning implies. Looking at the code which generates it (StRoot/StEmcRawMaker/StEemcRaw.cxx), the condition is that the size of the header is reported as <=0. The data from that crate is skipped.

fgeurts commented 1 month ago

Hi,

If this is expected behavior (re: @gvisser's comment) for bad/skipped data, can we then close this as a software issue? Please advise.

gvisser commented 1 month ago

Frank, I think probably yes this issue should be closed. I can only suggest someone checks that runs which are known to be good (e.g. at least, did not have suchonline crate header messages when taken) do not make these errors. I hope this whole business has been about certain particular runs. For sure, we have had at times a dead crate or other corruption, sometimes noted in shiftlog though probably sometimes not mentioned. For sure I expect such errors to show in some runs, offline.

veprbl commented 1 month ago

@gvisser Unfortunately your criteria is not met. The issue is seen in a random test I've picked (125):

gvisser commented 1 month ago

ok... If you can help me to understand any way that I can be of assistance in this, I'll certainly try. I don't understand your first two links there, sorry. (Do I get to see the error messages about headers somehow through those links?) I agree from shiftlog that 23072022 looks ok, although sometimes there are problems not noted in the shiftlog of course. Perhaps we should try a new test run we could probably take tomorrow, if ESMD has been left on, I can ask Will about that. With a new run I can look at DAQ logfile in detail and see if any relevant messages there. What do you think, is it worth trying this?

fgeurts commented 1 month ago

I am also confused about what the issue is.

If I understand Gerard correctly, this is what a warning message looks like when a crate has an issue. So, presuming this is the case, what is wrong with using this warning?

Or, is the issue that we know there should be no crate issues, and the software still gives us the message.

In the former case, I propose we close this issue unless there is an objection to using this mechanism to warn about a potential data-integrity/crate issue. In the latter case, it would be good to have EMC experts fix this software bug.

veprbl commented 1 month ago

Actually, looking for the second time. The test above (125) is actually fine, producing singular messages like:

StEmcRawMaker:INFO  - StEemcRaw::headersAreSick() --> Nfibers=54

The suspicious test is (123) produces additional messages:

StEmcRawMaker:INFO  - StEemcRaw::headersAreSick() --> Nfibers=54
StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=6)<=0,  crID=64, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=7)<=0,  crID=65, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=8)<=0,  crID=66, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=9)<=0,  crID=67, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=10)<=0,  crID=68, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=11)<=0,  crID=69, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=12)<=0,  crID=70, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=13)<=0,  crID=71, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=14)<=0,  crID=72, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=15)<=0,  crID=73, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=16)<=0,  crID=74, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=17)<=0,  crID=75, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=18)<=0,  crID=76, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=19)<=0,  crID=77, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=20)<=0,  crID=78, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=21)<=0,  crID=79, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=22)<=0,  crID=80, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=23)<=0,  crID=81, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=24)<=0,  crID=82, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=25)<=0,  crID=83, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=26)<=0,  crID=84, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=27)<=0,  crID=85, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=28)<=0,  crID=86, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=29)<=0,  crID=87, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=30)<=0,  crID=88, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=31)<=0,  crID=89, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=32)<=0,  crID=90, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=33)<=0,  crID=91, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=34)<=0,  crID=92, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=35)<=0,  crID=93, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=36)<=0,  crID=94, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=37)<=0,  crID=95, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=38)<=0,  crID=96, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=39)<=0,  crID=97, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=40)<=0,  crID=98, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=41)<=0,  crID=99, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=42)<=0,  crID=100, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=43)<=0,  crID=101, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=44)<=0,  crID=102, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=45)<=0,  crID=103, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=46)<=0,  crID=104, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=47)<=0,  crID=105, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=48)<=0,  crID=106, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=49)<=0,  crID=107, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=50)<=0,  crID=108, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=51)<=0,  crID=109, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=52)<=0,  crID=110, skip it

StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=53)<=0,  crID=111, skip it

So the correct links would be:

But, again, I don't see issue reported in run logs.

I've clicked the gear button and then "Download log archive" which pulls are logs as text files. Grepping through it:

grep 'StEemcRaw::headersAreSick() sizeHeader' *.txt | cut -d : -f 1 | sort -u

I get that quite a few tests show this issue:

100_test (102, root5, gcc485).txt
101_test (102, root6, gcc485).txt
102_test (103, root5, gcc485).txt
103_test (103, root6, gcc485).txt
104_test (104, root5, gcc485).txt
105_test (104, root6, gcc485).txt
106_test (105, root5, gcc485).txt
107_test (105, root6, gcc485).txt
108_test (106, root5, gcc485).txt
109_test (106, root6, gcc485).txt
110_test (107, root5, gcc485).txt
111_test (107, root6, gcc485).txt
112_test (108, root5, gcc485).txt
113_test (108, root6, gcc485).txt
114_test (109, root5, gcc485).txt
115_test (109, root6, gcc485).txt
116_test (110, root5, gcc485).txt
117_test (110, root6, gcc485).txt
118_test (111, root5, gcc485).txt
119_test (111, root6, gcc485).txt
120_test (112, root5, gcc485).txt
121_test (112, root6, gcc485).txt
122_test (113, root5, gcc485).txt
123_test (113, root6, gcc485).txt
124_test (114, root5, gcc485).txt
125_test (114, root6, gcc485).txt
126_test (115, root5, gcc485).txt
127_test (115, root6, gcc485).txt
128_test (116, root5, gcc485).txt
129_test (116, root6, gcc485).txt
130_test (117, root5, gcc485).txt
131_test (117, root6, gcc485).txt
132_test (118, root5, gcc485).txt
133_test (118, root6, gcc485).txt
134_test (119, root5, gcc485).txt
135_test (119, root6, gcc485).txt
136_test (120, root5, gcc485).txt
137_test (120, root6, gcc485).txt
138_test (121, root5, gcc485).txt
139_test (121, root6, gcc485).txt
142_test (123, root5, gcc485).txt
143_test (123, root6, gcc485).txt
40_test (37, root5, gcc485).txt
41_test (37, root6, gcc485).txt
92_test (90, root5, gcc485).txt
93_test (90, root6, gcc485).txt
94_test (91, root5, gcc485).txt
95_test (91, root6, gcc485).txt
96_test (92, root5, gcc485).txt
97_test (92, root6, gcc485).txt
98_test (101, root5, gcc485).txt
99_test (101, root6, gcc485).txt

Next, if we ask for context on header size for icr=0

grep -B4 'StEemcRaw::headersAreSick() sizeHeader(icr=0)' *.txt

That correlates perfectly with disabled ETOW, with outputs like:

99_test (101, root6, gcc485).txt-2024-03-20T14:02:19.1882279Z StEmcRawMaker:ERROR - ETOW Structure not found
99_test (101, root6, gcc485).txt-2024-03-20T14:02:19.1890687Z StEmcRawMaker:ERROR - ESMD Structure not found
99_test (101, root6, gcc485).txt-2024-03-20T14:02:19.1891721Z StEmcRawMaker:INFO  - StEemcRaw::copyRawData() 0 data bloks copied
99_test (101, root6, gcc485).txt-2024-03-20T14:02:19.1892426Z StEmcRawMaker:INFO  - StEemcRaw::headersAreSick() --> Nfibers=54
99_test (101, root6, gcc485).txt:2024-03-20T14:02:19.1893195Z StEmcRawMaker:WARN  - StEemcRaw::headersAreSick() sizeHeader(icr=0)<=0,  crID=1, skip it

Selecting lines that precede icr=6, but not additional sizeHeader errors:

grep -B4 'StEemcRaw::headersAreSick() sizeHeader(icr=6)' *.txt | grep -v 'sizeHeader(icr' | grep '.\{70\}'

correlates with missing ESMD

97_test (92, root6, gcc485).txt-2024-03-20T15:14:54.6056032Z StEmcRawMaker:ERROR - ESMD Structure not found
97_test (92, root6, gcc485).txt-2024-03-20T15:14:54.6057297Z StEmcRawMaker:INFO  - StEemcRaw::copyRawData() 6 data bloks copied
97_test (92, root6, gcc485).txt-2024-03-20T15:14:54.6058403Z StEmcRawMaker:INFO  - StEemcRaw::headersAreSick() --> Nfibers=54

This makes me conclude that we are just seeing false alarms when detectors were included in the run, but detectors were down. This should be treated in the physics data analysis QA to some extent already. In principle, it would be nice to have those additional errors silenced for the cases when we already know that ESMD or ETOW are not found.

veprbl commented 1 month ago

ok... If you can help me to understand any way that I can be of assistance in this, I'll certainly try. I don't understand your first two links there, sorry. (Do I get to see the error messages about headers somehow through those links?)

Oh, sorry, I could have explained more. For the first link, you need to expand the collapsed logs behind Run TEST_CMD= and look for the error message. The second link is just pointer to a file that has a mapping from test id like (123) to an actual DAQ file and the chain options.

veprbl commented 1 month ago

I'm seeing there are two implementations of the check

https://github.com/star-bnl/star-sw/blob/2e60a9d2fd09be29f0e0b4bab389522814cd9bef/StRoot/StEmcRawMaker/StEemcRaw.cxx#L313 and that one, looks like it has some check for if the bank is in https://github.com/star-bnl/star-sw/blob/2e60a9d2fd09be29f0e0b4bab389522814cd9bef/StRoot/StEmcRawMaker/StEemcRaw.cxx#L329-L340 This was introduced in a5acc063d80b46519d2d9d5ce4335ff75c4297a3

But what we are seeing is a similar function https://github.com/star-bnl/star-sw/blob/2e60a9d2fd09be29f0e0b4bab389522814cd9bef/StRoot/StEmcRawMaker/StEemcRaw.cxx#L144 which originally had something https://github.com/star-bnl/star-sw/blob/2e60a9d2fd09be29f0e0b4bab389522814cd9bef/StRoot/StEmcRawMaker/StEemcRaw.cxx#L159-L170 but it's been commented for ~20 years

I don't know the context on why there are code paths going through the StEmcRawMaker vs StEEMCReader.

klendathu2k commented 1 month ago

Hi Dmitry,

The important thing to note is that these checks were implemented, as you note, O(20) years ago. These are sanity checks on the hardware, to ensure that corrupt events from hardware issues (identified 20 years ago) are not propagated into analysis. My vague recollection is that these hardware issues were present at some level in each run and maybe even in each event.

Do we have any indication from people doing analysis that massive amounts of data are being dropped? If not, my presumption would be that the software is behaving as intended. Filtering out bad data, and being verbose about it.

My $0.02, Jason

On 2024-03-27 00:18, Dmitry Kalinkin wrote:

I'm seeing there are two implementations of the check

https://github.com/star-bnl/star-sw/blob/2e60a9d2fd09be29f0e0b4bab389522814cd9bef/StRoot/StEmcRawMaker/StEemcRaw.cxx#L313 and that one, looks like it has some check for if the bank is in https://github.com/star-bnl/star-sw/blob/2e60a9d2fd09be29f0e0b4bab389522814cd9bef/StRoot/StEmcRawMaker/StEemcRaw.cxx#L329-L340 This was introduced in a5acc06 [1]

But what we are seeing is a similar function https://github.com/star-bnl/star-sw/blob/2e60a9d2fd09be29f0e0b4bab389522814cd9bef/StRoot/StEmcRawMaker/StEemcRaw.cxx#L144 which originally had something https://github.com/star-bnl/star-sw/blob/2e60a9d2fd09be29f0e0b4bab389522814cd9bef/StRoot/StEmcRawMaker/StEemcRaw.cxx#L159-L170 but it's been commented for ~20 years

I don't know the context on why there are code paths going through the StEmcRawMaker vs StEEMCReader.

-- Reply to this email directly, view it on GitHub [2], or unsubscribe [3]. You are receiving this because you commented.Message ID: @.***>

Links:

[1] https://github.com/star-bnl/star-sw/commit/a5acc063d80b46519d2d9d5ce4335ff75c4297a3 [2] https://github.com/star-bnl/star-sw/issues/495#issuecomment-2021895030 [3] https://github.com/notifications/unsubscribe-auth/ANL4LVEQ4PFFSOSIA6ZTOK3Y2JCAPAVCNFSM6AAAAAAUYZ7VR6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRRHA4TKMBTGA