star-bnl / star-sw

Core software for STAR experiment
26 stars 63 forks source link

Fix misconfigured alignment correction in #358, change default start time in eTOF #372

Closed PhilippWeidenkaff closed 1 year ago

PhilippWeidenkaff commented 1 year ago

Additional notes:

This patch (PR #372) fixes a bug missed while introducing misalignment parameters for ETOF counters in PR #156 The code was reading a wrong number of entries from the database thus disabling the correction effect for all channels

starsdong commented 1 year ago

Philipp, to be clear, is this PR needed for the Run20 FXT picoDst reproduction? If yes, we need to include this to the SL22b library then.

PhilippWeidenkaff commented 1 year ago

Yes, without it, etof time-of-flights will be shifted slightly.

Best Regards Philipp

June 27, 2022 7:12 PM, "Xin Dong" @. @*.**@*.***>)> wrote: Philipp, to be clear, is this PR needed for the Run20 FXT picoDst reproduction? If yes, we need to include this to the SL22b library then.

Reply to this email directly, view it on GitHub (https://github.com/star-bnl/star-sw/pull/372#issuecomment-1167618540), or unsubscribe (https://github.com/notifications/unsubscribe-auth/AIXRR2R47JZLZOZ7IRDAAULVRHOH3ANCNFSM5Z6K52NQ). You are receiving this because you authored the thread.Message ID: @.***>

plexoos commented 1 year ago

The title and the description of this PR need to be updated to reflect the primary reason for the fix. I see a quite severe bug fixed when a wrong number of alignment values is copied from the DB table into an internal container mAlignmentParameters. The newly introduced check for the number of expected parameters in mAlignmentParameters is too late to really check if "no alignment database table is found". While the table is now pretty much guaranteed to be available in the database it is still a good idea to follow the suggestion in https://github.com/star-bnl/star-sw/pull/358#issuecomment-1164903415 and check for valid dbDataSet

plexoos commented 1 year ago

I am ok with the fixes but is it possible to see an evidence demonstrating the problems being fixed?

PhilippWeidenkaff commented 1 year ago

Hi Dmitry,

I suppose you are asking about the geometry changes, right? The crash due to the wrongly used etofConstants::nCounters is evidently fixed as the automatic test that found it in the first place is now succeding.

I also tested the changes with regards to the error handling if there is no database entry found. To test this, i replaced the dbDataSet initialization with the line below:

TDataSet* dbDataSet = StMaker::GetChain()->GetDataBase("Geometry/etof/etofAlignblablubb");

This dataset obviously can't exist, so it has to throw an error now. What happens is what you see below:

StETofMatchMaker:INFO - TGeoManager::CloseGeometry : ----------------modeler ready---------------- StETofMatchMaker:ERROR - Initialization of StEtofGeometry failed StETofMatchMaker:INFO - StETofMatchMaker::Make(): made it to A StETofMatchMaker:INFO - number of ETOF hits: 6 StETofMatchMaker:ERROR - ETOF volume for moduleId 15 and counter 1 is not loaded ... StETofMatchMaker:ERROR - ETOF volume of a hit is not loaded in the geometry StETofMatchMaker:ERROR - ETOF volume for moduleId 15 and counter 1 is not loaded ... StETofMatchMaker:FATAL - TUnixSystem::DispatchSignals : segmentation violation Root > Function bfcRunnerVanilla() busy flag cleared The first log error line that you can see comes from the code lines below in StEtofMatchMaker::InitRun():

if( mETofGeom && !mETofGeom->isInitDone() ) { //if initialization attempt above failed. LOG_ERROR << "Initialization of StEtofGeometry failed" << endm; return kStErr; }

So, the LOG_ERROR is executed and the method returns kStErr. My understanding so far was that StMaker::InitRun() stops execution of the chain or skips the file if kStErr is returned. Apparently, that is not the case, the maker runs on and then predictably crashes. I tried to replace kStErr with kStFatal, but it leads to the same result. Is there another way i should handle this?

Best Regards Philipp June 30, 2022 3:27 PM, "Dmitri Smirnov" @. @*.**@*.***>)> wrote: I am ok with the fixes but is it possible to see an evidence demonstrating the problems being fixed?

Reply to this email directly, view it on GitHub (https://github.com/star-bnl/star-sw/pull/372#issuecomment-1171220219), or unsubscribe (https://github.com/notifications/unsubscribe-auth/AIXRR2ULHWKXJHAVATGYHVDVRWOFLANCNFSM5Z6K52NQ). You are receiving this because you authored the thread.Message ID: @.***>

plexoos commented 1 year ago

I suppose you are asking about the geometry changes, right?

Yes, my concern is about this fix:

    float tempZ;

-    for( int iCounter = 0; iCounter < eTofConst::nCounters; iCounter++){
+    for( int iCounter = 0; iCounter < eTofConst::nCounters * eTofConst::nModules; iCounter++){
        tempX = table[iCounter].detectorAlignX[0];
        tempY = table[iCounter].detectorAlignY[0];
        tempZ = table[iCounter].detectorAlignZ[0];
        StThreeVectorD counterAlignmentParameter = StThreeVectorD( tempX, tempY, tempZ );
        mAlignmentParameters.push_back( counterAlignmentParameter );
    }

The crash due to the wrongly used etofConstants::nCounters is evidently fixed as the automatic test that found it in the first place is now succeding.

That is not entirely true. The automatic tests in CI never failed with or without the fix. The nightly tests at SDCC failed as reported by Gene here https://github.com/star-bnl/star-sw/pull/358#issuecomment-1164881409 but since then there was no problem and the log file looks good /star/rcf/test/dev/daq_sl302.ittf/Thu/year_2020/production_11p5GeV_2020/st_physics_20351078_raw_5000001.log So, whatever "fixed" the problem reported by Gene was not your fix because the code on the farm still does not include it, i.e. we have not merged this PR yet.

The fix looks reasonable though as the loop now goes to 108 as expected rather than to 3 but we'd learn more from this unfortunately overlooked bug if we see the effect in the results. I assume you already ran your tests to verify that this time everything works as expected. Perhaps you already have some before and after plots.

PhilippWeidenkaff commented 1 year ago

Hi Dmitri,

I submitted this PR to fix the start time default in the Matchmaker ( i can show you some plots before/after for this). At the same time, Gene informed me about two crashes after merging PR # 358. One of those crashes (on a file from 2019) was due to a missing db table for the alignment. This bug was fixed by uploading a database entry at an very early timestamp, so the table should now always be available. Just to be sure, i also added the changes to the geometry that you see here to make it "crash controlled" when no alignment table is available. The result of this is what i showed you in the last mail.

While doing this, i also used the opportunity to replace all hardcoded "108"s in StEtofGeometry.cxx by eTofConst::nCounters (Which was wrong). I this check from the file (line 638ff in the repositiory version). if( mAlignmentParameters.size() == 108 ){ //alignment parameters for all counters available mETofModule[ mNValidModules-1 ]->addCounter( gpNode, dx, dy, moduleId, counterId, safetyMargins, mAlignmentParameters.at(iCounterAlignment) ); iCounterAlignment++; } else { mETofModule[ mNValidModules-1 ]->addCounter( gpNode, dx, dy, moduleId, counterId, safetyMargins ); }

With this check in, the repository version will just continue without any alignments if the vector has the wrong size and does not crash. Removing it caused the automatic tests of the pull request to fail and i fixed it by the next commit. Now this automatic test of the pull request works.

This means the aligment never worked in the repository version. It confuses me a bit, because i definitely remember testing that. I think what happend is that i tested it with a hardcoded 108 and then replaced that with the wrong eTofConst::nCounters before commiting in order to make it look nicer... :(

Anyways, attached is an example how the matching deviations look like with both versions. The second crash that Gene informed me of was on a file from this year without any etof data. Here, MatchMaker crashed during ::Finish() with a bus error shortly after StMuDstMaker throw a kFatal:

StMuDstMaker:FATAL - TUnixSystem::DispatchSignals : bus error .... .... StETofMatchMaker:FATAL - TUnixSystem::DispatchSignals : bus error

To be honest, i'm unsure what that even means or where it comes from. I'm open for suggestions from anyone.

Best Regards Philipp

plexoos commented 1 year ago

attached is an example how the matching deviations look like with both versions.

Unfortunately, attachments don't work when you reply to GitHub's emails. Instead, you can insert a link to your file or drag and drop images in the comment area when you use the web interface. Also, can you be more specific which versions you compare? My expectation is that you compare the head of the main branch in star-bnl with this PR branch containing all the changes, but please confirm. I think if your tests show that the alignment is now activated and works for the code in this PR then we are done.

Regarding the "bus error", I checked some recent log files in /star/rcf/test/dev/ but didn't find anything related. It does not seem reproducible so, I wouldn't worry about it now.

PhilippWeidenkaff commented 1 year ago

STAR-04-07-22-Alignment.pdf

plexoos commented 1 year ago

Thank you, Philipp for this perfect illustration of misalignment correction. I'll assume that all of the counters are affected in a similar manner.

PhilippWeidenkaff commented 1 year ago

Hello Dmitry,

The intention behind this pull request was exactly what is stated in the description. Without the default start time change, all eTOF times will be shifted by 150ps compared to their correct value.

I added the extra check for the parameter size when Gene informed me that there is a crash with an empty table. Right now, the readAlignmentDatabase() method returns without filling the vector if no data structure is found, so the vector will have the wrong size in this case. This will cause the StEtofGeometry::Init() method to return without setting the IsInitDone() flag, which will then cause a kFatal in the ::Init() of the MatchMaker. Following your suggestion, I will also add a similar direct return to the readAlignmentDatabase() method if dbDataSet is not a valid pointer. I'm not sure if there is a more direct way to stop the chain from a non-maker class, which i think is what should happen if called database entries are missing.

The bug fix in the latest commit was due to a bug introduced in a supposed cosmetic change (remember that the vector size check always compared to a hard-coded 108 before?). I missremembered what EtofConstants::nCounters is supposed to mean (it is counters per module = 3 instead of all counters in system = 108) so i had to fix this after the automatic tests failed.

Best Regards Philipp

June 28, 2022 8:56 PM, "Dmitri Smirnov" @. @*.**@*.***>)> wrote: The title and the description of this PR need to be updated to reflect the primary reason for the fix. I see a quite severe bug fixed when a wrong number of alignment values is copied from the DB table into an internal container mAlignmentParameters. The newly introduced check for the number of expected parameters in mAlignmentParameters is too late to really check if "no alignment database table is found". While the table is now pretty much guaranteed to be available in the database it is still a good idea to follow the suggestion in #358 (comment) (https://github.com/star-bnl/star-sw/pull/358#issuecomment-1164903415) and check for valid dbDataSet

Reply to this email directly, view it on GitHub (https://github.com/star-bnl/star-sw/pull/372#issuecomment-1169106246), or unsubscribe (https://github.com/notifications/unsubscribe-auth/AIXRR2WMSBN2CWTBTC6CY3TVRNDHNANCNFSM5Z6K52NQ). You are receiving this because you authored the thread.Message ID: @.***>

genevb commented 1 year ago

Just a follow-up note that many classes that read from the database are not prepared for handling the case where there is no valid (acceptable) database entry. Instead, for those, there is an entry in the database which serves as a default. Such an entry has both beginTime and entryTime (falsified) to be 1996-01-01, so that all data and all real DbV values would accept that default entry as a candidate. Often times such an entry's contents are just zeros.