I think we should wait for the greenlight in #54 until tomorrow evening, but definitely submit it even if we don't have an explicit greenlight by the end of tomorrow (Jan 12, 2021).
For #54, visually comparing the run 2 reco code to run 1, everything still looks consistent as far as I can see.
However, out of curiosity, I just tried running a test job using the docker (just my 2015 production test, but with the newest changes in the `origin/master` branch locally merged into my `2015_production` branch), and I got an error (I'll copy it below, but I'm not sure it's helpful). When I reset my local `2015_production` branch (to what it is currently at `origin/2015_production`), though, I did not get any errors running the test job using the docker, suggesting some recent changes in the master branch are causing the error (I think).
Perhaps this error is because of something I'm doing, but I thought I'd give you @yipengsun a heads up in case you wanted to submit a test job now just to make sure nothing in the master branch is causing an error for you. If you can successfully run a test, I'd think this comment can be ignored.
Error:
```
physicist@docker-desktop> cd run2-rdx
physicist@docker-desktop> ./run.sh reco_Dst_D0.py conds/cond-std-2015.py
# setting LC_ALL to "C"
# --> Including file '/data/run2-rdx/reco_Dst_D0.py'
Traceback (most recent call last):
  File "/opt/lhcb/lhcb/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/scripts/gaudirun.py", line 547, in <module>
    exec (o, g, l)
  File "<string>", line 1, in <module>
  File "/workspace/build/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/python/GaudiKernel/ProcessJobOptions.py", line 502, in importOptions
  File "/workspace/build/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/python/GaudiKernel/ProcessJobOptions.py", line 470, in _import_python
  File "/data/run2-rdx/reco_Dst_D0.py", line 40, in <module>
    DaVinci().Lumi = not DaVinci().Simulation
  File "/workspace/build/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/python/GaudiKernel/Configurable.py", line 452, in __getattr__
  File "/workspace/build/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/python/GaudiKernel/PropertyProxy.py", line 110, in __get__
AttributeError: Simulation
physicist@docker-desktop> exit
make: *** [Makefile:67: docker-dv] Error 1
```
Ah, this is because I updated the `run.sh` script so that it now only takes a single argument:

```
./run.sh conds/cond-std-2015
```

I think this makes more sense, as we have only one reco script for RDX.
BTW, can you check if the documentation is outdated on this, and update it if needed?
Side note: historically, we had more than one reco script, so there was a need to specify which reco script to use. Now that we've merged all scripts into one, there's no such use case anymore.
Ah OK, thanks. The documentation is updated.
@afernez My "for reference only" ganga job example has a problem: the decay mode for MC was not specified with the `-d` flag, so it'll just proceed with some default value, which is wrong in our case. I've updated my top post for that.
Job for std reconstruction on 2016 MagDown real data submitted to the GRID.
Confusingly, in the bookkeeping, I don't see .dst files for the 2012 MC (MagDown, Sim09a, Pythia8, 11574020 <-> normalization mode) except some with a `noRICHesSim` qualifier; i.e., the path that you get from filling these conditions into the bookkeeping path you already have in the ganga job scripts doesn't seem to exist.

In chapter 2 of the ANA note, it does indeed seem that Phoebe produced some "noRICH" Sim09 MC, but it also seems like this isn't the only MC produced... I'm not sure where the other .dst files might be.
In any case (since I'm posting this too late for anyone to see before tomorrow), I can get the production going using the noRICH files. I'll push my changes to the ganga job submitters with the current `mc-2012` bookkeeping path commented out and the path for these files included, so you know.
Yeah, I think Sim09a in our case is exclusively RICHless, and you can submit the job.
In any case, I think improvements are needed for the ganga job submitter. I see two problems right now: one in `ganga_jobs.py`, one in `ganga_sample_jobs_parser.py`.
I've updated the ganga scripts to hopefully address these 2 problems. I also submitted a 2016 MagUp real data GRID job to make sure it at least works in some cases.
@afernez BTW, the `00054936_00000076_1.dsttaunu.safestriptrig.dst` files don't seem to be available. Did you copy them to `julian` with:

```
git annex copy . --to=julian
git annex sync julian
```
I agree your changes to the ganga scripts make sense. As for the missing dst files, I assumed I was using `git-annex` correctly when I added the files and then committed, and I proceeded to ignore the direction to copy to `julian`. Same mistake for the 2015 production dst files. I'll fix this now, thanks.
This comment should be ignored now
It looks like my production has finished, but there was one subjob that failed, and even though I used `screen` so that ganga/lxplus never exited, when I ran the command `jobs(0).subjobs.select(status='failed').resubmit()` (the job id was 0), ganga seemingly got stuck. Looking at the DIRAC monitoring, the subjob still has its status as failed, too.

I guess on the bright side, my output was downloaded successfully. Ideally in the future I'll be able to follow the normal workflow and just resubmit jobs with a command like the one above, but for now do you think it would be acceptable to just generate an ntuple for the relevant LFN(s) locally, then copy that to lxplus (replacing the empty, failed ntuple) and merge everything together? I can find the relevant LFN from the failed subjob's log file.
To try to address this for the future: Will and Zishuo didn't give me any indication they had a problem exactly like this, so I might email a question to lhcb-distributed-analysis@cern.ch.
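For future reference, a minimal ganga-session sketch of the resubmission workflow above (assuming the standard ganga registry API; the job id and the printout are just for illustration):

```python
# Run inside a ganga session: an expanded form of the one-liner quoted above.
j = jobs(0)                                  # job id 0, as in the comment
failed = j.subjobs.select(status='failed')   # slice of failed subjobs
for sj in failed:
    print(sj.id, sj.status)                  # confirm which subjobs failed
failed.resubmit()                            # may take a while to take effect
```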
Huh, well, I guess I should have been more patient (and should have used `screen` earlier so my connection wasn't broken): after letting ganga run in its seemingly stuck state for a few hours, it gave me a series of error messages and exited, and now it seems to be fine. The subjob resubmissions worked, so the 2012 MC (MagDown, Sim09a, Pythia8, 11574020 <-> normalization mode, noRICH) is finished (upon a quick inspection everything looks good, and the outputs are merged).
The resubmitted subjobs for the 2015 data production are running now. @yipengsun am I right that the correct destination for these ntuples is `run1-rdx/samples` (for 2012 MC) and `run2-rdx/samples` (for 2015 data)?
No, I think for large production ntuples we should put them in `ntuples/0.9.3-production_for_validation`. The `production_for_validation` part is just a name so that after a year we still kind of know what it is; feel free to find a better description.
Also, I'm kind of shocked that the MC production has already finished. Last time I think it took me around a full week (though this also depends on how many jobs are running on the GRID).
More specifically, for our upcoming large production ntuples:

- `ntuples/0.9.3-production_for_validation/Dst_D0-mc`
- `ntuples/0.9.3-production_for_validation/Dst_D0-std`
@afernez I've merged your `2015_production` branch. In the future, you can add the 2015 GRID production to master directly.
I think the MC GRID job finished rather quickly. At first I thought this was weird, but then I checked the sim08a file and it is only ~180 MiB. I guess the "running for 1 week" was for the bare MC cutflow (an edge case, really).
Most of the MagDown subjobs have failed. Yet the MagUp subjobs are running fine. Maybe the GRID is experiencing some problems?
Well, some of the MagUp subjobs have also failed. Here are the error messages from one of the MagUp subjobs:
```
Error in <TNetXNGFile::ReadBuffers>: [ERROR] Operation expired
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:7100010, branch:_Event., entry:100, badread=1, nerrors=1, basketnumber=2
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=2, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=3, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=4, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=5, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=6, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=7, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=8, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=9, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=10, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
```
Here are some possible explanations of the `fKeylen is incorrect` errors. I think it is possible that the GRID node has a corrupted disk. Evidence: `Device or resource busy` in the subjob status.
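For reference, one way to surface that evidence in a ganga session (a hypothetical sketch; the job id is a placeholder, and it assumes the DIRAC backend exposes its status string on `backend.status`):

```python
# Print the DIRAC-side status string of every failed subjob, to spot
# messages like "Device or resource busy".
j = jobs(0)  # placeholder job id
for sj in j.subjobs.select(status='failed'):
    print(sj.id, sj.backend.status)
```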
I'm not sure I could provide any useful suggestion for why so many of your subjobs for these productions failed.
Regarding my 2012 MC finishing in ~1 day instead of ~1 week like the bare MC production you did: the Sim09a files look like they total about 343 GB, so my guess would be pretty much what you said. Maybe the jobs were in the submitting state for a long time waiting to be matched (mine were matched basically instantly), or were just running more slowly because of higher GRID traffic. Or maybe it just took more time to fill each branch because 'bare' doesn't have the stripping line cuts.
By the way, for the actual comparison I'm meant to be doing (our run 1 MC vs Phoebe's), do you happen to know where Phoebe's 2012 MC (MagDown, etc.) ntuple would be located?
It would be the same 15 GB file you used to find the various MC components in the other script, `ref-rdx-run1/Dst-mix/Dst--20_07_02--mix--all--2011-2012--md-mu--phoebe.root`. Though I'm not sure if that one would allow you to separate Sim09.

Perhaps a better option is to use Phoebe's step 1 ntuples. We wouldn't be able to compare variables like `mmiss` or `q2`, but the rest should be fine: https://cernbox.cern.ch/index.php/apps/files/?dir=/__myshares/TUPLES%20(id%3A263208)&
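As an illustration of the kind of plot-based comparison meant here, a minimal uproot/matplotlib sketch; the file, tree, and branch names below are placeholders, not taken from the actual ntuples:

```python
# Hypothetical overlay of one branch from our new MC ntuple vs Phoebe's
# reference ntuple, normalized to unit area for a shape comparison.
import uproot
import matplotlib.pyplot as plt

ours = uproot.open("our-mc-2012-md.root")["TupleB0/DecayTree"]  # placeholder
ref = uproot.open("phoebe-step1.root")["DecayTree"]             # placeholder

branch = "dst_PT"  # placeholder branch present in both trees
plt.hist(ours[branch].array(library="np"), bins=50, histtype="step",
         density=True, label="ours")
plt.hist(ref[branch].array(library="np"), bins=50, histtype="step",
         density=True, label="Phoebe")
plt.xlabel(branch)
plt.legend()
plt.savefig(branch + "-comparison.png")
```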
"This directory is unavailable", when I use the link you provided, maybe there's some permission problems? Also, I think we can add Phoebe's sim09a MC ntuples to ntuples/ref-rdx-run1/Dst-mc
and name it as Dst--21_01_18--mc--Bd2DstMuNu--2012--md--py8-sim09a-phoebe.root
2016 MagUp has finished but has one corrupt root file (discovered by `hadd`). Need to resubmit that subjob.
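For future productions, a small PyROOT sketch of a pre-merge sanity check that could catch such a file before `hadd` does (assumes ROOT's Python bindings; the file names are placeholders):

```python
# Flag ntuples that ROOT cannot open cleanly, so the corresponding subjob
# can be resubmitted before merging.
import ROOT

def is_healthy(path):
    f = ROOT.TFile.Open(path)
    ok = bool(f) and not f.IsZombie() and not f.TestBit(ROOT.TFile.kRecovered)
    if f:
        f.Close()
    return ok

for ntp in ["subjob_0.root", "subjob_1.root"]:  # placeholder output files
    if not is_healthy(ntp):
        print("corrupt or unreadable, resubmit its subjob:", ntp)
```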
@afernez Thanks for annexing both 2012 MC MagDown sim09a and 2015 data MagDown!
Still, there's one more step for the naming of the ntuples (the doc was unclear about this; I just updated it). Could you follow the updated 5th step in this section of the doc? Note that I also changed the output of `ganga_sample_jobs_parser.py`, so please do a `git pull` before you start. Thanks again.
Thanks for checking this. The names should be correct now.
I've submitted a job to the GRID for 2012 MC normalization (B -> D* mu nu), Sim08e MagDown Pythia8. It's running now and will probably finish sometime tomorrow.
I've updated the known run 2 MC IDs here. Out of the 25 samples in run 1, 14 modes are listed in the production numbers (74233, 74234) given by Svende.
I've submitted the MagDown of all 14 of them. @Svende could you give me some info on how to find the missing ones (the ones with `?` prefixed in their MC IDs)?
FYI @manuelfs
Those are the other request IDs for the rest of the FullSim samples, 74509 and 74494; you can find all the information here: https://its.cern.ch/jira/browse/LHCBGAUSS-2153, or for the other request, https://its.cern.ch/jira/browse/LHCBGAUSS-2146. Let me know if you need any other information.
Thanks for updating the remaining MC IDs for me! I just did a double check, and I think I agree with your changes (I'm unclear about some of the D** modes, so I'll trust your judgment).
BTW, do you agree with the IDs that I updated?
Sure, no problem! I just found a typo of mine and fixed it; yours look fine too.
The production for run 2 full sim MC MagDown has finished. I'm copying these ntuples to Julian and they should be available in ~6 hrs.
I could start making some plots to look at these samples. Right now, though, when I try to download the ntuples (after pulling and syncing), `git annex get <.root file>` fails and says the file isn't available. Are they still being copied to julian?
They have been copied to Julian, but I forgot to do another `git annex sync`. They should be available now.
Conclusion for the 2016 data production:

Note that for the MagUp ganga job, its status was stuck in "completing" for a long time and I had to frequently do `job[n].backend.reset()` until its status finally changed to "completed".
I'll fix the 2016 MagUp real data at a later stage. Consider all ntuples needed for run 1/2 data and MC comparison produced.
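For the record, a hypothetical ganga-session loop for that nudge (the job id is a placeholder and the polling interval is arbitrary):

```python
# Keep resetting the backend of a job stuck in "completing" until ganga
# picks up the final state from DIRAC.
import time

n = 0        # placeholder job id
j = jobs(n)
while j.status == 'completing':
    j.backend.reset()  # re-query the DIRAC backend, as described above
    time.sleep(600)    # check again in 10 minutes
print(j.status)
```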
We should submit the following 2 jobs:

- I suggest that we write down the invocation of `ganga_jobs.py` in a script for archival purposes, for example to submit MC (for reference only; see the sketch after this list).
- We should test that the `reco_Dst_D0.py` + cond file combo produces an ntuple that contains events locally with our DaVinci docker before proceeding.
- Also, I propose that we store our scripts in a `jobs` folder inside `run1-rdx` and `run2-rdx`. The naming of these scripts can be cavalier, but I suggest the following convention: for MC, I'd suggest `sim09-normalization-md`; for data, `std-2016-md`.
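A purely hypothetical sketch of such an archival submission script: apart from the `-d` decay-mode flag mentioned earlier in the thread, every argument, path, and file name below is a placeholder:

```python
#!/usr/bin/env python
# Hypothetical archival wrapper recording one MC submission, e.g. saved as
# jobs/sim09-normalization-md. All names besides the -d flag are guesses.
import subprocess

subprocess.run([
    'ganga', 'ganga_jobs.py',
    'reco_Dst_D0.py',                       # placeholder reco script argument
    'conds/cond-mc-2012-md-py8-sim09a.py',  # placeholder cond file
    '-d', 'Bd2DstMuNu',                     # decay mode, per the -d flag above
], check=True)
```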