umd-lhcb / lhcb-ntuples-gen

ntuples generation with DaVinci and in-house offline components
BSD 2-Clause "Simplified" License

GRID jobs for facilitating run 1/2 comparisons #55

Closed yipengsun closed 3 years ago

yipengsun commented 3 years ago

We should submit the following 2 jobs:

  1. @afernez: 2012 MC MagDown Sim09 normalization
  2. @yipengsun: 2016 Data MagDown

I suggest that we write down the invocation of ganga_jobs.py in a script for archival purposes. For example, to submit MC (for reference only):

#!/usr/bin/env bash

../../scripts/ganga/ganga_sample_jobs_parser.py ../reco_Dst_D0.py ../conds/cond-mc-2012-md-sim09a.py -p md -d Bd2DstMuNu

We should test that the reco_Dst_D0.py + cond file combo produces an ntuple that contains events locally with our DaVinci docker before proceeding.

Also, I propose that we store our scripts in a jobs folder inside run1-rdx and run2-rdx. The naming of these scripts can be cavalier, but I suggest the following convention:

run1-rdx/jobs/YY_MM_DD-<description>.sh

For the MC description, I'd suggest sim09-normalization-md; for data, std-2016-md.
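For instance, the 2016 data submission script might then look like this (a sketch only; the date in the filename and the cond file name cond-std-2016.py are illustrative guesses):

#!/usr/bin/env bash
# run2-rdx/jobs/21_01_12-std-2016-md.sh (hypothetical name following the convention above)

../../scripts/ganga/ganga_sample_jobs_parser.py ../reco_Dst_D0.py ../conds/cond-std-2016.py -p md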

yipengsun commented 3 years ago

I think we should wait for the green light in #54 until tomorrow evening, but definitely submit even if we don't have an explicit green light by the end of tomorrow (Jan 12, 2021).

afernez commented 3 years ago

For #54, visually comparing the run 2 reco code to run 1, everything still looks consistent as far as I can see. However, out of curiosity, I just tried running a test job using the docker (just my 2015 production test, but with the newest changes from origin/master merged into my local 2015_production branch), and I got an error (copied below, though I'm not sure it's helpful). When I reset my local 2015_production branch to what is currently at origin/2015_production, I did not get any errors running the test job with the docker, which suggests some recent change on master is causing the error. It could also be something I'm doing, but I wanted to give you a heads up, @yipengsun, in case you want to submit a test job now to make sure nothing on master breaks for you. If you can successfully run a test, this comment can be ignored.

Error:

physicist@docker-desktop> cd run2-rdx
physicist@docker-desktop> ./run.sh reco_Dst_D0.py conds/cond-std-2015.py
# setting LC_ALL to "C"
# --> Including file '/data/run2-rdx/reco_Dst_D0.py'
Traceback (most recent call last):
  File "/opt/lhcb/lhcb/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/scripts/gaudirun.py", line 547, in <module>
    exec (o, g, l)
  File "<string>", line 1, in <module>
  File "/workspace/build/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/python/GaudiKernel/ProcessJobOptions.py", line 502, in importOptions
  File "/workspace/build/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/python/GaudiKernel/ProcessJobOptions.py", line 470, in _import_python
  File "/data/run2-rdx/reco_Dst_D0.py", line 40, in <module>
    DaVinci().Lumi = not DaVinci().Simulation
  File "/workspace/build/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/python/GaudiKernel/Configurable.py", line 452, in __getattr__
  File "/workspace/build/GAUDI/GAUDI_v33r0/InstallArea/x86_64-centos7-gcc9-opt/python/GaudiKernel/PropertyProxy.py", line 110, in __get__
AttributeError: Simulation
physicist@docker-desktop> exit
make: *** [Makefile:67: docker-dv] Error 1
yipengsun commented 3 years ago

Ah, this is because I updated the run.sh script so that it now takes only a single argument:

./run.sh conds/cond-std-2015

I think this makes more sense, as we have only one reco script for RDX.
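For reference, a full local test with the docker now looks roughly like this (a sketch, assuming the docker-dv make target from the Makefile and the updated single-argument run.sh, mirroring the example above):

make docker-dv
physicist@docker-desktop> cd run2-rdx
physicist@docker-desktop> ./run.sh conds/cond-std-2015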

BTW, can you check if the documentation is outdated on this, and update it if needed?

yipengsun commented 3 years ago

Side note: Historically, we had more than one reco script, so there was a need to specify which one to use. Now that we've merged them all into one, there's no such use case anymore.

afernez commented 3 years ago

Ah ok, thanks. Documentation is updated.

yipengsun commented 3 years ago

@afernez My "for reference only" ganga job example had a problem: the decay mode for MC was not specified with the -d flag, so it would just proceed with some default value, which is wrong in our case. I've updated my top post accordingly.

yipengsun commented 3 years ago

Job for std reconstruction on 2016 MagDown real data submitted to the GRID.

afernez commented 3 years ago

Confusingly, in the bookkeeping I don't see .dst files for the 2012 MC (MagDown, Sim09a, Pythia8, 11574020 = normalization mode) except some with a noRICHesSim qualifier; i.e. the path you get from filling these conditions into the bookkeeping path already in the ganga job scripts doesn't seem to exist. Chapter 2 of the ANA note does indicate that Phoebe produced some "noRICH" Sim09 MC, but it also seems like this isn't the only MC produced... I'm not sure where the other .dst files might be.

In any case (since I'm posting this too late for anyone to see before tomorrow), I can get the production going using the noRICH files. Just so you know, I'll push my changes to the ganga job submitters with the current mc-2012 bookkeeping path commented out and the path to these files included.

yipengsun commented 3 years ago

Yeah, I think Sim09a in our case is exclusively RICHless, and you can submit the job.

In any case, I think the ganga job submitter needs improvements. I see two problems right now:

  1. We are maintaining two separate sources of truth: one in ganga_jobs.py and one in ganga_sample_jobs_parser.py.
  2. There's only one LFN for MC. I guess it's easier for the RICHless sample to have its own LFN.

yipengsun commented 3 years ago

I've updated the ganga scripts to hopefully address these two problems. I've also submitted a 2016 MagUp real data GRID job to make sure it works at least in some cases.

yipengsun commented 3 years ago

@afernez BTW, the 00054936_00000076_1.dsttaunu.safestriptrig.dst files don't seem to be available. Did you copy them to julian with:

git annex copy . --to=julian
git annex sync julian
afernez commented 3 years ago

I agree your changes to the ganga scripts make sense. As for the missing .dst files: I assumed I was using git-annex correctly when I added and committed the files, and I skipped the step of copying to julian. Same mistake for the 2015 production .dst files. I'll fix this now, thanks.
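For the record, the full sequence is roughly (a sketch with standard git-annex commands; the file name and commit message are illustrative):

git annex add <ntuple>.root   # stage the large file with git-annex
git commit -m "Add production ntuples"
git annex copy . --to=julian  # actually transfer the annexed contents to julian
git annex sync julian         # propagate the metadata so others can 'git annex get'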

afernez commented 3 years ago

This comment should be ignored now

It looks like my production has finished, but one subjob failed. Even though I used screen so that ganga/lxplus never exited, when I ran jobs(0).subjobs.select(status='failed').resubmit() (the job id was 0), ganga seemingly got stuck. Looking at the DIRAC monitoring, the subjob still shows a failed status, too.

On the bright side, my output was downloaded successfully. Ideally in the future I'll be able to follow the normal workflow and just resubmit jobs with a command like the one above, but for now do you think it would be acceptable to generate an ntuple for the relevant LFN(s) locally, then copy it to lxplus (replacing the empty, failed ntuple) and merge everything together? I can find the relevant LFN from the failed subjob's log file.

To try to address this for the future: Will and Zishuo didn't give me any indication that they'd hit a problem exactly like this, so I might email a question to lhcb-distributed-analysis@cern.ch.

afernez commented 3 years ago

Huh, well, I guess I should have been more patient (and should have used screen earlier so my connection wasn't broken): after letting ganga run in its seemingly stuck state for a few hours, it printed a series of error messages and exited, and now it seems fine. The subjob resubmissions worked, so the 2012 MC (MagDown, Sim09a, Pythia8, 11574020 = normalization mode, noRICH) is finished. Upon a quick inspection everything looks good, and the outputs are merged.

The resubmitted subjobs for the 2015 data production are running now. @yipengsun am I right that the correct destination for these ntuples is in run1-rdx/samples (for 2012 MC) and run2-rdx/samples (for 2015 data)?

yipengsun commented 3 years ago

No. I think for large production ntuples, we should put them in

ntuples/0.9.3-production_for_validation

The production_for_validation part is just a name so that after a year we still more or less know what it is; feel free to find a better description.
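So roughly (a sketch; the file names are illustrative and follow our usual Dst--... pattern):

mkdir -p ntuples/0.9.3-production_for_validation
mv Dst--*.root ntuples/0.9.3-production_for_validation/
git annex add ntuples/0.9.3-production_for_validation
git commit -m "Add 0.9.3 production ntuples for validation"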

yipengsun commented 3 years ago

Also, I'm kind of shocked that the MC production has already finished. Last time I think it took me around a full week (though this also depends on how many jobs are running on the GRID).

yipengsun commented 3 years ago

More specifically, for our upcoming large production ntuples:

yipengsun commented 3 years ago

@afernez I've merged your 2015_production branch. In the future, you can add the 2015 GRID production to master directly.

yipengsun commented 3 years ago

I think the MC GRID job finished rather quickly. At first I thought this was weird, but then I checked the sim08a file and it's only ~180 MiB. I guess the "running for 1 week" was for the bare MC cutflow (an edge case, really).

yipengsun commented 3 years ago

Most of the MagDown subjobs have failed. Yet the MagUp subjobs are running fine. Maybe the GRID is experiencing some problems?

yipengsun commented 3 years ago

Well, some of the MagUp subjobs have also failed. Here are the error messages from one of the MagUp subjobs:

Error in <TNetXNGFile::ReadBuffers>: [ERROR] Operation expired
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:7100010, branch:_Event., entry:100, badread=1, nerrors=1, basketnumber=2
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=2, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=3, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=4, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=5, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=6, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=7, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=8, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=9, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBranchElement::GetBasket>: File: root://xrootd.echo.stfc.ac.uk/lhcb:prod/lhcb/LHCb/Collision16/SEMILEPTONIC.DST/00070444/0001/00070444_00011111_1.semileptonic.dst at byte:3891117133, branch:_Event., entry:600, badread=1, nerrors=10, basketnumber=12
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-6998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1964164215) ; trying to recover by setting it to zero
yipengsun commented 3 years ago

Here are some possible explanations of the incorrect fKeylen errors. I think it's possible that the GRID node has a corrupted disk.

Evidence: a "Device or resource busy" message in the subjob status.

afernez commented 3 years ago

I'm not sure I could provide any useful suggestion for why so many of your subjobs for these productions failed.

Regarding my 2012 MC finishing in ~1 day instead of ~1 week like the bare MC production you did: the sim09a files look like they total about 343 GB, so my guess is pretty much what you said. Maybe your jobs sat in the submitting state for a long time waiting to be matched (mine were matched basically instantly), or were just running more slowly because of higher GRID traffic. Or maybe it just took more time to fill each branch because 'bare' doesn't have the stripping line cuts.

afernez commented 3 years ago

By the way, for the actual comparison I'm meant to be doing (our run 1 MC vs Phoebe's), do you happen to know where Phoebe's 2012 MC (magdown, etc) ntuple would be located?

manuelfs commented 3 years ago

It would be the same 15 GB file you used to find the various MC components in the other script, ref-rdx-run1/Dst-mix/Dst--20_07_02--mix--all--2011-2012--md-mu--phoebe.root. Though I'm not sure that one would allow you to separate Sim09.

Perhaps a better option is to use Phoebe's step 1 ntuples. We wouldn't be able to compare variables like mmiss or q2, but the rest should be fine: https://cernbox.cern.ch/index.php/apps/files/?dir=/__myshares/TUPLES%20(id%3A263208)&

yipengsun commented 3 years ago

"This directory is unavailable", when I use the link you provided, maybe there's some permission problems? Also, I think we can add Phoebe's sim09a MC ntuples to ntuples/ref-rdx-run1/Dst-mc and name it as Dst--21_01_18--mc--Bd2DstMuNu--2012--md--py8-sim09a-phoebe.root

yipengsun commented 3 years ago

2016 MagUp has finished, but there is one corrupt ROOT file (discovered by hadd). I need to resubmit that subjob.
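As a side note, a quick way to spot such files before merging is to check that ROOT can at least open each one, e.g. with rootls (just a sketch; this only catches files that ROOT cannot open at all):

for f in *.root; do
    rootls "$f" > /dev/null || echo "possibly corrupt: $f"  # rootls exits non-zero if the file can't be opened
done
hadd merged.root *.root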

yipengsun commented 3 years ago

@afernez Thanks for annexing both 2012 MC MagDown sim09a and 2015 data MagDown!

Still, there's one more step for the naming of the ntuples (the doc was unclear about this; I just updated it).

Could you follow the updated 5th step in this section of the doc? Note that I also changed the output of ganga_sample_jobs_parser.py, so please do a git pull before you start.

Thanks again.

afernez commented 3 years ago

Thanks for checking this. The names should be correct now.

afernez commented 3 years ago

I've submitted a job to the GRID for 2012 MC normalization (B -> D* mu nu), sim08e magdown pythia8. It's running now and will probably finish some time tomorrow.

yipengsun commented 3 years ago

I've updated the KNOWN run 2 MC IDs here. Out of the 25 samples in run 1, 14 modes are listed in the production numbers (74233, 74234) given by Svende.

I've submitted the MagDown jobs for all 14 of them. @Svende, could you give me some info on how to find the missing ones (the ones with ? prefixed to their MC IDs)?

FYI @manuelfs

Svende commented 3 years ago

Those are the other request IDs for the rest of the FullSim samples, 74509 and 74494; you can find all the information here: https://its.cern.ch/jira/browse/LHCBGAUSS-2153, or for the other request, https://its.cern.ch/jira/browse/LHCBGAUSS-2146. Let me know if you need any other information.

yipengsun commented 3 years ago

Thanks for updating the remaining MC IDs for me! I just double-checked, and I agree with your changes (I'm unclear about some of the D** modes, so I'll trust your judgement).

BTW, do you agree with the IDs that I updated?

Svende commented 3 years ago

Sure, no problem! I just found a typo of mine and fixed it; yours look fine too.

yipengsun commented 3 years ago

The production for the run 2 full sim MC MagDown has finished. I'm copying these ntuples to Julian, and they should be available in ~6 hrs.

afernez commented 3 years ago

I could start making some plots to look at these samples. Right now, though, when I try to download the ntuples (after pulling and syncing), git annex get <.root file> fails and says the file isn't available. Are they still being copied to julian?

yipengsun commented 3 years ago

They have been copied to Julian, but I forgot to do another git annex sync. They should be available now.

yipengsun commented 3 years ago

Conclusion for 2016 data production:

Note that for the MagUp ganga job, its status was stuck in "completing" for a long time, and I had to repeatedly run job[n].backend.reset() until the status finally changed to "completed".

yipengsun commented 3 years ago

I'll fix the 2016 MagUp real data at a later stage. Consider all ntuples needed for the run 1/2 data and MC comparison produced.