umd-lhcb / lhcb-ntuples-gen

ntuples generation with DaVinci and in-house offline components
BSD 2-Clause "Simplified" License

v0.9.6 GRID ntuple productions #95

Closed · yipengsun closed this issue 2 years ago

yipengsun commented 2 years ago

Here we list all GRID ntuple productions for v0.9.6.

General production plan for 0.9.6

The main idea is: In this version, we produce ALL required ntuples for 2016, for both polarities.

We'll use the produced ntuples to fully set up a fit for year 2016 and make sure these templates work (in the sense of good convergence); at the same time, we'll do cut optimization to see which cuts can be embedded in DaVinci directly.

Note: As a part of cut optimization, we need to validate that optimized cuts work with our current fitter (plus some minimal changes, if needed).

After all these steps, we'll produce 2017 and 2018 ntuples in the next version 0.9.7.

Divide the production among Alex, Manuel, and Yipeng

| # | Sample | Name | MC ID | TOTAL [M] | 2015 [M] | 2016 [M] | 2017 [M] | 2018 [M] |
|---|--------|------|-------|-----------|----------|----------|----------|----------|
| 1 | D0 | B- → D0 μ ν | 12573012 | 161.14 | 7.90 | 45.44 | 47.37 | 60.43 |
| 2 | D0/D*+ | B0 → D*+ μ ν | 11574021 | 274.86 | 13.47 | 77.51 | 80.81 | 103.07 |
| 3 | D0 | B- → D*0 μ ν | 12773410 | 452.40 | 22.17 | 127.58 | 133.00 | 169.65 |
| 4 | D0 | B- → D0 τ ν | 12573001 | 11.08 | 0.54 | 3.13 | 3.26 | 4.16 |
| 5 | D0/D*+ | B0 → D*+ τ ν | 11574011 | 60.63 | 2.97 | 17.10 | 17.82 | 22.74 |
| 6 | D0 | B- → D*0 τ ν | 12773400 | 35.07 | 1.72 | 9.89 | 10.31 | 13.15 |
| 7 | D0/D*+ | B0 → D**+ μ ν | 11874430 | 154.35 | 7.56 | 43.53 | 45.38 | 57.88 |
| 8 | D0/D*+ | B0 → D**+ τ ν | 11874440 | 1.20 | 0.06 | 0.34 | 0.35 | 0.45 |
| 9 | D0/D*+ | B- → D**0 μ ν | 12873450 | 126.78 | 6.21 | 35.75 | 37.27 | 47.54 |
| 10 | D0/D*+ | B- → D**0 τ ν | 12873460 | 1.80 | 0.09 | 0.51 | 0.53 | 0.68 |
| 11 | D0 | B- → D**(→D0ππ) μ ν | 12675011 | 22.21 | 1.09 | 6.26 | 6.53 | 8.33 |
| 12 | D0 | B0 → D**(→D0ππ) μ ν | 11674401 | 24.54 | 1.20 | 6.92 | 7.22 | 9.20 |
| 13 | D0/D*+ | B- → D*(→D+ππ) μ ν | 12675402 | 15.87 | 0.78 | 4.48 | 4.67 | 5.95 |
| 14 | D0/D*+ | B0 → D*(→D+ππ) μ ν | 11676012 | 16.24 | 0.80 | 4.58 | 4.77 | 6.09 |
| 15 | D0 | B- → D*(→D0ππ) μ ν | 12875440 | 26.62 | 1.30 | 7.51 | 7.83 | 9.98 |
| 16 | D0 | Bs → Ds**(→D0K) μ ν | 13874020 | 5.48 | 0.27 | 1.55 | 1.61 | 2.06 |
| 17 | D*+ | Bs → D**+ μ ν | 13674000 | 5.04 | 0.25 | 1.42 | 1.48 | 1.89 |
| 18 | D0 | B0 → D0(Xc → μ νX')X | 11894600 | 125.90 | 6.17 | 35.50 | 37.01 | 47.21 |
| 19 | D0 | B0 → D0(Ds → τν)X | 11894200 | 3.46 | 0.17 | 0.97 | 1.02 | 1.30 |
| 20 | D0 | B+ → D0(Xc → μ νX')X | 12893600 | 75.81 | 3.71 | 21.38 | 22.29 | 28.43 |
| 21 | D0 | B+ → D0(Ds → τν)X | 12893610 | 8.87 | 0.43 | 2.50 | 2.61 | 3.33 |
| 22 | D*+ | B0 → D*+ (Xc → μ ν X')X | 11894610 | 44.62 | 2.19 | 12.58 | 13.12 | 16.73 |
| 23 | D*+ | B0 → D*+(Ds → τ ν) X | 11894210 | 4.12 | 0.20 | 1.16 | 1.21 | 1.54 |
| 24 | D*+ | B+ → D*+ (Xc → μ ν X')X | 12895400 | 18.03 | 0.88 | 5.09 | 5.30 | 6.76 |
| 25 | D*+ | B+ → D*+(Ds → τ ν) X | 12895000 | 3.00 | 0.15 | 0.85 | 0.88 | 1.13 |

From Yipeng's experience, 11574021 is about 200 GB for each polarity.

A naive estimate of the on-disk size per million events: 5.16 GB / 1M evt.

Note: Below, the sizes are for BOTH polarities, BEFORE local branch removal.
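
As a rough cross-check of these numbers, here is a minimal Python sketch (assuming the naive 5.16 GB per million 2016 events applies to both polarities combined, before any skimming):

```python
# Sketch: estimate the pre-skimming DaVinci ntuple size for a 2016 sample,
# assuming the naive figure of 5.16 GB per million events (both polarities).
GB_PER_M_EVT = 5.16

def est_size_gb(events_millions: float) -> float:
    """Rough on-disk ntuple size in GB for both polarities, before skimming."""
    return events_millions * GB_PER_M_EVT

# Example: 11574021 (B0 -> D*+ mu nu) has 77.51 M events in 2016.
print(f"11574021: ~{est_size_gb(77.51):.0f} GB")  # ~400 GB, i.e. ~200 GB per polarity
```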

Production for Alex:

Production for Manuel:

Production for Yipeng:

yipengsun commented 2 years ago

I tried to skim the TO MagDown normalization sample without merging the files. Before skimming, the total size is ~197 GB; after, it's ~122 GB.

yipengsun commented 2 years ago

The commit message above should say SIGNAL instead of normalization, but let's not rewrite history for a typo :-P

yipengsun commented 2 years ago

@manuelfs @afernez I've updated a preliminary plan for ntuple production. The main idea is to keep each person's share under 1 TB.

Note that the production plan is blocked until #99 is resolved, as that marks the point at which the full chain, from job submission all the way to ntuple merging on the server, has been validated at least once.

manuelfs commented 2 years ago

I tried sending jobs, but got this error when entering ganga

|22:44:42|lxplus789:~$ ganga

*** Welcome to Ganga ***
Version: 8.5.7
Documentation and support: http://cern.ch/ganga
Type help() or help('index') for online help.

This is free software (GPL), and you are welcome to redistribute it
under certain conditions; type license() for details.

INFO     reading config file /cvmfs/ganga.cern.ch/Ganga/install/8.5.7/lib/python3.8/site-packages/ganga/GangaLHCb/LHCb.ini
INFO     reading config file /cvmfs/lhcb.cern.ch/lib/GangaConfig/config/8-0-0/GangaLHCb.ini
2022/02/03 22:44:58 ERROR: Unauthorized 401 - do you have authentication tokens?
Error "/usr/bin/myschedd.sh |": command terminated with exit code 256
Configuration Error Line 0 while reading config source /usr/bin/myschedd.sh |

I emailed lhcb-distributed-analysis@cern.ch, and they suggested I look into my .bashrc. I removed these two suspicious lines:

export KRB5_CONFIG=/etc/krb5.conf
export KRB5CCNAME=FILE:/var/tmp/krb5_cc_cache

and the error went away!

I then submitted jobs for 11874050, 11874070, 12874010, and 12874030 with this script, and they seem to go through.
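
For reference, a minimal sketch of what such a ganga submission can look like. This is NOT the actual script (which lives in run2-rdx/jobs); the build directory and splitter settings below are placeholders, and only the option files and bookkeeping path are taken from the log further down:

```python
# Illustrative sketch only, not the real submission script.
bk_path = ('/MC/2016/Beam6500GeV-2016-MagDown-TrackerOnly-Nu1.6-25ns-Pythia8/'
           'Sim09k/Reco16/Filtered/11874050/D0TAUNU.SAFESTRIPTRIG.DST')

j = Job(name='Dst_D0--22_02_04--mc--tracker_only')
j.application = GaudiExec(directory='./DaVinciDev')  # placeholder DaVinci build dir
j.application.options = ['reco_Dst_D0.py',
                         'conds/cond-mc-2016-md-sim09k-tracker_only.py']
j.backend = Dirac()
j.inputdata = BKQuery(bk_path).getDataset()
j.splitter = SplitByFiles(filesPerJob=5)             # arbitrary choice here
j.submit()
```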

Ganga In [6]: jobs
Ganga Out [6]: 
Registry Slice: jobs (11 objects)
--------------
    fqid |    status |      name | subjobs |    application |        backend |                             backend.actualCE |                       comment |  subjob status 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       0 |       new |           |         |     Executable |      Localhost |                                              |                               |          0 / 0 
       1 |    failed |First gang |         |      GaudiExec |          Dirac |                                          ANY |                               |          0 / 0 
       2 |    failed |First gang |         |      GaudiExec |          Dirac |                                          ANY |                               |          0 / 0 
       3 |       new |Dst_D0--22 |         |     Executable |      Localhost |                                              |Dst_D0--22_02_04--mc--tracker_ |        0/0/0/0 
       4 |       new |Dst_D0--22 |         |     Executable |      Localhost |                                              |Dst_D0--22_02_04--mc--tracker_ |        0/0/0/0 
       5 |       new |Dst_D0--22 |         |     Executable |      Localhost |                                              |Dst_D0--22_02_04--mc--tracker_ |        0/0/0/0 
       6 |       new |Dst_D0--22 |         |     Executable |      Localhost |                                              |Dst_D0--22_02_04--mc--tracker_ |        0/0/0/0 
       7 |       new |Dst_D0--22 |         |     Executable |      Localhost |                                              |Dst_D0--22_02_04--mc--tracker_ |        0/0/0/0 
       8 |       new |Dst_D0--22 |         |     Executable |      Localhost |                                              |Dst_D0--22_02_04--mc--tracker_ |        0/0/0/0 
       9 |       new |Dst_D0--22 |         |     Executable |      Localhost |                                              |Dst_D0--22_02_04--mc--tracker_ |        0/0/0/0 
      10 |       new |Dst_D0--22 |         |     Executable |      Localhost |                                              |Dst_D0--22_02_04--mc--tracker_ |        0/0/0/0 

No idea what those first 4 jobs are though.

yipengsun commented 2 years ago

Those are probably the test jobs you submitted a long time ago; you can safely remove them with, say, jobs[0].remove() in ganga.
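
For example, a tiny sketch for cleaning them out inside the ganga session (the ids are placeholders; check the jobs listing before removing anything):

```python
# Sketch: remove leftover test jobs by id inside the ganga session.
# The ids below are placeholders, not a recommendation.
for jid in (0, 1, 2):
    jobs[jid].remove()
```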

afernez commented 2 years ago

I also submitted my (small) jobs, with this script.

I can see on Dirac that most of my jobs are done running already, so they probably just have to be downloaded to eos now. I am slightly worried about the state of ganga (as usual...), because it seems to be freezing again when I try to enter the ipython session. I will let it go for a while, though, and hope that the jobs are downloading correctly and that, once they're done, I'll be able to run ganga normally.

manuelfs commented 2 years ago

My jobs do not seem to have started (same output as above when I type jobs in ganga). Is there anything that I can check?

yipengsun commented 2 years ago

For jobs not starting, I have no idea. Maybe wait another day, and if they still don't start, run jobs[index].resubmit()?

yipengsun commented 2 years ago

I finished production for all Bs jobs. The DaVinci ntuples are 12 GB in total (this is before any local skimming). After skimming, the total size is about 7.5 GB.

manuelfs commented 2 years ago

My jobs never started, so I deleted them and resubmitted them. Then I realized that I got an error:

|03:56:12|lxplus776:~/code/lhcb-ntuples-gen/run2-rdx/jobs$ ./22_02_03-tracker_only_ddx_22to25.sh

*** Welcome to Ganga ***
Version: 8.5.7
Documentation and support: http://cern.ch/ganga
Type help() or help('index') for online help.

This is free software (GPL), and you are welcome to redistribute it
under certain conditions; type license() for details.

INFO     reading config file /afs/cern.ch/user/m/manuelf/.gangarc
INFO     reading config file /cvmfs/ganga.cern.ch/Ganga/install/8.5.7/lib/python3.8/site-packages/ganga/GangaLHCb/LHCb.ini
INFO     reading config file /cvmfs/lhcb.cern.ch/lib/GangaConfig/config/8-0-0/GangaLHCb.ini
INFO     Using LHCbDirac version prod
 === Welcome to Ganga on CVMFS. In case of problems contact lhcb-distributed-analysis@cern.ch === 
Reconstruction script: ../reco_Dst_D0.py
Condition file: ../conds/cond-mc-2016-md-sim09k-tracker_only.py
LFN: /MC/2016/Beam6500GeV-2016-MagDown-TrackerOnly-Nu1.6-25ns-Pythia8/Sim09k/Reco16/Filtered/11874050/D0TAUNU.SAFESTRIPTRIG.DST
NTuple name: Dst_D0--22_02_07--mc--tracker_only--MC_2016_Beam6500GeV-2016-MagDown-TrackerOnly-Nu1.6-25ns-Pythia8_Sim09k_Reco16_Filtered_11874050_D0TAUNU.SAFESTRIPTRIG.DST.root
Truncated job name: Dst_D0--22_02_07--mc--11874050--tracker_only--MC_2016_Beam6500GeV-2016-MagDown-T
Preparing job Dst_D0--22_02_07--mc--11874050--tracker_only--MC_2016_Beam6500GeV-2016-MagDown-T
GangaDiracError: All the files are only available on archive SEs. It is likely the data set has been archived. Contact data management to request that it be staged
(consider --debug option for more information)
INFO     Stopping the DIRAC process
INFO     Stopping Job processing before shutting down Repositories
INFO     Shutting Down Ganga Repositories
INFO     Registry Shutdown

@afernez @yipengsun Have you gotten "GangaDiracError: All the files are only available on archive SEs. It is likely the data set has been archived. Contact data management to request that it be staged" before? Is it a bug on my part, or were precisely the 8 samples I sent not staged?

yipengsun commented 2 years ago

I had a similar problem before, which turned out to be that the LFNs were wrong (remember that at one point I asked Svende to send an email to some convenor to re-stage these files, only to discover that the files don't exist).

In this case, since the main part of the LFN works: /MC/2016/Beam6500GeV-2016-MagDown-TrackerOnly-Nu1.6-25ns-Pythia8/Sim09k/Reco16/Filtered/, I'm wondering if the MC IDs listed in the table are actually wrong. Could you check that on DIRAC?
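
One way to check this from inside ganga, rather than the DIRAC web portal, is a bookkeeping query; here is a sketch, reusing the path format from the job log above (swap in the MC ID you want to test):

```python
# Sketch, run inside ganga: check whether a bookkeeping path resolves to any files.
bk_path = ('/MC/2016/Beam6500GeV-2016-MagDown-TrackerOnly-Nu1.6-25ns-Pythia8/'
           'Sim09k/Reco16/Filtered/11874050/D0TAUNU.SAFESTRIPTRIG.DST')
ds = BKQuery(bk_path).getDataset()
print(f'{len(ds.files)} file(s) found for this path')  # 0 files -> the ID likely doesn't exist
```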

yipengsun commented 2 years ago

And now you can see that this error message:

GangaDiracError: All the files are only available on archive SEs. It is likely the data set has been archived. Contact data management to request that it be staged
(consider --debug option for more information)

can be misleading: it gets printed even if the sample doesn't exist anywhere!

yipengsun commented 2 years ago

I also checked the table in the top post and the data sources listed in our wiki. They are consistent. So if the MC IDs are wrong, they are at least wrong in a consistent way.

manuelfs commented 2 years ago

Thank you Yipeng!

It was indeed that the MC IDs didn't exist (I had mistakenly used the Run 1 IDs...). From now on I'll know that this message is misleading.

I submitted the jobs with the correct IDs and now they appear as submitted and running as opposed to new.

    fqid |    status |      name | subjobs |    application |        backend |                             backend.actualCE |                       comment |  subjob status 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      19 | submitted |Dst_D0--22 |      42 |      GaudiExec |          Dirac |                                         None |Dst_D0--22_02_07--mc--tracker_ |       40/0/0/0 
      20 | submitted |Dst_D0--22 |      61 |      GaudiExec |          Dirac |                                         None |Dst_D0--22_02_07--mc--tracker_ |       60/0/0/0 
      21 |   running |Dst_D0--22 |       9 |      GaudiExec |          Dirac |                                         None |Dst_D0--22_02_07--mc--tracker_ |        9/0/0/0 
      22 |   running |Dst_D0--22 |       9 |      GaudiExec |          Dirac |                                         None |Dst_D0--22_02_07--mc--tracker_ |        9/0/0/0 
      23 | submitted |Dst_D0--22 |      25 |      GaudiExec |          Dirac |                                         None |Dst_D0--22_02_07--mc--tracker_ |       23/0/0/0 
      24 | submitted |Dst_D0--22 |      28 |      GaudiExec |          Dirac |                                         None |Dst_D0--22_02_07--mc--tracker_ |       27/0/0/0 
      25 |   running |Dst_D0--22 |       7 |      GaudiExec |          Dirac |                                         None |Dst_D0--22_02_07--mc--tracker_ |        7/0/0/0 
      26 |   running |Dst_D0--22 |       8 |      GaudiExec |          Dirac |                                         None |Dst_D0--22_02_07--mc--tracker_ |        8/0/0/0 

I also committed the corrected script.

By the way, here's the number of MC events on disk per year, in markdown:

| # | Sample | Name | MC ID | TOTAL [M] | 2015 [M] | 2016 [M] | 2017 [M] | 2018 [M] |
|---|--------|------|-------|-----------|----------|----------|----------|----------|
| 1 | D0 | B- → D0 μ ν | 12573012 | 161.14 | 7.90 | 45.44 | 47.37 | 60.43 |
| 2 | D0/D*+ | B0 → D*+ μ ν | 11574021 | 274.86 | 13.47 | 77.51 | 80.81 | 103.07 |
| 3 | D0 | B- → D*0 μ ν | 12773410 | 452.40 | 22.17 | 127.58 | 133.00 | 169.65 |
| 4 | D0 | B- → D0 τ ν | 12573001 | 11.08 | 0.54 | 3.13 | 3.26 | 4.16 |
| 5 | D0/D*+ | B0 → D*+ τ ν | 11574011 | 60.63 | 2.97 | 17.10 | 17.82 | 22.74 |
| 6 | D0 | B- → D*0 τ ν | 12773400 | 35.07 | 1.72 | 9.89 | 10.31 | 13.15 |
| 7 | D0/D*+ | B0 → D**+ μ ν | 11874430 | 154.35 | 7.56 | 43.53 | 45.38 | 57.88 |
| 8 | D0/D*+ | B0 → D**+ τ ν | 11874440 | 1.20 | 0.06 | 0.34 | 0.35 | 0.45 |
| 9 | D0/D*+ | B- → D**0 μ ν | 12873450 | 126.78 | 6.21 | 35.75 | 37.27 | 47.54 |
| 10 | D0/D*+ | B- → D**0 τ ν | 12873460 | 1.80 | 0.09 | 0.51 | 0.53 | 0.68 |
| 11 | D0 | B- → D**(→D0ππ) μ ν | 12675011 | 22.21 | 1.09 | 6.26 | 6.53 | 8.33 |
| 12 | D0 | B0 → D**(→D0ππ) μ ν | 11674401 | 24.54 | 1.20 | 6.92 | 7.22 | 9.20 |
| 13 | D0/D*+ | B- → D*(→D+ππ) μ ν | 12675402 | 15.87 | 0.78 | 4.48 | 4.67 | 5.95 |
| 14 | D0/D*+ | B0 → D*(→D+ππ) μ ν | 11676012 | 16.24 | 0.80 | 4.58 | 4.77 | 6.09 |
| 15 | D0 | B- → D*(→D0ππ) μ ν | 12875440 | 26.62 | 1.30 | 7.51 | 7.83 | 9.98 |
| 16 | D0 | Bs → Ds**(→D0K) μ ν | 13874020 | 5.48 | 0.27 | 1.55 | 1.61 | 2.06 |
| 17 | D*+ | Bs → D**+ μ ν | 13674000 | 5.04 | 0.25 | 1.42 | 1.48 | 1.89 |
| 18 | D0 | B0 → D0(Xc → μ νX')X | 11894600 | 125.90 | 6.17 | 35.50 | 37.01 | 47.21 |
| 19 | D0 | B0 → D0(Ds → τν)X | 11894200 | 3.46 | 0.17 | 0.97 | 1.02 | 1.30 |
| 20 | D0 | B+ → D0(Xc → μ νX')X | 12896300 | 75.81 | 3.71 | 21.38 | 22.29 | 28.43 |
| 21 | D0 | B+ → D0(Ds → τν)X | 12896310 | 8.87 | 0.43 | 2.50 | 2.61 | 3.33 |
| 22 | D*+ | B0 → D*+ (Xc → μ ν X')X | 11894610 | 44.62 | 2.19 | 12.58 | 13.12 | 16.73 |
| 23 | D*+ | B0 → D*+(Ds → τ ν) X | 11894210 | 4.12 | 0.20 | 1.16 | 1.21 | 1.54 |
| 24 | D*+ | B+ → D*+ (Xc → μ ν X')X | 12895400 | 18.03 | 0.88 | 5.09 | 5.30 | 6.76 |
| 25 | D*+ | B+ → D*+(Ds → τ ν) X | 12895000 | 3.00 | 0.15 | 0.85 | 0.88 | 1.13 |

yipengsun commented 2 years ago

I think we should copy your MD table to the top post, as well as add it to our docs, as it is SUPER useful.

yipengsun commented 2 years ago

Added Manuel's markdown table at: https://github.com/umd-lhcb/rdx-run2-analysis/blob/master/docs/mc_prod.md#overview

yipengsun commented 2 years ago

@manuelfs Actually, the pdf version of the IDs does contain at least 2 errors: indices 20 and 21 should be 12893600 and 12893610.

I can fix the markdown table; can you fix the pdf table?

afernez commented 2 years ago

@yipengsun With your fix of placing my gangadir on AFS but setting the workspace folder as a soft link to a folder on EOS, everything is finally working for me with ganga! I've submitted the jobs for my "remaining D* Tau Nu" task above now, and they're running normally (and my monitoring is working correctly).

yipengsun commented 2 years ago

Submitted the J/psi K data job. Now for me the only missing part is the J/psi K MC.

manuelfs commented 2 years ago

> @manuelfs Actually, the pdf version of the IDs does contain at least 2 errors: indices 20 and 21 should be 12893600 and 12893610.
>
> I can fix the markdown table; can you fix the pdf table?

I fixed the source Excel files and committed a .pdf version of the split-by-year table https://github.com/umd-lhcb/group-talks/tree/master/rdx/tables

yipengsun commented 2 years ago

Great. I'll add a link to that table and remove the buggy pdf then, to avoid confusion.

yipengsun commented 2 years ago

I have 1 French and 1 Russian site that keep failing: CPPM.fr and RRCKI.ru. I'm going to blacklist them and resubmit.

I believe the correct way to do it is this:

for sj in jobs[181].subjobs.select(status='failed'):
    sj.backend.settings["BannedSites"].append("LCG.RRCKI.ru")

Then resubmit.
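
A possible way to combine the two steps in one pass (a sketch; it assumes per-subjob resubmit() picks up the modified backend settings, and that "BannedSites" is already present, as in the snippet above):

```python
# Sketch: ban the failing site on each failed subjob, then resubmit it right away.
for sj in jobs[181].subjobs.select(status='failed'):
    sj.backend.settings["BannedSites"].append("LCG.RRCKI.ru")
    sj.resubmit()
```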

yipengsun commented 2 years ago

Some of the inputs are only available at RRCKI.ru. When I ban it, the jobs fail immediately.


yipengsun commented 2 years ago

I used the following command to get the input LFNs of a subjob:

jobs[189].subjobs[57].inputdata.getLFNs()
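
For instance, a sketch that collects the input LFNs of all failed subjobs in one go, to see which inputs are pinned to a problematic site:

```python
# Sketch: dump the input LFNs of every failed subjob of job 189.
lfns = set()
for sj in jobs[189].subjobs.select(status='failed'):
    lfns.update(sj.inputdata.getLFNs())
for lfn in sorted(lfns):
    print(lfn)
```
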
manuelfs commented 2 years ago

I'm trying to commit the DD ntuples, but keep getting the following errors when trying to sync

|10:32:30|glacier:~/code/lhcb-ntuples-gen$ git annex sync
pull origin 
Warning: Permanently added the ECDSA host key for IP address '140.82.113.4' to the list of known hosts.
ok
pull glacier 
ok
push origin 
Enumerating objects: 206, done.
Counting objects: 100% (206/206), done.
Delta compression using up to 32 threads
send-pack: unexpected disconnect while reading sideband packet
Compressing objects: 100% (158/158), done.
fatal: the remote end hung up unexpectedly

I tried git config http.postBuffer 524288000 as suggested here, and also

export GIT_TRACE_PACKET=1
export GIT_TRACE=1
export GIT_CURL_VERBOSE=1

as suggested here, to no avail. Any ideas?

yipengsun commented 2 years ago

Hmm, it could be because I'm copying a large number of files to glacier? I'm not sure if it's relevant, though. I'll google it and see what's going on.

yipengsun commented 2 years ago

Wait, this is an error on GitHub's end. Did you accidentally commit some large files to git directly, instead of annexing them?

yipengsun commented 2 years ago

Something like this: https://github.community/t/git-push-gives-me-fatal-the-more-end-hung-up-expectedly/183597/2

manuelfs commented 2 years ago

Looking at the history, I may indeed have added some of the files with git add . directly. I deleted my last commit (it was not pushed), unstaged the files, and will try to commit them again.

yipengsun commented 2 years ago

With my updated batch_skim.sh, I got:

Verifying output for Job 180, which has 111 subjobs...
subjob 85: ntuple missing!
subjob 88: ntuple missing!
subjob 98: ntuple missing!
Job 180 output verification failed with 3 error(s).

If I manually remove folder 85:

Verifying output for Job 180, which has 111 subjobs...
Found 110 subjobs, which =/= 111. Terminate now.

The actual outputs are colored. Also, the script terminates if the current job has any error.
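
For illustration, here is a minimal Python sketch of this kind of verification; batch_skim.sh itself is a shell script, so this is only a rough equivalent with a hypothetical folder layout:

```python
# Minimal Python sketch of the kind of check batch_skim.sh performs (assumption:
# each subjob has a numbered output folder that should contain one .root ntuple).
import sys
from pathlib import Path

def verify_job_output(job_dir: Path, expected_subjobs: int) -> int:
    """Return the number of subjob folders with a missing ntuple."""
    subdirs = sorted(d for d in job_dir.iterdir() if d.is_dir())
    if len(subdirs) != expected_subjobs:
        sys.exit(f"Found {len(subdirs)} subjobs, which =/= {expected_subjobs}. Terminate now.")
    errors = 0
    for d in subdirs:
        if not list(d.glob("*.root")):
            print(f"subjob {d.name}: ntuple missing!")
            errors += 1
    return errors
```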

yipengsun commented 2 years ago

I counted 14 MC species (each has 2 folders, 1 per polarity) as of Feb 27, 2022. The numbers check out.

yipengsun commented 2 years ago

The MC ghost production will be tracked in https://github.com/umd-lhcb/lhcb-ntuples-gen/issues/115. Closed.