Closed jjswan33 closed 11 years ago
Hi Josh,
Do you think we have time to check with the sync ntuples that things are okay? If so, what needs to be done for that? I'm working on a tag now; I still need Phil to fix his stuff first.
Evan
On Tue, Jun 4, 2013 at 2:47 PM, jjswan33 notifications@github.com wrote:
I need patuples for the TauPlusX Rereco and the embedded samples. Evan is preparing a tag that should be used.
Samples needed:
/TauPlusX/Run2012D-22Jan2013-v1/AOD
/TauPlusX/Run2012C-22Jan2013-v1/AOD
/TauPlusX/Run2012A-22Jan2013-v1/AOD
/TauPlusX/Run2012B-22Jan2013-v1/AOD
/DoubleMuParked/StoreResults-Run2012D_22Jan2013_v1_RHembedded_trans1_tau115_ptelec1_20had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER
/DoubleMuParked/StoreResults-Run2012D_22Jan2013_v1_RHembedded_trans1_tau116_ptmu1_16had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER
/DoubleMuParked/StoreResults-Run2012C_22Jan2013_v1_RHembedded_trans1_tau116_ptmu1_16had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER
/DoubleMu/StoreResults-Run2012A_22Jan2013_v1_RHembedded_trans1_tau115_ptelec1_20had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER
/DoubleMuParked/StoreResults-Run2012B_22Jan2013_v1_RHembedded_trans1_tau116_ptmu1_16had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER
/DoubleMuParked/StoreResults-Run2012C_22Jan2013_v1_RHembedded_trans1_tau115_ptelec1_20had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER
/DoubleMuParked/StoreResults-Run2012B_22Jan2013_v1_RHembedded_trans1_tau115_ptelec1_20had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER
/DoubleMu/StoreResults-Run2012A_22Jan2013_v1_RHembedded_trans1_tau116_ptmu1_16had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER
— Reply to this email directly or view it on GitHub: https://github.com/uwcms/FinalStateAnalysis/issues/208
I think we have some time, but how would you propose we cross-check? I'm not sure what we would compare. I guess we could process the VBF125 sample and compare with the sync tuples.
I am just updating the PAT recipe, the EGamma tag, and the MVA MET tag. I just want to make sure I haven't knocked us out of sync.
On Tue, Jun 4, 2013 at 2:53 PM, jjswan33 notifications@github.com wrote:
I think we have some time but how would you propose we cross check? I mean I am not sure what we would compare. I guess we could process the vbf125 sample and compare with sync tuples.
— Reply to this email directly or view it on GitHub: https://github.com/uwcms/FinalStateAnalysis/issues/208#issuecomment-18934549
@tsarangi,
OK, I think we are ready to go on these samples:
/TauPlusX/Run2012D-22Jan2013-v1/AOD
/TauPlusX/Run2012C-22Jan2013-v1/AOD
/TauPlusX/Run2012A-22Jan2013-v1/AOD
/TauPlusX/Run2012B-22Jan2013-v1/AOD
Can you pull to get the latest updates (and recompile) and PAT-tuplize them, please?
Please also do one embedded sample:
/DoubleMuParked/StoreResults-Run2012D_22Jan2013_v1_RHembedded_trans1_tau115_ptelec1_20had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER
so we can see if it works.
@tsarangi - What is the status of these samples?
I talked to Tapas; he is having problems with his grid cert. It is being renewed as we speak. Most of it should be done, but there are a few failed stragglers that need to be resubmitted.
On Mon, Jun 17, 2013 at 12:26 PM, jjswan33 notifications@github.com wrote:
@tsarangi https://github.com/tsarangi - What is the status of these samples?
— Reply to this email directly or view it on GitHub: https://github.com/uwcms/FinalStateAnalysis/issues/208#issuecomment-19560495
Resubmitting the failed ones, i.e. 2012A, B, and D. The datadef had the wrong run numbers for Run2012C and was filtering out most of the events. I modified that and am resubmitting this dataset. The embedded test sample is being resubmitted after a change Evan made to fix a crash for this dataset.
My grid cert is not yet renewed (waiting for approval), so I am working with the old one... I don't know whether these job resubmissions will actually work, since I made a bunch of changes for my new certificate :( messed up...
Embedded samples:
/hdfs/store/user/tapas/DoubleMuParked/StoreResults-Run2012D_22Jan2013_v1_RHembedded_trans1_tau115_ptelec1_20had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER/2013-06-11-8TeV-53X-PatTuple_Master/
I am resubmitting the failed ones once more and checking that all the others finished successfully...
Thanks, Tapas. Please stay on top of this for us; it is really urgent that we get all of these samples finished and get the Z+NJets running as well.
It seems that something I need to run the MVA MET is missing in the embedded sample:
----- Begin Fatal Exception 19-Jun-2013 07:26:21 CDT -----
An exception of category 'ProductNotFound' occurred while
  [0] Processing run: 206466 lumi: 272 event: 483296032
  [1] Running path 'runAnalysisSequence'
  [2] Calling event method for module PFMETProducerMVA/'pfMEtMVA'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::Track>' with ProductID '4:2413' can not be satisfied because the product cannot be found. Probably the branch containing the product is not stored in the input file.
Additional Info:
  [a] If you wish to continue processing events after a ProductNotFound exception, add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
----- End Fatal Exception -----
Some events seem to run fine though. Very strange.
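For reference, a minimal sketch of how the workaround mentioned in the exception text could be wired into a cmsRun configuration; the process name and input file below are placeholders, and this only skips the failing events for diagnosis rather than fixing the missing product:

```python
import FWCore.ParameterSet.Config as cms

# Minimal sketch, not the actual FSA patTuple config: let cmsRun skip events
# that throw ProductNotFound instead of aborting, so we can see how many
# events are affected. Process name and input file are placeholders.
process = cms.Process("PATTEST")
process.source = cms.Source(
    "PoolSource",
    fileNames = cms.untracked.vstring("file:embedded_AOD_test.root"),
)
process.options = cms.untracked.PSet(
    SkipEvent = cms.untracked.vstring("ProductNotFound")
)
```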
Can you ask Christian what the product might be?
On Wed, Jun 19, 2013 at 7:30 AM, jjswan33 notifications@github.com wrote:
Seems that something I need to run the MVA MET is missing in the embedded sample:
----- Begin Fatal Exception 19-Jun-2013 07:26:21 CDT -----
An exception of category 'ProductNotFound' occurred while
  [0] Processing run: 206466 lumi: 272 event: 483296032
  [1] Running path 'runAnalysisSequence'
  [2] Calling event method for module PFMETProducerMVA/'pfMEtMVA'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::Track>' with ProductID '4:2413' can not be satisfied because the product cannot be found. Probably the branch containing the product is not stored in the input file.
Additional Info:
  [a] If you wish to continue processing events after a ProductNotFound exception, add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
----- End Fatal Exception -----
— Reply to this email directly or view it on GitHub: https://github.com/uwcms/FinalStateAnalysis/issues/208#issuecomment-19680580
@tsarangi - Can you run 1000 events or so to test and see if it is working with these additional keep statements?
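A rough sketch (FWLite, so it assumes a CMSSW environment) of what such a ~1000-event check could look like: loop over a freshly produced test file and see whether the track collection the MVA MET complained about can actually be resolved. The file name and the "generalTracks" label are illustrative assumptions, not taken from the FSA configs:

```python
from DataFormats.FWLite import Events, Handle

# Sketch only: check the first ~1000 events of a test file for the
# reco::Track collection that PFMETProducerMVA failed to resolve.
# "patTuple_test.root" and the "generalTracks" label are placeholders.
events = Events("patTuple_test.root")
tracks = Handle("std::vector<reco::Track>")

n_checked, n_missing = 0, 0
for i, event in enumerate(events):
    if i >= 1000:
        break
    event.getByLabel("generalTracks", tracks)
    n_checked += 1
    if not tracks.isValid():
        n_missing += 1

print("checked %d events, %d missing the track collection" % (n_checked, n_missing))
```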
@tsarangi - Evan ran a test patuple and everything looks good now on my end. Can you please start running the embedded samples ASAP? I will let Evan comment on what needs to be updated.
@tsarangi all you need to do is pull from master and resubmit the embedded samples; no need to recompile.
For some reason we seem to be missing about 1/fb of the rereco data, at least 500/pb of it from 2012C. The rest may be from 2012D.
It looks like the 2012C run range may have been off. It should be 198022-203742 as reported here:
https://twiki.cern.ch/twiki/bin/view/CMS/PdmV2012Analysis
At least what is committed is wrong, but I think you may have used a different range when it was actually run.
@tsarangi - You didn't delete the test jobs in:
/hdfs/store/user/tapas/DoubleMuParked/StoreResults-Run2012D_22Jan2013_v1_RHembedded_trans1_tau115_ptelec1_20had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER/2013-06-11-8TeV-53X-PatTuple_Master/
before you reran. They are not 'overwritten' as you say; they are still the old, bugged versions, which are useless. Please rerun them ASAP. Our analysis will now be delayed another day because of this oversight.
@jjswan33 The output files are always overwritten when one resubmits, even if the hdfs directory exists. That is not what happened in this case. The /scratch directory holding the condor stderr/stdout wasn't deleted, and therefore the jobs with the new fix didn't get submitted for 2012D at all. This was an oversight on my part; sorry about that. They are being submitted now and should be done soon... Let me know if you see any other problems...
Thanks. So far everything else looks OK; I will let you know if I see more problems.
They haven't appeared yet, but check them as they appear here: /hdfs/store/user/tapas/DoubleMuParked/StoreResults-Run2012D_22Jan2013_v1_RHembedded_trans1_tau115_ptelec1_20had1_18_v1-f456bdbb960236e5c696adfe9b04eaae/USER/
OK. They should be fine if you used the same area as the rest, but I will check once something appears.
On another front, I am still concerned about the data. I am still missing luminosity, so either data is missing in the patuples or jobReportSummary.py has suddenly stopped working.
@jjswan33
As a lesson learned several times: don't take all of this for granted. You should check the patTuples as soon as they arrive; that way you can avoid any last-minute panic...
For lumis, my two cents: compare the number of patTuple files with the total number of AOD files for each dataset. That can give some clue... As far as I know, only 2 files are missing in 2012C. How much data are you missing? Do you think it may correspond to the missing files?
Like I mentioned earlier, there are two possibilities. (1) The existing patTuple files are corrupted: in this case, I need evidence that they are corrupted and need a resubmission. For example, show me a log from one of your jobs that failed while reading the patTuple files.
(2) When you loop over the files, the jobs are not finishing successfully for some other reason and you are missing a bunch of them. I don't know what jobReportSummary does; I've never used it before.
Can someone double-check what you are doing, meaning independently run their own code on the patTuple files?
Hi Tapas,
No worries, I will check as soon as some jobs start finishing.
About the lumi: 2 files cannot possibly account for the missing lumi, and I have no evidence of any corrupted files. If you can point me to the scratch working directory for the 2012C patuple jobs, I can run jobReportSummary on it. I will check for missing output files on my side soon.
Here are the condor submit directories:
login06:/scratch/tapas/2013-06-11-8TeV-53X-PatTuple_Master/
@tsarangi - I have looked in this directory. There are only 7098 job reports in it for 2012C, and according to the dag file there are significant failures.
Also, rerunning 2012C from scratch gives me the same luminosity I got the first time, with 7855 output files.
I would request that you delete the 2012C and 2012D TauPlusX and resubmit them. I don't know how else to proceed; something has to be wrong.
The dag file doesn't report the resubmission of failed jobs. Resubmission happens individually for each job, not through dag.
What do you mean, something has to be wrong? Please resubmit the jobs yourself, to avoid the same mistake that I might have made...
Fine... I didn't ask for you to be in charge of this; it was Sridhara's idea. I won't bother you with patuple requests any more. It was more efficient for me before you were doing it anyway.
I always used this magic command to only resubmit failed PAT tuple jobs:
ls /scratch/efriis/2012-10-02-8TeV-v1-Higgs/*/dags/dag.status | xargs grep -lir ERR | sed -e "s|status|rescue001|" | xargs -I{} -n 1 farmoutAnalysisJobs --rescue-dag-file={}
It has seemed to work in the past, at least.
On Sun, Jun 23, 2013 at 11:43 AM, Tapas Sarangi notifications@github.com wrote:
The dag file doesn't report the resubmission of failed jobs. Resubmission happens individually for each job, not through dag.
What to do you mean something has to be wrong ? Please resubmit the jobs yourself to avoid the same mistake that I might have done...
— Reply to this email directly or view it on GitHub: https://github.com/uwcms/FinalStateAnalysis/issues/208#issuecomment-19876832
I don't know what else to say. If the number of output patTuple files matches (or nearly matches) the total number of AOD files in the dataset, why does it matter what the dag output says? Can someone tell me what I am missing here, besides 'resubmit the whole thing'? Maybe I am unaware of how CMS luminosity is calculated! Is there any other way to calculate luminosity besides the dag output?
The dag file isn't used. We are using the .xml files that are produced in the submit directory. In your directory there are only 7098 of them. These should be there regardless of whether you resubmitted or not, so I don't know what to say either. All I can say for sure is that either the jobReportSummary code doesn't work correctly anymore or the luminosity isn't all there.
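For what it's worth, a rough sketch of the bookkeeping a jobReportSummary-style check does over those .xml job reports: collect the (run, lumi) pairs they claim were processed. The report layout assumed here (Run elements with nested LumiSection children) and the submit-directory glob are assumptions, so treat it as illustrative only:

```python
import glob
import xml.etree.ElementTree as ET

# Rough sketch: count distinct (run, lumisection) pairs recorded in the
# framework job report .xml files under the 2012C submit directory.
# Assumes each report has <Run ID="..."> elements with nested
# <LumiSection ID="..."/> children; adjust if the real layout differs.
submit_glob = "/scratch/tapas/2013-06-11-8TeV-53X-PatTuple_Master/data_TauPlusX_Run2012C_22Jan2013_v1/submit/*/*.xml"

lumis = set()
reports = glob.glob(submit_glob)
for report in reports:
    try:
        tree = ET.parse(report)
    except ET.ParseError:
        print("unparseable report: %s" % report)
        continue
    for run in tree.iter("Run"):
        run_id = int(run.get("ID"))
        for ls in run.iter("LumiSection"):
            lumis.add((run_id, int(ls.get("ID"))))

print("job reports found: %d" % len(reports))
print("distinct (run, lumi) pairs: %d" % len(lumis))
```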
@ekfriis I am trying your command and it is resubmitting the same jobs that were already done. That way you resubmit through the dag. I used --resubmit-failed-jobs without the dag; that way it doesn't override the dag status that existed for the failed jobs, but it does resubmit them...
OK, whatever works. Josh, how many output files do you expect instead of 7098?
I suggest someone try to recover the missing lumi, defined by the difference of the JSON files, in a separate job set using the feature I implemented here:
https://github.com/uwcms/FinalStateAnalysis/pull/227
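Assuming one has a JSON of the lumis that should be there and a JSON of what actually made it into the patuples, the "difference of the JSON files" could be computed along these lines with CMSSW's LumiList utility (both input file names below are placeholders):

```python
# Sketch using CMSSW's FWCore.PythonUtilities LumiList to get the lumi mask of
# what is missing: the target JSON minus the JSON actually covered by the patuples.
from FWCore.PythonUtilities.LumiList import LumiList

target    = LumiList(filename="Cert_22Jan2013_TauPlusX_target.json")  # what we should have (placeholder)
processed = LumiList(filename="patTuple_2012C_processed.json")        # what jobReportSummary saw (placeholder)

missing = target - processed            # LumiList supports set-like subtraction
missing.writeJSON("missing_lumis_2012C.json")
print("missing (run, lumi) pairs: %d" % len(missing.getLumis()))
```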
On Sun, Jun 23, 2013 at 12:06 PM, Tapas Sarangi notifications@github.com wrote:
@ekfriis https://github.com/ekfriis I am trying your command and it is resubmitting the same jobs that were already done. This way you are using the resubmission through dag. I used --resubmit-failed-jobs without the dag and this way it doesn't override the dag status that existed for the failed jobs, but it does resubmit them....
— Reply to this email directly or view it on GitHub: https://github.com/uwcms/FinalStateAnalysis/issues/208#issuecomment-19877173
OK, here is what I did: I tried resubmitting using the dag option and it ran for a couple of minutes before exiting successfully. I believe it checked the output files of the jobs and then exited...? And of course it overrode the dag status file. Can you recalculate the luminosity using jobReportSummary?
Still no change: only 7098 .xml files. Like I said earlier, it has nothing to do with the dag file. jobReportSummary only uses the .xml files from each individual job, which should be written whether or not you use dags.
OK. For a few xml files, can you check whether there is an output patTuple in the /hdfs area? If there is one, can you run on it to check whether it is corrupted? If it is not corrupted, is there any way to calculate the integrated luminosity from the output files rather than from the xml files? I guess Isobel calculated that the missing files correspond to only 15 pb-1. How did she do this? Maybe the same procedure can be used to calculate the entire lumi!?
Josh, if I look in:
/hdfs/store/user/tapas/TauPlusX/Run2012C-22Jan2013-v1/AOD/2013-06-11-8TeV-53X-PatTuple_Master/patTuple_cfg-* I see 7855 files.
I think you can write a little script to cross-reference the names of the files in hdfs with the names of the 7098 xml files, to see which ones are missing.
(Or, just in the scratch area, check the directories that have a .err but no .xml? There are 7855 .err files...)
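A minimal sketch of that cross-check: compare the patTuple_cfg-* names present in hdfs against the submit directories that actually contain a job-report .xml. The directory paths are taken from this thread, but the assumption that each submit subdirectory holds at most one report .xml is just that, an assumption:

```python
import glob
import os

# Sketch: find patTuple output files in hdfs whose submit directory has no
# framework job report .xml (the "have a patuple but no xml" cases).
hdfs_dir   = "/hdfs/store/user/tapas/TauPlusX/Run2012C-22Jan2013-v1/AOD/2013-06-11-8TeV-53X-PatTuple_Master"
submit_dir = "/scratch/tapas/2013-06-11-8TeV-53X-PatTuple_Master/data_TauPlusX_Run2012C_22Jan2013_v1/submit"

hdfs_jobs = set(os.path.splitext(os.path.basename(f))[0]
                for f in glob.glob(os.path.join(hdfs_dir, "patTuple_cfg-*.root")))
xml_jobs  = set(os.path.basename(os.path.dirname(f))
                for f in glob.glob(os.path.join(submit_dir, "*", "*.xml")))

suspect = sorted(hdfs_jobs - xml_jobs)
print("root files in hdfs  : %d" % len(hdfs_jobs))
print("jobs with an xml    : %d" % len(xml_jobs))
print("root file but no xml: %d" % len(suspect))
for name in suspect:
    print("  " + name)
```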
Well, this is the problem. I ran on every file (give or take one or two) and checked the output .xml files from those jobs, and I am missing 1/fb. So, assuming there is no problem with the lumi scripts, the lumi is missing in the patuples.
But what I mean is that you can isolate the files that are broken by just picking the ones that have a patuple but no xml, and resubmit only those?
Is it possible that a patTuple ROOT file can be written successfully while missing lumi blocks from the AOD file? Don't the patTuple files have a 1-to-1 correspondence with the AOD files? So, if there is no corruption in the patTuple file, how is it possible to have missing lumis?
@mcepeda Yes, that might work, but I have to figure out how to submit single jobs; I guess there are ways. One is to delete the output files and resubmit them...
I don't have any explanation. I just don't get all of the luminosity when I run on the patuples without job failures, and I can't confirm that all the luminosity is in the patuples.
So to me the fastest way to proceed is to rerun them; I am updating my area now. If we can find some other way to try to recover it from the existing output, then we should try.
More numerology:
In /scratch/tapas/2013-06-11-8TeV-53X-PatTuple_Master/data_TauPlusX_Run2012C_22Jan2013_v1
For two sample cases of failure:
a) /scratch/tapas/2013-06-11-8TeV-53X-PatTuple_Master/data_TauPlusX_Run2012C_22Jan2013_v1/submit/patTuple_cfg-02062A18-577A-E211-94C9-0030487E52A3/ (this has an empty .err and a failed .out, and no .xml)
b) /scratch/tapas/2013-06-11-8TeV-53X-PatTuple_Master/data_TauPlusX_Run2012C_22Jan2013_v1/submit/patTuple_cfg-80B01A12-BF77-E211-9BC0-0030487E52A5 (this has an empty .err and a failed .out, but does have an .xml)
I can make a list of the 1022 strange ones if you know how to resubmit only those (I think deleting the submit dir for only those should do the trick?).
Ah! I can open some of the .out files (the ones that do have an .xml, even though they are too small): File already exists: srm://cmssrm.hep.wisc.edu:8443/srm/v2/server?SFN=/hdfs/store/user/tapas/TauPlusX/Run2012C-22Jan2013-v1/AOD/2013-06-11-8TeV-53X-PatTuple_Master/patTuple_cfg-80B01A12-BF77-E211-9BC0-0030487E52A5.root; exiting as though successful.
I cannot open the ones without an associated .xml.
Hi Maria,
I think this happened when I used the dag to resubmit the jobs, after Evan suggested it. I am using --resubmit-failed-jobs for the submission, which uses CondorUserLog.py to check a job's log files rather than the dag. In any case, can you just test that you can read this particular file correctly? I have been hearing from Josh that he could read the patTuple files without any problem, so I doubt this is part of the problem.
Hey guys -- I think it had to do with the certificate junk mentioned above. The number of .root files that are older than 6 days:
» find /hdfs/store/user/tapas/TauPlusX/Run2012C-22Jan2013-v1/AOD/2013-06-11-8TeV-53X-PatTuple_Master/patTuple_cfg*.root -mtime +6 | wc -l
757
which is equal to 7855 (root files in hdfs) - 7098 (xml files) = 757
(not sure what to make of it, just thought I'd point it out)
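In case it helps with producing the list of stale outputs for resubmission, a Python sketch equivalent to Ian's find command that writes the old .root file names to a text file; the 6-day cutoff and the output file name are placeholders:

```python
import glob
import os
import time

# Sketch: the Python equivalent of Ian's `find ... -mtime +6` check, writing
# the names of the stale (pre-fix) .root files to a text file so they can be
# resubmitted. The 6-day cutoff and output file name are placeholders.
hdfs_glob = "/hdfs/store/user/tapas/TauPlusX/Run2012C-22Jan2013-v1/AOD/2013-06-11-8TeV-53X-PatTuple_Master/patTuple_cfg*.root"
cutoff = time.time() - 6 * 24 * 3600

old_files = [f for f in glob.glob(hdfs_glob) if os.path.getmtime(f) < cutoff]

with open("oldish_taupX_pattuples.txt", "w") as out:
    for f in sorted(old_files):
        out.write(f + "\n")

print("%d files older than 6 days" % len(old_files))
```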
On Sun, Jun 23, 2013 at 1:19 PM, Tapas Sarangi notifications@github.com wrote:
Hi Maria,
I think this happened when I used dag to resubmit the jobs after Evan suggested this. I am using --resubmit-failed-jobs for the submission, which used CondorUserLog.py to check the log files of a job rather than dag. In any case, can you just test you can read this particular file correctly, because I have been hearing from Josh that he could read the patTuple files without any problem, so I doubt this is a part of the problem.
— Reply to this email directly or view it on GitHub: https://github.com/uwcms/FinalStateAnalysis/issues/208#issuecomment-19878473
I just mean the patTuple_cfg-SomeStringOfNumbers.out in the submit directory.
The files with a missing xml are precisely the ones that I cannot open to read (so I cannot see what happened). I think it is too much of a coincidence for it to be just a fluke.
Can you read this one for instance?
/scratch/tapas/2013-06-11-8TeV-53X-PatTuple_Master/data_TauPlusX_Run2012C_22Jan2013_v1/submit/patTuple_cfg-02062A18-577A-E211-94C9-0030487E52A3/patTuple_cfg-02062A18-577A-E211-94C9-0030487E52A3.out
Reading Ian's mail --> I think we have it...
@iross, @mcepeda can you produce a list of these files? I can resubmit them.
The files older than 6 days:
http://www.hep.wisc.edu/~iross/oldish_taupX_pattuples.txt
I don't really understand the timeline here, to be honest. I'm guessing these pattuples were run with incorrect run-range defs, so they only contain a subset of what they should. And then the resubmission didn't work since the output files already existed? I don't really know how to figure it out (and I have my own issues to tackle), but it seems like we should try to track down exactly what went wrong so it doesn't happen again.
On Sun, Jun 23, 2013 at 1:30 PM, Tapas Sarangi notifications@github.com wrote:
@iross https://github.com/iross, @mcepeda https://github.com/mcepeda can you produce a list of these files ? i can resubmit them
— Reply to this email directly or view it on GitHub: https://github.com/uwcms/FinalStateAnalysis/issues/208#issuecomment-19878637