scarlehoff / pyHepGrid

Tool for distributed computing management geared towards HEP applications.
GNU General Public License v3.0

pyHepGrid -s returns Done:0 even if job status is FINISHED #36

Closed GandalfTheWhite2 closed 4 years ago

GandalfTheWhite2 commented 5 years ago
bash-4.2$ pyHepGrid man pyHepGrid/hej_runcard.py -B  -s -j 114
> Using header file HEJ.hej_header.py
Value set: hej_runcard          arc_submit_threads : 5
Value set: hej_runcard          arcbase         : .arc/jobs.dat
Value set: hej_runcard          baseSeed        : 6171
Value set: hej_runcard          base_dir        : <function base_dir at 0x7fefaa38ac80>
Value set: hej_runcard          ce_base         : ce1.dur.scotgrid.ac.uk
Value set: hej_runcard          dbname          : hej_database
Value set: hej_runcard          dirac_name      : marian.heil
Value set: hej_runcard          events          : -1
Value set: hej_runcard          finalisation_script : results/CombineRuns_main
Value set: hej_runcard          gfaldir         : xroot:/se01.dur.scotgrid.ac.uk/dpm/dur.scotgrid.ac.uk/home/pheno/andersen/WJets_new
Value set: hej_runcard          grid_input_dir  : Wjets/input
Value set: hej_runcard          grid_output_dir : Wjets/output
Value set: hej_runcard          grid_warmup_dir : Wjets/HEJ.tar.gz
Value set: hej_runcard          jobName         : HEJ_Wjets
Value set: hej_runcard          local_run_directory : Wjets
Value set: hej_runcard          producRun       : 200
Value set: hej_runcard          provided_warmup_dir : Setup
Value set: hej_runcard          runcardDir      : Setup
Value set: hej_runcard          split_dur_ce    : False
> Arc Production
[114] Wm2j_HT2_7TeV       : config_all Done:    0  Waiting:    0  Running:    0  Failed:    0  Missing:    0  Total:  200

but

Job: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QViNDmJ9JZvnXk5IKnL2N00mABFKDmABFKDmlgKKDmABFKDmuRj7Jo
 Name: HEJ_Wjets
 State: Finished
 Specific state: FINISHED
 Exit Code: 0
 Owner: /C=UK/O=eScience/OU=Durham/L=eScience/CN=jeppe andersen
 Other Messages: SubmittedVia=org.nordugrid.gridftpjob
 Queue: ce1
 Requested Slots: 1
 Stdin: /dev/null
 Stdout: stdout
 Stderr: stderr
 Computing Service Log Directory: testjob.log
 Submitted: 2019-10-01 00:20:41
 End Time: 2019-10-01 12:36:55
 Submitted from: 193.60.193.50:58672
 Used CPU Time: 9 hours 58 minutes
 Used Wall Time: 9 hours 58 minutes
 Results must be retrieved before: 2019-10-04 10:21:45
 Proxy valid until: 2019-10-01 22:53:22
 Entry valid from: 2019-10-01 13:03:19
 Entry valid for: 3 hours
 ID on service: QViNDmJ9JZvnXk5IKnL2N00mABFKDmABFKDmlgKKDmABFKDmuRj7Jo
 Service information URL: ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(objectClass=*) (org.nordugrid.ldapng)
 Job status URL: ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(nordugrid-job-globalid=gsiftp:\2f\2fce1.dur.scotgrid.ac.uk:2811\2fjobs\2fQViNDmJ9JZvnXk5IKnL2N00mABFKDmABFKDmlgKKDmABFKDmuRj7Jo) (org.nordugrid.ldapng)
 Job management URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs (org.nordugrid.gridftpjob)
 Stagein directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QViNDmJ9JZvnXk5IKnL2N00mABFKDmABFKDmlgKKDmABFKDmuRj7Jo
 Stageout directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QViNDmJ9JZvnXk5IKnL2N00mABFKDmABFKDmlgKKDmABFKDmuRj7Jo
 Session directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QViNDmJ9JZvnXk5IKnL2N00mABFKDmABFKDmlgKKDmABFKDmuRj7Jo

Status of 199 jobs was queried, 199 jobs returned information

GandalfTheWhite2 commented 5 years ago

And quite why it thinks it should be only 199 jobs (and not 200) is a second question. It happened to roughly half of the 8 similar submissions: 4 reported 200 identified jobs, 4 only 199. The Slurm backend and the actual queue had all 1600 jobs, but it seems the pyHepGrid database only retains some of them? These were submitted with 5 parallel queues. Does --get-data still get the data even if the job info is missing from the pyHepGrid database, but the output file is saved on the grid disk?

scarlehoff commented 5 years ago

Does --get-data still get the data even if the job info is missing from the pyHepGrid database, but the output file is saved on the grid disk?

I'm not sure what the question is. If you are asking whether --get-data downloads jobs that are not marked as done in the database, the answer is yes.*

If you mean jobs which were not recorded at all, it is more complicated. There are several finalisation scripts to get the data. The default one should ask you whether the range of seeds you are downloading is correct, but if it isn't, you can't insert your own.

*Note: You can use the --done option to only take into account finished jobs.
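
To make that distinction concrete, here is a toy Python sketch of the filtering the --done option is said to perform; nothing in it is pyHepGrid's actual code, and the seed numbers are just taken from the baseSeed shown in the output above.

# Toy illustration (not pyHepGrid's implementation) of the --done flag:
# without it every job of the run is considered for download, with it
# only the jobs whose database status is already "done".
jobs = {6171: "done", 6172: "unknown", 6173: "done"}   # seed -> status in the database

def seeds_to_download(jobs, done_only=False):
    # done_only mimics passing --done
    return [seed for seed, status in jobs.items() if not done_only or status == "done"]

print(seeds_to_download(jobs))                  # [6171, 6172, 6173]
print(seeds_to_download(jobs, done_only=True))  # [6171, 6173]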

scarlehoff commented 5 years ago

Also, the issue itself is known. There are times when (for whatever reason) the job information in the pyHepGrid database is not in sync with Arc, or when you have no way of getting the information from the Arc system at all (if you run on clusters at other universities, for instance). If you set short_stats to False in your header, you should see all those jobs marked as unknown.
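
For reference, a minimal sketch of where that toggle would live, assuming the header options are plain module-level assignments in the Python header file (HEJ.hej_header.py in the output above); only the short_stats name itself comes from this thread.

# In your header file (e.g. HEJ.hej_header.py) -- assumed layout:
short_stats = False   # full per-job listing, so out-of-sync jobs show up as "unknown"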

GandalfTheWhite2 commented 5 years ago

Should clusters at other institutions not still return a job status?

scarlehoff commented 5 years ago

I don't know why, but I was never able to get any feedback from Edinburgh or Manchester (even though the jobs did run and copied things to the grid storage systems).

GandalfTheWhite2 commented 5 years ago

Gah, frustrating. We hope to ~double the Durham cluster soon, so there will be less reason to run elsewhere and we can go and kick the equipment when it fails.

jcwhitehead commented 5 years ago

@GandalfTheWhite2 I'm a bit late to this and don't quite follow - which command was the second set of output produced by?

As for the headline 'Done: 0' problem, I think that should now be fixed in the latest revision (these bugs are all an indirect consequence of the move to multicore submission, so it's good that we're catching them while they're fresh).

GandalfTheWhite2 commented 5 years ago

The latter output is produced by pyHepGrid .... -I and shows that the status is correctly identified as Finished, even though the -s option doesn't list the jobs as Done.

GandalfTheWhite2 commented 5 years ago

And the issue of registering them as Done is indeed solved by the latest release. Thanks.

jcwhitehead commented 5 years ago

Hmm, that's weird and I have no idea why there's a discrepancy between -Bs and -BsI. In principle -BsI just runs arcstat for all the jobs in the database entry so the two lists should probably be of consistent length! Does pyHepGrid man pyHepGrid/hej_runcard.py -Bs show the correct number of jobs in the relevant table entries, or does that change between 199 and 200 too?
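
For what it's worth, here is a rough sketch of that mechanism (not the actual Backend code): the per-job query simply walks the job ids stored for the database entry and asks arcstat about each one, so a 199-line result implies 199 stored ids.

# Illustrative only -- not pyHepGrid's implementation.
import subprocess

def query_arc_statuses(job_ids):
    """Run arcstat on each stored job id and collect its output."""
    statuses = {}
    for jid in job_ids:
        result = subprocess.run(["arcstat", jid], capture_output=True, text=True)
        statuses[jid] = result.stdout.strip()
    return statuses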

GandalfTheWhite2 commented 5 years ago

With the latest version it is still changing between 199 and 200 - even if the total is reported as 200 with just one non-zero entry of 199:

bash-4.2$ pyHepGrid man pyHepGrid/hej_runcard.py -Bs -j 120

> Using header file HEJ.hej_header.py
Value set: hej_runcard          arc_submit_threads : 5
Value set: hej_runcard          arcbase         : .arc/jobs.dat
Value set: hej_runcard          baseSeed        : 6171
Value set: hej_runcard          base_dir        : <function base_dir at 0x7f23ed70ec80>
Value set: hej_runcard          ce_base         : ce1.dur.scotgrid.ac.uk
Value set: hej_runcard          dbname          : hej_database
Value set: hej_runcard          dirac_name      : marian.heil
Value set: hej_runcard          events          : -1
Value set: hej_runcard          finalisation_script : results/CombineRuns_main
Value set: hej_runcard          gfaldir         : xroot:/se01.dur.scotgrid.ac.uk/dpm/dur.scotgrid.ac.uk/home/pheno/andersen/WJets_new
Value set: hej_runcard          grid_input_dir  : Wjets/input
Value set: hej_runcard          grid_output_dir : Wjets/output
Value set: hej_runcard          grid_warmup_dir : Wjets/HEJ.tar.gz
Value set: hej_runcard          jobName         : HEJ_Wjets
Value set: hej_runcard          local_run_directory : Wjets
Value set: hej_runcard          producRun       : 200
Value set: hej_runcard          provided_warmup_dir : Setup
Value set: hej_runcard          runcardDir      : Setup
Value set: hej_runcard          split_dur_ce    : False
> Arc Production
[120] Wp4j_HT2_7TeV       : config_all Done:  199  Waiting:    0  Running:    0  Failed:    0  Missing:    0  Total:  200

DWalker487 commented 5 years ago

I suspect this may be to do with the caching of job statuses that we do internally in order to speed up getting the status of jobs that we already know to have finished. Perhaps there is an inconsistency with the logging there, where the first/last job could be set as unknown and not be checked properly.
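
To make the suspected failure mode concrete, a toy example (nothing here is pyHepGrid's actual cache, and the job names are made up): if a job is neither in the finished-jobs cache nor returned by the fresh query, it simply drops out of the Done count while the Total still comes from the database.

# Toy illustration of a cached-status count losing one job.
finished_cache = {"job-001", "job-002"}          # states we already know are FINISHED
all_jobs = ["job-001", "job-002", "job-003"]     # ids the database entry holds

to_query = [j for j in all_jobs if j not in finished_cache]   # only "job-003"
fresh = {j: "FINISHED" for j in to_query if j != "job-003"}   # the query returns nothing for it

done = len(finished_cache) + sum(1 for s in fresh.values() if s == "FINISHED")
print(f"Done: {done}  Total: {len(all_jobs)}")   # Done: 2  Total: 3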

GandalfTheWhite2 commented 5 years ago

On another note, can I just say how impressed I am with the download speed obtained using --get_data.

marianheil commented 5 years ago

The missing job might be a "missing" job in the sense that it was submitted to arc but not added to the database (see discussion in #31 and commit 4866f130d8791dda082200520c1d1f43f36b9380).

I just pushed a related fix (4704333b4da4d7211ea8a8bd37352994cbfb7d63). Prior to this, missing (cMISS) job statuses were reassigned to unknown (cUNK); they should now stay missing.

All cMISS jobs should have job_id="None". @GandalfTheWhite2, could you check whether your missing job is one of the (reassigned) cMISS jobs, e.g. by printing jobid in _do_stats_job inside Backend.py? Also, as @scarlehoff said before, setting short_stats=False in your runcard should show one job as "Unknown" (the short stats don't list cUNK).
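
A possible way to check this from the stored ids, as a rough sketch only; the real field names in hej_database and the exact signature of _do_stats_job may differ, so treat this as an outline of the suggestion above rather than working pyHepGrid code.

# Rough sketch: count job ids stored as the literal string "None",
# which (per the comment above) is how reassigned cMISS jobs appear.
def count_cmiss_candidates(job_ids):
    missing = [jid for jid in job_ids if jid == "None"]
    print(f"{len(missing)} of {len(job_ids)} stored ids look like cMISS entries")
    return missing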

marianheil commented 4 years ago

I haven't experienced this since the commits I mentioned above, so I assume this is fixed.