Closed: GandalfTheWhite2 closed this issue 4 years ago
And quite why it thinks there should be only 199 jobs (and not 200) is a second question. It seemed to happen to about half of the 8 similar jobs submitted: 4 with 200 identified jobs, 4 with only 199. The Slurm backend and the actual queue both had all 1600 jobs, but the pyHepGrid database seems to only retain some of them? This was submitted with 5 parallel queues. Does `--get-data` still get the data even if the job info is missing from the pyHepGrid database, but the output file is saved on the grid disk?
Not sure what the question is. If you are asking whether `--get-data` downloads jobs that are not marked as done in the database, the answer is yes.*
If you mean jobs which were not recorded at all, it is more complicated. There are several finalisation scripts to get the data. The default one should ask you whether the range of seeds you are downloading is correct and, if it isn't, let you enter your own.
*Note: you can use the `--done` option to only take finished jobs into account.
Also, the issue itself is known. There are times where (for whatever reason) the job information in the pyHepGrid database is not in sync with ARC, or where you have no way of getting the information from the ARC system (for instance, if you run on clusters at other universities). If you set `short_stats` to `False` in your header, you should see all those jobs marked as unknown.
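For reference, a minimal sketch of what that header setting might look like. The variable name `short_stats` comes from the discussion above; the file name and comment are assumptions, not pyHepGrid's documented layout:

```python
# Hypothetical excerpt from a pyHepGrid header file such as
# HEJ.hej_header.py (file name taken from the output quoted later in
# this thread; everything else here is illustrative):
short_stats = False  # show the full per-status breakdown, including Unknown jobs
```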
Should clusters at other institutions not still return a job status?
I don't know why, but I was never able to get any feedback from Edinburgh or Manchester (even though the jobs did run and copied things to the grid storage systems).
Gah, frustrating. We hope to roughly double the Durham cluster soon, so there will be less reason to run elsewhere, and we can go and kick the equipment when it fails.
@GandalfTheWhite2 I'm a bit late to this and don't quite follow - which command was the second set of output produced by?
As for the headline 'Done: 0' problem, I think that should now be fixed in the latest revision (these bugs are all an indirect consequence of the move to multicore submission, so it's good that we're catching them while they're fresh).
The latter output is produced by `pyHepGrid .... -I` and shows the status is correctly identified as Finished, even though the `-s` option doesn't list them as Done. And the issue of registering them as Done is indeed solved by the latest release. Thanks.
Hmm, that's weird, and I have no idea why there's a discrepancy between `-Bs` and `-BsI`. In principle `-BsI` just runs `arcstat` for all the jobs in the database entry, so the two lists should be of consistent length! Does `pyHepGrid man pyHepGrid/hej_runcard.py -Bs` show the correct number of jobs in the relevant table entries, or does that change between 199 and 200 too?
With the latest version it is still changing between 199 and 200, even though the total is reported as 200 with just one non-zero entry of 199:

```
bash-4.2$ pyHepGrid man pyHepGrid/hej_runcard.py -Bs -j 120
Using header file HEJ.hej_header.py
Value set: hej_runcard arc_submit_threads : 5
Value set: hej_runcard arcbase : .arc/jobs.dat
Value set: hej_runcard baseSeed : 6171
Value set: hej_runcard base_dir : <function base_dir at 0x7f23ed70ec80>
Value set: hej_runcard ce_base : ce1.dur.scotgrid.ac.uk
Value set: hej_runcard dbname : hej_database
Value set: hej_runcard dirac_name : marian.heil
Value set: hej_runcard events : -1
Value set: hej_runcard finalisation_script : results/CombineRuns_main
Value set: hej_runcard gfaldir : xroot:/se01.dur.scotgrid.ac.uk/dpm/dur.scotgrid.ac.uk/home/pheno/andersen/WJets_new
Value set: hej_runcard grid_input_dir : Wjets/input
Value set: hej_runcard grid_output_dir : Wjets/output
Value set: hej_runcard grid_warmup_dir : Wjets/HEJ.tar.gz
Value set: hej_runcard jobName : HEJ_Wjets
Value set: hej_runcard local_run_directory : Wjets
Value set: hej_runcard producRun : 200
Value set: hej_runcard provided_warmup_dir : Setup
Value set: hej_runcard runcardDir : Setup
Value set: hej_runcard split_dur_ce : False
Arc Production [120] Wp4j_HT2_7TeV : config_all
Done: 199  Waiting: 0  Running: 0  Failed: 0  Missing: 0  Total: 200
```
I suspect this may be to do with the caching of job statuses that we do internally, in order to speed up getting the status of jobs that we already know to have finished. Perhaps there is an inconsistency in the bookkeeping there, where the first/last job could be set as unknown and never properly re-checked.
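To make the caching idea concrete, here is a minimal sketch, under stated assumptions, of the scheme described above. None of these names are pyHepGrid's actual API; the point is only that a job wrongly cached in a terminal state would never be re-queried, which is the kind of inconsistency suspected here.

```python
# Illustrative sketch of status caching (NOT pyHepGrid's actual code):
# jobs already known to be in a terminal state are not re-queried.
TERMINAL = {"Done", "Failed"}

def update_statuses(cached, query_backend):
    """Refresh job statuses, skipping jobs already in a terminal state.

    cached:        dict mapping jobid -> last known status string
    query_backend: callable(jobid) -> fresh status string (e.g. via arcstat)
    """
    fresh = {}
    for jobid, status in cached.items():
        if status in TERMINAL:
            fresh[jobid] = status            # trust the cache for finished jobs
        else:
            fresh[jobid] = query_backend(jobid)
    return fresh
```

Under this scheme a stale terminal entry sticks forever, whereas an "Unknown" entry keeps being re-queried on every status call.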
On another note, can I just say how impressed I am with the download speed obtained using `--get_data`.
The missing job might be a "missing" job in the sense that it was submitted to arc but not added to the database (see discussion in #31 and commit 4866f130d8791dda082200520c1d1f43f36b9380).
I just pushed a related fix (4704333b4da4d7211ea8a8bd37352994cbfb7d63). Prior to this, missing (`cMISS`) job statuses got reassigned to unknown (`cUNK`). From now on they should stay missing.
All `cMISS` jobs should have `job_id="None"`. @GandalfTheWhite2, could you check whether your missing job is one of the (reassigned) `cMISS` jobs, e.g. by printing `jobid` in `_do_stats_job` inside `Backend.py`?
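A minimal sketch of the status logic the fix implies, assuming the behaviour described above (the constants and function here are illustrative, not pyHepGrid's actual implementation):

```python
# Illustrative status resolution (names assumed, not pyHepGrid's code):
# a job whose stored job_id is "None" was never registered with ARC, so
# it should stay Missing (cMISS) rather than be reassigned to Unknown
# (cUNK), which is reserved for jobs ARC simply returns nothing for.
cMISS = "Missing"
cUNK = "Unknown"

def resolve_status(job_id, arc_status):
    """Map a stored job_id and an ARC query result to a display status."""
    if job_id == "None":       # never submitted/recorded: keep Missing
        return cMISS
    if arc_status is None:     # ARC gave us nothing back
        return cUNK
    return arc_status
```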
Also, as @scarlehoff said before, setting `short_stats=False` in your runcard should show one job as "Unknown" (the short stats don't list `cUNK`).
I haven't experienced this since the commits I mentioned above. So I assume this is fixed.
but