Closed sina-masoud-ansari closed 13 years ago
This is weird. What is the reason for job failure in those cases?
I can't imagine how that could ever happen. Are you sure that the timestamped ones are not old ones? I think we switched over from timestamped jobs to numbered ones a few weeks ago...
Can you clear all jobs and submit a single job and confirm that behaviour? Also, is it different when you give the job a unique name yourself?
Markus, the code was supposed to work with the old backend, which does not support numbered job names. It checks for exceptions when trying to submit a job, and resubmits when the submission fails. I think that hack should be removed now, since both prod and dev support numbered job names. Can you do it, or shall I do it tomorrow?
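For readers following along, the fallback described above can be sketched roughly as follows. This is only an illustration in Python, not the actual client code: `submit`, `SubmissionError` and `now_ms` are hypothetical names, and the idea that the resubmission appends an epoch-millisecond suffix is an assumption inferred from job names such as sina_test_1305595484462 in this thread.

```python
import time


class SubmissionError(Exception):
    """Hypothetical error type standing in for a failed job submission."""


def current_millis():
    """Current time as integer milliseconds since the Unix epoch."""
    return int(time.time() * 1000)


def submit_with_fallback(submit, jobname, now_ms=current_millis):
    """Submit under `jobname`; if that raises, resubmit with a timestamp suffix.

    `submit` is a hypothetical callable that takes a job name and raises
    SubmissionError on failure. Returns the name the job ended up with.
    """
    try:
        submit(jobname)
        return jobname
    except SubmissionError:
        # Assumed fallback: append epoch milliseconds to the requested name.
        # Any submission error triggers the rename, not only a rejected name.
        fallback = f"{jobname}_{now_ms()}"
        submit(fallback)
        return fallback
```

If the real code behaves anything like this sketch, any exception during submission (not just an unsupported numbered name) would produce a timestamped job, which is one reason to drop the fallback once every backend supports numbered names.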
Probably not that urgent and better you do it since you know what you are talking about :-)
So it is on the dev backend that I can see the renaming.
$ java -jar gricli-binary.jar
gricli> print jobs
gricli : Done
gricli_1 : Done
gricli> quit

$ java -jar gricli-binary.jar -b BeSTGRID-DEV
gricli> print jobs
gricli : Done
gricli_1 : Done
gricli_1305595402743 : Failed
gricli_2 : Done
sina_test_1305595484462 : Failed
sina_test_1305595617711 : Failed
sina_test_1305595667002 : Failed
sina_test_1305595699234 : Failed
sina_test_1305595724605 : Failed
testagain_1305595998407 : Done
testagain_1305596041689 : Done
testagain_1305596153647 : Done
gricli>
I'm not sure what the failure reasons were but here are the details for a sleep job:
gricli> print job sina_test_1305595484462
Printing details for job sina_test_1305595484462
status: Failed
application : generic
applicationVersion : any_version
commandline : sleep 1000
cpus : 1
email_address :
email_on_finish : false
email_on_start : false
executable : sleep
factoryType : PBS
fqan : /nz/nesi
hostCount : -1
inputFilesUrls :
jobDirectory : gsiftp://gram5.ceres.auckland.ac.nz/home/smas036/grisu-test/sina_test_1305595484462
jobname : sina_test
memory : 2048
modules :
mountpoint : gsiftp://gram5.ceres.auckland.ac.nz/home/smas036
pbsDebug :
queue : default
stagingFileSystem : gsiftp://gram5.ceres.auckland.ac.nz
stderr : stderr.txt
stdout : stdout.txt
submissionHost : gram5.ceres.auckland.ac.nz
submissionLocation : default:gram5.ceres.auckland.ac.nz
submissionSite : Auckland
submissionTime : 1305595496572
submissionType : GT5
walltime : 10
workingDirectory : /home/smas036/grisu-test/sina_test_1305595484462
Most of those jobs were over a couple of weeks old. Auto naming submission and custom naming submission seem to be working fine.
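One way to check how old a timestamped job is would be to decode the suffix. The sketch below is in Python and assumes the 13-digit suffix is milliseconds since the Unix epoch, which fits the submissionTime value shown in the job details but is not confirmed anywhere in this thread. The function name `suffix_to_utc` is made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: a timestamped job name ends in "_" plus 13 digits (epoch ms).
_SUFFIX = re.compile(r"_(\d{13})$", re.ASCII)


def suffix_to_utc(job_name):
    """Return the UTC datetime encoded in a trailing 13-digit suffix, or None."""
    match = _SUFFIX.search(job_name)
    if match is None:
        return None
    # Split into whole seconds and milliseconds to avoid float rounding.
    seconds, millis = divmod(int(match.group(1)), 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


for name in ("gricli_1", "sina_test_1305595484462"):
    print(name, "->", suffix_to_utc(name))
```

Under that assumption, sina_test_1305595484462 decodes to 17 May 2011 (UTC), while numbered names such as gricli_1 return None.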
Wondering if the last few jobs in this list, off griclish-dev, are related:
gricli> print jobs
RMPISNOW_job_eg : Done
catjob-ndj : Done (ExitCode: 1)
catjob-ndj-my : Done (ExitCode: 1)
javajob : Done (ExitCode: 127)
javajob_1 : Done (ExitCode: 127)
p110a_lead4x_2011.03.24_11.39.572 : Done
szybkieg2_2011.03.10_17.55.662 : Done

gricli> print job p110a_lead4x_2011.03.24_11.39.572
Printing details for job p110a_lead4x_2011.03.24_11.39.572
status: Done
application : Gold
applicatpionVersion : 5.0
commandline : sh gold.sh p110a_lead4x.conf
concatenated_output : ./p110a_lead4x_out.sdf
cpus : 2
email_address : n.jones@auckland.ac.nz
email_on_finish : true
email_on_start : true
executable : sh
factoryType : PBS
fqan : /ARCS/BeSTGRID/Drug_discovery/Local
hostCount : 0
inputFilesUrls : gsiftp://ng2.auckland.ac.nz/home/njon001/grisu-jobs/p110a_lead4x_2011.03.24_11.39.572/gold.sh,gsiftp://ng2.auckland.ac.nz/home/njon001/grisu-jobs/p110a_lead4x_2011.03.24_11.39.572/gold.py,gsiftp://ng2.auckland.ac.nz/home/njon001/grisu-jobs/p110a_lead4x_2011.03.24_11.39.572/chemscore_kin.params,gsiftp://ng2.auckland.ac.nz/home/njon001/grisu-jobs/p110a_lead4x_2011.03.24_11.39.572/p110a_lead4x.conf,gsiftp://ng2.auckland.ac.nz/home/njon001/grisu-jobs/p110a_lead4x_2011.03.24_11.39.572/alpha_correct.mol2
jobDirectory : gsiftp://ng2.auckland.ac.nz/home/njon001/grisu-jobs/p110a_lead4x_2011.03.24_11.39.572
jobname : p110a_lead4x_2011.03.24_11.39.572
memory : 8589934592
modules :
mountpoint : gsiftp://ng2.auckland.ac.nz/home/njon001
pbsDebug :
queue : gold@er171.ceres.auckland.ac.nz
result_directory : ./Results
stagingFileSystem : gsiftp://ng2.auckland.ac.nz
stderr : stderr.txt
stdout : stdout.txt
submissionHost : ng2.auckland.ac.nz
submissionLocation : gold@er171.ceres.auckland.ac.nz:ng2.auckland.ac.nz
submissionSite : Auckland
submissionTime : 1300920002085
submissionType : GT4
walltime : 60
workingDirectory : /home/njon001/grisu-jobs/p110a_lead4x_2011.03.24_11.39.572
gricli>
Probably... Let's wait until Yuriy removes the offending code and then have a look at all the newly submitted jobs...
On 11/07/11 23:31, smas036 wrote:
Jobs seem to be renamed at some stage, from small integer endings, e.g. gricli_1, to what looks like a timestamp...
gricli> print jobs
gricli : Done
gricli_1 : Done
gricli_1305595402743 : Failed
gricli_2 : Done
sina_test_1305595484462 : Failed
sina_test_1305595617711 : Failed
sina_test_1305595667002 : Failed
sina_test_1305595699234 : Failed
sina_test_1305595724605 : Failed
testagain_1305595998407 : Done
testagain_1305596041689 : Done
testagain_1305596153647 : Done
Interesting. I reported this in May/June and thought it was fixed, so that jobs were named exactly as set in the job name, with no timestamp attached. Interesting that you got this again...
Cheers, Vlad
Jobs seem to be renamed at some stage, from small integer endings, e.g. gricli_1, to what looks like a timestamp...
gricli> print jobs
gricli : Done
gricli_1 : Done
gricli_1305595402743 : Failed
gricli_2 : Done
sina_test_1305595484462 : Failed
sina_test_1305595617711 : Failed
sina_test_1305595667002 : Failed
sina_test_1305595699234 : Failed
sina_test_1305595724605 : Failed
testagain_1305595998407 : Done
testagain_1305596041689 : Done
testagain_1305596153647 : Done
Would the backend I use to submit / check play a role?