scarlehoff / pyHepGrid

Tool for distributed computing management geared towards HEP applications.
GNU General Public License v3.0

Jobs failing due to submission being too quick #79

Closed htruong0 closed 4 years ago

htruong0 commented 4 years ago

https://github.com/scarlehoff/pyHepGrid/blob/f2d36426cc24b2210d626567c57a8cdd32cec232/src/pyHepGrid/src/runArcjob.py#L293-L296

A small percentage of the time jobs submitted to arc fail because they are submitted to the same CE at the same time. Potential fix would be to add a small sleep (~0.1s) between job submissions.
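A minimal sketch of the proposed sleep, not the actual pyHepGrid code: `submit_with_delay` and `submit_one` are hypothetical names standing in for whatever loop and submission call `runArcjob.py` uses, and the 0.1 s default is only the rough figure suggested above.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def submit_with_delay(
    jobs: Iterable[T],
    submit_one: Callable[[T], R],
    delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Submit jobs one at a time, pausing `delay` seconds between submissions.

    `submit_one` is a hypothetical callable that submits a single job and
    returns whatever the real submission returns (e.g. a job id).
    """
    results = []
    for index, job in enumerate(jobs):
        if index > 0:
            sleep(delay)  # pause before every submission except the first
        results.append(submit_one(job))
    return results
```

The `sleep` argument is injectable only so the loop can be exercised without actually waiting.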

jcwhitehead commented 4 years ago

Hi Henry, thanks for opening an issue. We multithreaded submission in #29 to speed up large jobs, which led to some problems when submitting to a single CE with too many threads. Could you let me know if you're multithreading submission, and if so, on how many threads?

The conclusion reached in #29 was that it should be ok with split_dur_ce=True and arc_submit_threads=5.

htruong0 commented 4 years ago

Hi James, I'm using the template header where arc_submit_threads=1 and split_dur_ce=True by default, so I don't think it's even multithreading the submission.

jcwhitehead commented 4 years ago

Interesting - think that's new if so. @marianheil have you noticed this?

I can do some testing later. It's possible that there's been some change to the CEs. I don't suppose you verified that adding the sleep you mentioned fixes this? If not I'll explore that when I have a look.

scarlehoff commented 4 years ago

If the threads are set to 1 it might be a problem on the ce side?

htruong0 commented 4 years ago

No, I haven't actually tried adding the sleep, but Adam suggested it to me as a fix when I asked him why my jobs failed.

marianheil commented 4 years ago

So far I only ever saw this with multi threaded submission, but I haven't used the grid recently. There was an update to arc a few weeks ago, maybe that changed something.

For testing we should not randomly switch between ce1 and ce2, but submit to one as quickly as possible (in one thread). In general we should properly (not randomly) alternate between ce1 and ce2 if we are submitting too quickly.
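A sketch of what strict (non-random) alternation could look like, assuming the CE list is known up front. `assign_ces` is a hypothetical helper, not an existing pyHepGrid function, and the `ce2` host name in the usage note is assumed by analogy with `ce1.dur.scotgrid.ac.uk`.

```python
from itertools import cycle
from typing import Iterable, List, Sequence, Tuple


def assign_ces(job_ids: Iterable[int], ces: Sequence[str]) -> List[Tuple[int, str]]:
    """Pair each job with a CE in strict round-robin order."""
    if not ces:
        raise ValueError("need at least one CE")
    ce_cycle = cycle(ces)  # ce1, ce2, ce1, ce2, ...
    return [(job_id, next(ce_cycle)) for job_id in job_ids]
```

With two CEs, `assign_ces(range(4), ["ce1.dur.scotgrid.ac.uk", "ce2.dur.scotgrid.ac.uk"])` sends consecutive jobs to different CEs, so each CE sees at most every second submission.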

DWalker487 commented 4 years ago

Not read this very closely, but I wonder if it couldn't also be r/w related to the arc job.dat file? I never managed to overload a ce with just one thread as far as I can remember. If it's on the gridui and the filesystem is being hammered, that might cause this kind of thing. Jobs would appear to fail when they were submitted faster than they could be written to the db. Putting the jobs file on the scratch disks helped a lot with this IME.

marianheil commented 4 years ago

> Not read this very closely, but I wonder if it couldn't also be r/w related to the arc job.dat file?

No, this is not the problem here. The database was correctly written, and all local job information is stored. The actual error @htruong0 got with arccat/pyHepGrid man runcard.py -p is Path "./simplerun.py" does not seem to exist on the grid node.

I got the same error a while back when using multi core submission. Adam diagnosed the error the same way both now and then:

> It looks like you've managed to submit too quickly. If two requests for a job hit the CE at the same time, then ARC can't keep up with its arc job ids and they collide.
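If the diagnosis is right, one way to keep two requests from reaching a CE at the same moment, even with several submission threads, is a shared limiter that spaces submissions by a minimum interval. This is only a sketch: `MinIntervalLimiter` is a hypothetical class, not part of pyHepGrid or ARC, and whether spacing the requests actually avoids the id collision is untested here.

```python
import threading
import time


class MinIntervalLimiter:
    """Grant submission slots that are at least `interval` seconds apart.

    Safe to share between threads: the lock is held while waiting, so
    callers are served strictly one after another.
    """

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic time of the next free slot

    def wait(self) -> float:
        """Block until a slot is free and return the time it was granted."""
        with self._lock:
            now = time.monotonic()
            while now < self._next_allowed:
                time.sleep(self._next_allowed - now)
                now = time.monotonic()
            self._next_allowed = now + self.interval
            return now
```

Each submission thread would call `limiter.wait()` immediately before submitting; with one limiter per CE, load on different CEs is not slowed down by the others.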

jcwhitehead commented 4 years ago

@marianheil We can slowly accumulate fixes to this in #80 .

marianheil commented 4 years ago

I was able to reproduce the error. Submission worked flawlessly for several thousand jobs (even with multicore submission) and then failed twice (4 of 500 jobs in one run and 4 of 100 in another), both on ce1 with single core submission and no load balancing.

There is nothing obviously going wrong in these jobs, but all failed jobs were submitted directly after each other. I will try slowing down the submission.

Full output:

All jobs

```
$ pyHepGrid man runcard_example.py -B -j4
> Using header file pyHepGrid.headers.template_header.py
Sourcing runcard
Value set: runcard_example arcbase : ../../../../../../scratch/mheil/tst_grid/jobs.dat
Value set: runcard_example baseSeed : 2
Value set: runcard_example ce_base : ce1.dur.scotgrid.ac.uk
Value set: runcard_example copy_log : True
Value set: runcard_example dbname : ../../../../../../scratch/mheil/tst_grid/database
> Be very careful if you're trying to override attributes that don't exist elsewhere.
> Or even if they do.
Value set: runcard_example exampleDir : .
Value set: runcard_example executable_exe : executable_example.sh
Value set: runcard_example executable_src_dir : .
> Be very careful if you're trying to override attributes that don't exist elsewhere.
> Or even if they do.
Value set: runcard_example grid_executable : example/executable.tar.gz
Value set: runcard_example grid_input_dir : example/input
Value set: runcard_example grid_output_dir : example/output
Value set: runcard_example jobName : EXAMPLE_RUN
Value set: runcard_example producRun : 100
Value set: runcard_example provided_warmup_dir : .
Value set: runcard_example runcardDir : .
Value set: runcard_example runfile : simplerun.py Value set: runcard_example runmode : backend_example.ExampleProgram Value set: runcard_example split_dur_ce : False > Overriding run mode to > Arc Production gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/xD6NDmggE3wnXk5IKnL2N00mABFKDmABFKDmR3FKDmABFKDmnF2Lkn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/PLRLDmigE3wnXk5IKnL2N00mABFKDmABFKDmC4FKDmABFKDmU8DeNo gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/5NrMDmjgE3wnXk5IKnL2N00mABFKDmABFKDmZ9FKDmABFKDmTZfUSo gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/kwEODmkgE3wnXk5IKnL2N00mABFKDmABFKDmeEGKDmABFKDmliZjgm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/K02KDmmgE3wnXk5IKnL2N00mABFKDmABFKDmrJGKDmABFKDmWysoxm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/PayLDmngE3wnXk5IKnL2N00mABFKDmABFKDmvOGKDmABFKDmHXUprn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/1ZpMDmogE3wnXk5IKnL2N00mABFKDmABFKDmIWGKDmABFKDmCQ08Om gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/023NDmpgE3wnXk5IKnL2N00mABFKDmABFKDm5bGKDmABFKDmjhqfVn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/sOlKDmrgE3wnXk5IKnL2N00mABFKDmABFKDmfhGKDmABFKDmxNPBJo gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/c73LDmsgE3wnXk5IKnL2N00mABFKDmABFKDmQnGKDmABFKDmQI568m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/Pc3MDmtgE3wnXk5IKnL2N00mABFKDmABFKDmPsGKDmABFKDmK1VVYm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/i8CODmugE3wnXk5IKnL2N00mABFKDmABFKDmHyGKDmABFKDmMPrrVn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/o8nKDmwgE3wnXk5IKnL2N00mABFKDmABFKDmK4GKDmABFKDmpQaZtm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/ynMLDmxgE3wnXk5IKnL2N00mABFKDmABFKDmONHKDmABFKDmoEJ74n gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/LwpLDmygE3wnXk5IKnL2N00mABFKDmABFKDmtnHKDmABFKDmlX07Pn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/cFpMDmzgE3wnXk5IKnL2N00mABFKDmABFKDmLEIKDmABFKDmiWbPwm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/TTeNDm0gE3wnXk5IKnL2N00mABFKDmABFKDmggIKDmABFKDmYl9oun gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/PtQODm1gE3wnXk5IKnL2N00mABFKDmABFKDmK3IKDmABFKDmgFRFWm 
gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/LijKDm3gE3wnXk5IKnL2N00mABFKDmABFKDmCcJKDmABFKDmC97tMn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/XCnLDm4gE3wnXk5IKnL2N00mABFKDmABFKDmL8JKDmABFKDmlv2Fim gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/fqdMDm5gE3wnXk5IKnL2N00mABFKDmABFKDmGYKKDmABFKDmrXpmcn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/kXVNDm6gE3wnXk5IKnL2N00mABFKDmABFKDmNoKKDmABFKDmSCv7kn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/0UQODm7gE3wnXk5IKnL2N00mABFKDmABFKDmX1KKDmABFKDmjSfdZn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/7M3KDm9gE3wnXk5IKnL2N00mABFKDmABFKDmbGLKDmABFKDmY7lukm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/l1WLDmAhE3wnXk5IKnL2N00mABFKDmABFKDmqULKDmABFKDm1LqUJm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/flOMDmBhE3wnXk5IKnL2N00mABFKDmABFKDmVmLKDmABFKDmD8KXKm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/Kb6MDmChE3wnXk5IKnL2N00mABFKDmABFKDmF4LKDmABFKDmTPHlZn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/wTFKDmEhE3wnXk5IKnL2N00mABFKDmABFKDmEKMKDmABFKDmW1J2Xm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/oLzKDmFhE3wnXk5IKnL2N00mABFKDmABFKDmdUMKDmABFKDmii8GQo gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/OWAMDmGhE3wnXk5IKnL2N00mABFKDmABFKDmCaMKDmABFKDmjheVsn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/TE7MDmHhE3wnXk5IKnL2N00mABFKDmABFKDmcfMKDmABFKDmjKs23m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/ZuxNDmIhE3wnXk5IKnL2N00mABFKDmABFKDmRkMKDmABFKDmtbzmPn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/TYoKDmKhE3wnXk5IKnL2N00mABFKDmABFKDmTpMKDmABFKDm7A49bn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/7NsLDmLhE3wnXk5IKnL2N00mABFKDmABFKDmfuMKDmABFKDmU4pdCn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/OErMDmMhE3wnXk5IKnL2N00mABFKDmABFKDmgzMKDmABFKDmxcc6sn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/aCBODmNhE3wnXk5IKnL2N00mABFKDmABFKDm04MKDmABFKDmYNfj6m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/Hr0KDmPhE3wnXk5IKnL2N00mABFKDmABFKDm49MKDmABFKDmOWAK2m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/KIdMDmQhE3wnXk5IKnL2N00mABFKDmABFKDmQFNKDmABFKDmLrnDvm 
gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/NuYNDmRhE3wnXk5IKnL2N00mABFKDmABFKDmYKNKDmABFKDmuNtK6m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/8MmKDmThE3wnXk5IKnL2N00mABFKDmABFKDmbPNKDmABFKDmGpP1Oo gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/HoULDmUhE3wnXk5IKnL2N00mABFKDmABFKDmoUNKDmABFKDmd4KUEo gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/As4LDmVhE3wnXk5IKnL2N00mABFKDmABFKDmnTFKDmABFKDmj6Kytn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/ThTMDmWhE3wnXk5IKnL2N00mABFKDmABFKDmvvFKDmABFKDmV9Z1Fn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/WKmMDmXhE3wnXk5IKnL2N00mABFKDmABFKDmsOGKDmABFKDmMvPXcn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/YCaNDmYhE3wnXk5IKnL2N00mABFKDmABFKDmeoGKDmABFKDmR14y4n gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/lTLODmZhE3wnXk5IKnL2N00mABFKDmABFKDmVNHKDmABFKDmTQMgnn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/naRKDmbhE3wnXk5IKnL2N00mABFKDmABFKDmwzHKDmABFKDmcZ6sXm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/HGmLDmchE3wnXk5IKnL2N00mABFKDmABFKDmnWIKDmABFKDmmXeIWn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/9odMDmdhE3wnXk5IKnL2N00mABFKDmABFKDmJnIKDmABFKDmnKhmcn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/bJ4MDmehE3wnXk5IKnL2N00mABFKDmABFKDmJ5IKDmABFKDmGlLzDm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/YCCODmfhE3wnXk5IKnL2N00mABFKDmABFKDmGKJKDmABFKDmCy9QQn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/YazKDmhhE3wnXk5IKnL2N00mABFKDmABFKDm8ZJKDmABFKDmAggp0n gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/PNTLDmihE3wnXk5IKnL2N00mABFKDmABFKDmHpJKDmABFKDmOpOKen gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/8GRMDmjhE3wnXk5IKnL2N00mABFKDmABFKDm42JKDmABFKDmCIUign gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/iv8MDmkhE3wnXk5IKnL2N00mABFKDmABFKDmPHKKDmABFKDmSMDuAn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/KhwNDmlhE3wnXk5IKnL2N00mABFKDmABFKDmVaKKDmABFKDmdtxvCn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/24JKDmnhE3wnXk5IKnL2N00mABFKDmABFKDmDoKKDmABFKDmzNorKn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/MCQLDmohE3wnXk5IKnL2N00mABFKDmABFKDma1KKDmABFKDmIrntjn 
gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/DV3LDmphE3wnXk5IKnL2N00mABFKDmABFKDmqGLKDmABFKDmcTXfMn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/eU2MDmqhE3wnXk5IKnL2N00mABFKDmABFKDmdNLKDmABFKDmpCr6ym gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/J60NDmrhE3wnXk5IKnL2N00mABFKDmABFKDm2SLKDmABFKDmYJZVUn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/0lyKDmthE3wnXk5IKnL2N00mABFKDmABFKDmIYLKDmABFKDmxt4PHn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/6eqLDmuhE3wnXk5IKnL2N00mABFKDmABFKDm5cLKDmABFKDmk94lGn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/CyoMDmvhE3wnXk5IKnL2N00mABFKDmABFKDmViLKDmABFKDmtcMrNo gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/4wbNDmwhE3wnXk5IKnL2N00mABFKDmABFKDmToLKDmABFKDm5ktwfm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/gofKDmyhE3wnXk5IKnL2N00mABFKDmABFKDmXtLKDmABFKDmQhRsEm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/xqfLDmzhE3wnXk5IKnL2N00mABFKDmABFKDmJzLKDmABFKDmLaUa4m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/q6uMDm0hE3wnXk5IKnL2N00mABFKDmABFKDm83LKDmABFKDmu8NVpm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/vDnNDm1hE3wnXk5IKnL2N00mABFKDmABFKDmh9LKDmABFKDmeFxL7m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/HdqKDm3hE3wnXk5IKnL2N00mABFKDmABFKDmSGMKDmABFKDm1Ckb3m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/MfvLDm4hE3wnXk5IKnL2N00mABFKDmABFKDm8LMKDmABFKDmqeeOtn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/pbWMDm5hE3wnXk5IKnL2N00mABFKDmABFKDmiYMKDmABFKDm1H91Fn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/GwxMDm6hE3wnXk5IKnL2N00mABFKDmABFKDmoxMKDmABFKDmhEfEcm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/1TDODm7hE3wnXk5IKnL2N00mABFKDmABFKDmARNKDmABFKDmEznQbm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QKKLDm9hE3wnXk5IKnL2N00mABFKDmABFKDmvTFKDmABFKDmIdWNIn < Fail gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/XbOMDmAiE3wnXk5IKnL2N00mABFKDmABFKDmsiFKDmABFKDmH3GpJn < Fail gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/w1RNDmBiE3wnXk5IKnL2N00mABFKDmABFKDmzoFKDmABFKDmBspMPn < Fail gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/CMOODmCiE3wnXk5IKnL2N00mABFKDmABFKDmztFKDmABFKDmcfFfxm < Fail 
gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/D49KDmEiE3wnXk5IKnL2N00mABFKDmABFKDmGzFKDmABFKDmqWnI4m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/GUFMDmFiE3wnXk5IKnL2N00mABFKDmABFKDmT4FKDmABFKDmHQsJsm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/E2LNDmGiE3wnXk5IKnL2N00mABFKDmABFKDmK9FKDmABFKDmxdB0Km gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/OYOODmHiE3wnXk5IKnL2N00mABFKDmABFKDmcEGKDmABFKDmWngEum gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/lyBLDmJiE3wnXk5IKnL2N00mABFKDmABFKDmaJGKDmABFKDm4eiNIn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/7tWMDmKiE3wnXk5IKnL2N00mABFKDmABFKDmWOGKDmABFKDmuDUmEn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/zpTNDmLiE3wnXk5IKnL2N00mABFKDmABFKDmMPGKDmABFKDmpdkMpn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/FNXKDmNiE3wnXk5IKnL2N00mABFKDmABFKDm2PGKDmABFKDmxOyaKn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/455LDmOiE3wnXk5IKnL2N00mABFKDmABFKDmnQGKDmABFKDmfkKBUn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/ARiNDmPiE3wnXk5IKnL2N00mABFKDmABFKDmHSGKDmABFKDmamCA3n gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/cXOLDmRiE3wnXk5IKnL2N00mABFKDmABFKDmFTGKDmABFKDmZ4TNrn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/qdDNDmSiE3wnXk5IKnL2N00mABFKDmABFKDmnTGKDmABFKDmUTI6Kn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/3MQKDmUiE3wnXk5IKnL2N00mABFKDmABFKDmZVGKDmABFKDmrCncOo gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QKoLDmViE3wnXk5IKnL2N00mABFKDmABFKDmWfGKDmABFKDmD8vpPo gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/tyzMDmWiE3wnXk5IKnL2N00mABFKDmABFKDmVEHKDmABFKDmjZP5Wn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/xPvNDmXiE3wnXk5IKnL2N00mABFKDmABFKDmLJHKDmABFKDmML3kHn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/nzsKDmZiE3wnXk5IKnL2N00mABFKDmABFKDmQOHKDmABFKDmGBuR1m gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/gvgLDmaiE3wnXk5IKnL2N00mABFKDmABFKDmqVHKDmABFKDmG4uk6n gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/Us9LDmbiE3wnXk5IKnL2N00mABFKDmABFKDmGzHKDmABFKDmfu5Vwm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/CnXMDmciE3wnXk5IKnL2N00mABFKDmABFKDmqVIKDmABFKDmMAxyDm 
gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/xTiMDmdiE3wnXk5IKnL2N00mABFKDmABFKDmTzIKDmABFKDmxpncSm gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/weINDmeiE3wnXk5IKnL2N00mABFKDmABFKDmjMJKDmABFKDm4SiWPn ```
Failed jobs

```
$ pyHepGrid man runcard_example.py -B -j4 -fl fail
> Using header file pyHepGrid.headers.template_header.py
Sourcing runcard
Value set: runcard_example arcbase : ../../../../../../scratch/mheil/tst_grid/jobs.dat
Value set: runcard_example baseSeed : 2
Value set: runcard_example ce_base : ce1.dur.scotgrid.ac.uk
Value set: runcard_example copy_log : True
Value set: runcard_example dbname : ../../../../../../scratch/mheil/tst_grid/database
> Be very careful if you're trying to override attributes that don't exist elsewhere.
> Or even if they do.
Value set: runcard_example exampleDir : .
Value set: runcard_example executable_exe : executable_example.sh
Value set: runcard_example executable_src_dir : .
> Be very careful if you're trying to override attributes that don't exist elsewhere.
> Or even if they do.
Value set: runcard_example grid_executable : example/executable.tar.gz
Value set: runcard_example grid_input_dir : example/input
Value set: runcard_example grid_output_dir : example/output
Value set: runcard_example jobName : EXAMPLE_RUN
Value set: runcard_example producRun : 100
Value set: runcard_example provided_warmup_dir : .
Value set: runcard_example runcardDir : .
Value set: runcard_example runfile : simplerun.py Value set: runcard_example runmode : backend_example.ExampleProgram Value set: runcard_example split_dur_ce : False > Overriding run mode to > Arc Production > Job status filter: fail gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QKKLDm9hE3wnXk5IKnL2N00mABFKDmABFKDmvTFKDmABFKDmIdWNIn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/XbOMDmAiE3wnXk5IKnL2N00mABFKDmABFKDmsiFKDmABFKDmH3GpJn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/w1RNDmBiE3wnXk5IKnL2N00mABFKDmABFKDmzoFKDmABFKDmBspMPn gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/CMOODmCiE3wnXk5IKnL2N00mABFKDmABFKDmztFKDmABFKDmcfFfxm $ pyHepGrid man runcard_example.py -B -j4 -fl fail -p --error > Printing information for job 4: dummy_folder (Production) [1/1] Path "./simplerun.py" does not seem to exist Path "./simplerun.py" does not seem to exist Path "./simplerun.py" does not seem to exist Path "./simplerun.py" does not seem to exist > $ pyHepGrid man runcard_example.py -B -j4 -fl fail -I > Retrieving information for job 4: dummy_folder (Production) [1/1] Job: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QKKLDm9hE3wnXk5IKnL2N00mABFKDmABFKDmvTFKDmABFKDmIdWNIn Name: EXAMPLE_RUN State: Failed Specific state: FAILED Exit Code: 0 Job Error: LRMS error: (1) Job failed Owner: /C=UK/O=eScience/OU=Durham/L=eScience/CN=marian heil Other Messages: SubmittedVia=org.nordugrid.gridftpjob Queue: ce1 Requested Slots: 1 Stdin: /dev/null Stdout: stdout Stderr: stderr Computing Service Log Directory: testjob.log Submitted: 2020-06-10 13:08:33 End Time: 2020-06-10 13:09:35 Submitted from: 193.60.193.12:44312 Used CPU Time: Used Wall Time: Results must be retrieved before: 2020-06-13 13:09:35 Proxy valid until: 2020-06-11 09:42:02 Entry valid from: 2020-06-10 13:36:50 Entry valid for: 3 hours ID on service: QKKLDm9hE3wnXk5IKnL2N00mABFKDmABFKDmvTFKDmABFKDmIdWNIn Service information URL: ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(objectClass=*) (org.nordugrid.ldapng) Job status URL: 
ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(nordugrid-job-globalid=gsiftp:\2f\2fce1.dur.scotgrid .ac.uk:2811\2fjobs\2fQKKLDm9hE3wnXk5IKnL2N00mABFKDmABFKDmvTFKDmABFKDmIdWNIn) (org.nordugrid.ldapng) Job management URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs (org.nordugrid.gridftpjob) Stagein directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QKKLDm9hE3wnXk5IKnL2N00mABFKDmABFKDmvTFKDmABFKDmIdWNIn Stageout directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QKKLDm9hE3wnXk5IKnL2N00mABFKDmABFKDmvTFKDmABFKDmIdWNIn Session directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QKKLDm9hE3wnXk5IKnL2N00mABFKDmABFKDmvTFKDmABFKDmIdWNIn Job: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/XbOMDmAiE3wnXk5IKnL2N00mABFKDmABFKDmsiFKDmABFKDmH3GpJn [33/563] Name: EXAMPLE_RUN State: Failed Specific state: FAILED Exit Code: 0 Job Error: LRMS error: (1) Job failed Owner: /C=UK/O=eScience/OU=Durham/L=eScience/CN=marian heil Other Messages: SubmittedVia=org.nordugrid.gridftpjob Queue: ce1 Requested Slots: 1 Stdin: /dev/null Stdout: stdout Stderr: stderr Computing Service Log Directory: testjob.log Submitted: 2020-06-10 13:08:34 End Time: 2020-06-10 13:09:35 Submitted from: 193.60.193.12:44326 Used CPU Time: Used Wall Time: Results must be retrieved before: 2020-06-13 13:09:35 Proxy valid until: 2020-06-11 09:42:02 Entry valid from: 2020-06-10 13:36:50 Entry valid for: 3 hours ID on service: XbOMDmAiE3wnXk5IKnL2N00mABFKDmABFKDmsiFKDmABFKDmH3GpJn Service information URL: ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(objectClass=*) (org.nordugrid.ldapng) Job status URL: ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(nordugrid-job-globalid=gsiftp:\2f\2fce1.dur.scotgrid .ac.uk:2811\2fjobs\2fXbOMDmAiE3wnXk5IKnL2N00mABFKDmABFKDmsiFKDmABFKDmH3GpJn) (org.nordugrid.ldapng) Job management URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs (org.nordugrid.gridftpjob) Stagein directory URL: 
gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/XbOMDmAiE3wnXk5IKnL2N00mABFKDmABFKDmsiFKDmABFKDmH3GpJn Stageout directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/XbOMDmAiE3wnXk5IKnL2N00mABFKDmABFKDmsiFKDmABFKDmH3GpJn Session directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/XbOMDmAiE3wnXk5IKnL2N00mABFKDmABFKDmsiFKDmABFKDmH3GpJn Job: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/w1RNDmBiE3wnXk5IKnL2N00mABFKDmABFKDmzoFKDmABFKDmBspMPn Name: EXAMPLE_RUN State: Failed Specific state: FAILED Exit Code: 0 Job Error: LRMS error: (1) Job failed Owner: /C=UK/O=eScience/OU=Durham/L=eScience/CN=marian heil Other Messages: SubmittedVia=org.nordugrid.gridftpjob Queue: ce1 Requested Slots: 1 Stdin: /dev/null Stdout: stdout Stderr: stderr Computing Service Log Directory: testjob.log Submitted: 2020-06-10 13:08:35 End Time: 2020-06-10 13:09:35 Submitted from: 193.60.193.12:44336 Used CPU Time: Used Wall Time: Results must be retrieved before: 2020-06-13 13:09:35 Proxy valid until: 2020-06-11 09:42:02 Entry valid from: 2020-06-10 13:36:50 Entry valid for: 3 hours ID on service: w1RNDmBiE3wnXk5IKnL2N00mABFKDmABFKDmzoFKDmABFKDmBspMPn Service information URL: ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(objectClass=*) (org.nordugrid.ldapng) Job status URL: ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(nordugrid-job-globalid=gsiftp:\2f\2fce1.dur.scotgrid .ac.uk:2811\2fjobs\2fw1RNDmBiE3wnXk5IKnL2N00mABFKDmABFKDmzoFKDmABFKDmBspMPn) (org.nordugrid.ldapng) Job management URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs (org.nordugrid.gridftpjob) Stagein directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/w1RNDmBiE3wnXk5IKnL2N00mABFKDmABFKDmzoFKDmABFKDmBspMPn Stageout directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/w1RNDmBiE3wnXk5IKnL2N00mABFKDmABFKDmzoFKDmABFKDmBspMPn Session directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/w1RNDmBiE3wnXk5IKnL2N00mABFKDmABFKDmzoFKDmABFKDmBspMPn Job: 
gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/CMOODmCiE3wnXk5IKnL2N00mABFKDmABFKDmztFKDmABFKDmcfFfxm Name: EXAMPLE_RUN State: Failed Specific state: FAILED Exit Code: 0 Job Error: LRMS error: (1) Job failed Owner: /C=UK/O=eScience/OU=Durham/L=eScience/CN=marian heil Other Messages: SubmittedVia=org.nordugrid.gridftpjob Queue: ce1 Requested Slots: 1 Stdin: /dev/null Stdout: stdout Stderr: stderr Computing Service Log Directory: testjob.log Submitted: 2020-06-10 13:08:37 End Time: 2020-06-10 13:09:35 Submitted from: 193.60.193.12:44346 Used CPU Time: Used Wall Time: Results must be retrieved before: 2020-06-13 13:09:35 Proxy valid until: 2020-06-11 09:42:02 Entry valid from: 2020-06-10 13:36:50 Entry valid for: 3 hours ID on service: CMOODmCiE3wnXk5IKnL2N00mABFKDmABFKDmztFKDmABFKDmcfFfxm Service information URL: ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(objectClass=*) (org.nordugrid.ldapng) Job status URL: ldap://ce1.dur.scotgrid.ac.uk:2135/Mds-Vo-name=local,o=Grid??sub?(nordugrid-job-globalid=gsiftp:\2f\2fce1.dur.scotgrid.ac.uk:2811\2fjobs\2fCMOODmCiE3wnXk5IKnL2N00mABFKDmABFKDmztFKDmABFKDmcfFfxm) (org.nordugrid.ldapng) Job management URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs (org.nordugrid.gridftpjob) Stagein directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/CMOODmCiE3wnXk5IKnL2N00mABFKDmABFKDmztFKDmABFKDmcfFfxm Stageout directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/CMOODmCiE3wnXk5IKnL2N00mABFKDmABFKDmztFKDmABFKDmcfFfxm Session directory URL: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/CMOODmCiE3wnXk5IKnL2N00mABFKDmABFKDmztFKDmABFKDmcfFfxm Status of 4 jobs was queried, 4 jobs returned information ```
marianheil commented 4 years ago

After further testing this seems like arc is just acting weird. I can also get the error message with successful jobs e.g.:

$ arcstat -j /scratch/mheil/tst_grid/jobs.dat gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/q2KKDmKwZ3wnXk5IKnL2N00mABFKDmABFKDmwaHKDmABFKDmrEEEzm
Job: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/q2KKDmKwZ3wnXk5IKnL2N00mABFKDmABFKDmwaHKDmABFKDmrEEEzm
 Name: EXAMPLE_RUN_155
 State: Finished
 Exit Code: 0

Status of 1 jobs was queried, 1 jobs returned information
$ arccat -e -j /scratch/mheil/tst_grid/jobs.dat gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/q2KKDmKwZ3wnXk5IKnL2N00mABFKDmABFKDmwaHKDmABFKDmrEEEzm
Path "./simplerun.py" does not seem to exist

And, even worse, I can get jobs that are marked as failed but produce output:

$ pyHepGrid man runcard_example.py -B -s -j16 -fl fail -ip --error
> Using header file pyHepGrid.headers.template_header.py
Sourcing runcard
Value set: runcard_example      arcbase         : ../../../../../../scratch/mheil/tst_grid/jobs.dat
Value set: runcard_example      baseSeed        : 2
Value set: runcard_example      copy_log        : True
Value set: runcard_example      dbname          : ../../../../../../scratch/mheil/tst_grid/database
   WARNING: exampleDir defined in runcard_example.py but not pyHepGrid.headers.template_header.py.
> Be very careful if you're trying to override attributes that don't exist elsewhere.
> Or even if they do.
Value set: runcard_example      exampleDir      : .
Value set: runcard_example      executable_exe  : executable_example.sh
Value set: runcard_example      executable_src_dir : .
   WARNING: grid_executable defined in runcard_example.py but not pyHepGrid.headers.template_header.py.
> Be very careful if you're trying to override attributes that don't exist elsewhere.
> Or even if they do.
Value set: runcard_example      grid_executable : example/executable.tar.gz
Value set: runcard_example      grid_input_dir  : example/input
Value set: runcard_example      grid_output_dir : example/output
Value set: runcard_example      jobName         : EXAMPLE_RUN
Value set: runcard_example      producRun       : 500
Value set: runcard_example      provided_warmup_dir : .
Value set: runcard_example      runcardDir      : .
Value set: runcard_example      runfile         : simplerun.py
Value set: runcard_example      runmode         : backend_example.ExampleProgram
> Overriding run mode to <class 'backend_example.ExampleProgram'>
> Arc Production
   WARNING: Applying job status filter. Please ensure you have run stats directly before this command to update job statuses.
> Job status filter: fail
[16]  dummy_folder        : config     Done:  496  Waiting:    0  Running:    0  Failed:    4  Missing:    0  Total:  500
> Retrieving information for job 16: dummy_folder (Production) [1/1]
Job: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/9OCMDmHwZ3wnXk5IKnL2N00mABFKDmABFKDmIGHKDmABFKDmYVN0Zm
 Name: EXAMPLE_RUN_153
 State: Failed
 Job Error: LRMS error: (1) Job failed

Job: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/724MDmIwZ3wnXk5IKnL2N00mABFKDmABFKDmTSHKDmABFKDmHQMBRn
 Name: EXAMPLE_RUN_154
 State: Failed
 Exit Code: 0
 Job Error: LRMS error: (1) Job failed

Job: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/QVVLDmLwZ3wnXk5IKnL2N00mABFKDmABFKDm4oHKDmABFKDmlRh0Tn
 Name: EXAMPLE_RUN_156
 State: Failed
 Exit Code: 0
 Job Error: LRMS error: (1) Job failed

Job: gsiftp://ce1.dur.scotgrid.ac.uk:2811/jobs/8X6MDmMwZ3wnXk5IKnL2N00mABFKDmABFKDme4HKDmABFKDmnbVPWo
 Name: EXAMPLE_RUN_157
 State: Failed
 Exit Code: 0
 Job Error: LRMS error: (1) Job failed

Status of 4 jobs was queried, 4 jobs returned information
> Printing information for job 16: dummy_folder (Production) [1/1]
Path "./simplerun.py" does not seem to exist
Path "./simplerun.py" does not seem to exist
Path "./simplerun.py" does not seem to exist
Path "./simplerun.py" does not seem to exist
>
$ gfal-ls xroot://se01.dur.scotgrid.ac.uk/dpm/dur.scotgrid.ac.uk/home/pheno/mheil/example/output/output-dummy_folder-config-154.tar.gz -l
-r--------   1 196100011 196100011       626 Jun 11 11:48 xroot://se01.dur.scotgrid.ac.uk/dpm/dur.scotgrid.ac.uk/home/pheno/mheil/example/output/output-dummy_folder-config-154.tar.gz

(note: the name always includes the seed)

Of the 500 jobs I submitted, 5 showed Path "./simplerun.py" does not seem to exist and 4 were reported as failed, but only one (the first one that failed, seed 153) did not actually run and produce output.

Basically we can't trust the output from arc, and instead have to check the results explicitly (maybe something for #34). Considering that this only affects 1/500 jobs without an obvious solution (see discussion of #80), I suggest not spending more time on this and keeping it as a "known issue" until we actually know why this is happening.
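A rough sketch of what checking the results explicitly could look like: build the expected output URL for each seed and ask the storage element whether it exists. All function names here are hypothetical. The file name pattern is inferred from the single `gfal-ls` listing above, and the check assumes `gfal-ls` exits with a non-zero status when the file is missing.

```python
import subprocess
from typing import Callable, Iterable, List


def output_url(base_url: str, folder: str, runcard: str, seed: int) -> str:
    """Expected output location, following the pattern seen in the listing above."""
    return f"{base_url}/output-{folder}-{runcard}-{seed}.tar.gz"


def gfal_exists(url: str) -> bool:
    """True if `gfal-ls` succeeds for `url` (assumes non-zero exit when missing)."""
    result = subprocess.run(
        ["gfal-ls", url],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_seeds(
    base_url: str,
    folder: str,
    runcard: str,
    seeds: Iterable[int],
    exists: Callable[[str], bool] = gfal_exists,
) -> List[int]:
    """Return the seeds whose output file cannot be found."""
    return [
        seed
        for seed in seeds
        if not exists(output_url(base_url, folder, runcard, seed))
    ]
```

This only tells us that an output tarball exists, not that its contents are valid, so it would complement the ARC status rather than replace it.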

jcwhitehead commented 4 years ago

I'm totally puzzled by that - whichever ARC file the Path X does not seem to exist message comes from, it should be followed immediately by exit 1 (although we're several versions of ARC behind master, so that might not be correct).

How is it still running?