Open andre-merzky opened 11 years ago
PS.: I actually don't want to list the CU description here -- again, my hunch is that it is incorrect. But I don't want to find this out by guessing -- that is not an option when using BigJob as Troy backend, I need to be able to identify and report an error...
Andre, when you say it is in the pending state forever, is this the BigJob output to stdout or what is obtained from using the CU.get_state()? There is not a lot of robust error checking in this area (if any) - but without the CUD, I really don't know what is causing the problem therefore I can't really fix it. In the ideal scenario, the CU would enter Failed state so you get some feedback. Also, out of curiosity, if you do not bind this directly, i.e. use ComputeDataService, is the result the same? I haven't really done any digging in the code on this yet. Further, if you kill your running job, and check the agent file, do you get the information about the CU for which you seek (either agent level or subjob level)? I know this is not the solution you're looking for - but I am trying to see if there is any info in the output.
Hi Melissa,
I am calling cu.get_state(), and see cu.get_state() returning New forever. The only message BJ is printing in DEBUG mode is
10/24/2013 10:32:55 PM - bigjob - DEBUG - Get subjob state: bigjob:bj-6720d45a-3ceb-11e3-98f0-00231582da34:localhost:jobs:sj-6dd844c2-3ceb-11e3-98f0-00231582da34
which only seems to confirm that a status check is dispatched.
The CUD is
submitting CU:
{'Executable': '/usr/bin/touch', 'dtype': 2, 'number_of_processes': 1, 'Arguments': '/tmp/sinon_bj_touch', 'Error': '/tmp/bjstderr.txt', 'Output': '/tmp/bjstdout.txt', 'SPMDVariation': 'single'}
But again, the purpose of the ticket is not really to debug the problem -- I want to know what I can do programmatically in Troy to find out if a Unit is still alive or not...
I did not try to submit via the ComputeDataService, so not sure if that would look different (we don't use that one in Troy).
I now saw an error in the agent log (not sure why I did not see that before?). The complete stderr log is
Traceback (most recent call last):
File "/home/merzky/.bigjob/bigjob-bootstrap.py", line 1967, in <module>
main()
File "/home/merzky/.bigjob/bigjob-bootstrap.py", line 817, in main
never_download=options.never_download)
File "/home/merzky/.bigjob/bigjob-bootstrap.py", line 908, in create_environment
site_packages=site_packages, clear=clear))
File "/home/merzky/.bigjob/bigjob-bootstrap.py", line 1117, in install_python
shutil.copyfile(executable, py_executable)
File "/usr/lib/python2.7/shutil.py", line 83, in copyfile
with open(dst, 'wb') as fdst:
IOError: [Errno 26] Text file busy: '/home/merzky/.bigjob/python/bin/python'
/bin/sh: 1: aprun: not found
/bin/sh: 1: ibrun: not found
/bin/sh: 1: srun: not found
Traceback (most recent call last):
File "/home/merzky/.bigjob/python/lib/python2.7/site-packages/BigJob-0.50-py2.7.egg/bigjob/bigjob_agent.py", line 410, in execute_job
executable = job_dict["Executable"]
KeyError: 'Executable'
Traceback (most recent call last):
File "/home/merzky/.bigjob/python/lib/python2.7/site-packages/BigJob-0.50-py2.7.egg/bigjob/bigjob_agent.py", line 410, in execute_job
executable = job_dict["Executable"]
KeyError: 'Executable'
Not sure what the 'busy python' message means -- but it does indeed seem to dislike the description. Not sure why -- see above, 'Executable' is defined. I tried to change the agent code to print the description at this point -- but since the agent is pulled from pypi, it seems to have no effect, or at least the print seems to be ignored no matter where and how I install.
But again, I don't actually want to fix this -- I need to find out how to handle errors like this... I feel like I am too deep down the rabbit hole again anyways... :/
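One way to get at errors like these without reading the log by hand is to scan the agent's stderr text for the final line of each traceback. The following is only a rough sketch: summarize_tracebacks is a hypothetical helper, not part of BigJob, and it assumes exceptions whose names end in Error or Exception and which carry a message.

```python
import re

# Matches the last line of a Python traceback, e.g. "KeyError: 'Executable'".
# Heuristic: only names ending in "Error" or "Exception" are recognised.
_EXC_LINE = re.compile(r"^([A-Za-z_][\w.]*(?:Error|Exception)): (.*)$")


def summarize_tracebacks(log_text):
    """Return (exception_name, message) pairs, one per traceback in log_text."""
    found = []
    in_traceback = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Traceback (most recent call last):"):
            in_traceback = True
            continue
        if in_traceback:
            match = _EXC_LINE.match(stripped)
            if match:
                found.append((match.group(1), match.group(2)))
                in_traceback = False
    return found
```

Run over the stderr log above, this would report one IOError and two KeyError entries, which a caller such as Troy could then attach to the affected unit.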
Welp, while I understand your point, I think that is a point for @drelu to comment on, because I am not 100% sure how you can query it other than get_state()
The error on the other hand is that the dictionary keys are lowercase, and you're trying to use an uppercase key. PS is dtype something you added?
Old-school Dictionary Style:
compute_unit_description = {
    "executable": "/bin/echo",
    "arguments": ["Hello", "$ENV1", "$ENV2"],
    "environment": {"ENV1": "env_arg1", "ENV2": "env_arg2"},
    "number_of_processes": 1,
    # "spmd_variation": "single",
    "output": "stdout.txt",
    "error": "stderr.txt",
}
New-school Variable Style:
task_desc = pilot.ComputeUnitDescription()
task_desc.executable = '/bin/echo'
task_desc.arguments = ['I am task number $TASK_NO', ]
task_desc.environment = {'TASK_NO': i}
task_desc.number_of_processes = 1
task_desc.output = 'simple-ensemble-stdout.txt'
task_desc.error = 'simple-ensemble-stderr.txt'
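Since a misspelled key is accepted silently at submission time, a small pre-submission check can catch it on the client side. This is a sketch only: check_description and KNOWN_KEYS are hypothetical names, the key set is taken from the dictionary example above and may not be the complete list BigJob accepts, and treating executable as required is an assumption based on the agent traceback.

```python
# Keys copied from the dictionary example in this thread (assumption:
# BigJob may accept more keys than these).
KNOWN_KEYS = {
    "executable", "arguments", "environment",
    "number_of_processes", "spmd_variation", "output", "error",
}


def check_description(desc, known_keys=KNOWN_KEYS):
    """Return a list of human-readable problems found in a description dict."""
    # Map "spmdvariation" -> "spmd_variation" etc., so that keys differing
    # only in case or underscores can be matched to their expected spelling.
    normalized = {k.replace("_", ""): k for k in known_keys}
    problems = []
    for key in desc:
        if key in known_keys:
            continue
        guess = normalized.get(key.lower().replace("_", ""))
        if guess is not None:
            problems.append("key %r should be spelled %r" % (key, guess))
        else:
            problems.append("unknown key %r" % key)
    # Assumption: the agent reads the executable unconditionally.
    if "executable" not in desc:
        problems.append("missing required key 'executable'")
    return problems
```

For the description posted earlier in this thread, the check would flag 'Executable', 'Arguments', 'Error', 'Output' and 'SPMDVariation' as misspelled, 'dtype' as unknown, and 'executable' as missing.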
Ah, so bigjob is using Executable internally, but executable on the API -- that makes sense. :D
That notwithstanding, let's wait for AL to comment...
Thanks! A.
AndreM: What is the status of this? You marked this as documentation issue (meaning it belongs to me), but afaik, we were waiting for AndreL to comment.
Well, I fixed the CU description, so things work -- however, I think I still don't understand how errors are to be handled. Not sure if I care anymore at this stage though ;) So, feel free to close the ticket.
Thanks, Andre.
I am submitting a ComputeUnit to a BJ pilot, i.e. via direct submission. It is likely that my ComputeUnitDescription is incorrect / incomplete, but then I would have expected an error. I get, however, a valid Unit instance, and it never enters 'FAILED' state -- in fact, it remains 'PENDING' forever. I don't see any traces in the bigjob log, nor in the agent logs -- the agent working dir remains empty (local agent via ssh). After printing the jd dict, I see no trace of the CU whatsoever.
How can I find out what happens to the CU, w/o using a debugger or sifting through redis? What is the correct way to get submission errors / runtime errors for CUs?
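Absent a dedicated error channel, one client-side workaround is to poll the unit's state with a deadline and treat a unit that never reaches a terminal state as stalled. The sketch below is not a BigJob feature: wait_for_unit and UnitStalled are hypothetical names, and the terminal state names "Done" and "Failed" are assumptions that should be adjusted to whatever the backend actually reports.

```python
import time


class UnitStalled(Exception):
    """Raised when a unit stays in a non-terminal state past the deadline."""


def wait_for_unit(get_state, timeout=60.0, interval=1.0,
                  terminal_states=("Done", "Failed"),
                  clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or timeout expires.

    get_state is any zero-argument callable, e.g. cu.get_state.
    terminal_states holds assumed names and is compared case-insensitively.
    """
    terminal = {s.lower() for s in terminal_states}
    deadline = clock() + timeout
    while True:
        state = get_state()
        if str(state).lower() in terminal:
            return state
        if clock() >= deadline:
            raise UnitStalled(
                "unit still in state %r after %.1fs" % (state, timeout))
        sleep(interval)
```

A caller would use it as wait_for_unit(cu.get_state, timeout=300) and handle UnitStalled by cancelling the unit and collecting the agent logs. This only detects that something went wrong, not what; the cause still has to come from the agent's stderr.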