saga-project / BigJob

SAGA-based Pilot-Job Implementation for Compute and Data
http://saga-project.github.com/BigJob/

What happens to my ComputeUnit? #156

Open andre-merzky opened 10 years ago

andre-merzky commented 10 years ago

I am submitting a ComputeUnit to a BJ pilot, i.e. via direct submission. It is likely that my ComputeUnitDescription is incorrect / incomplete, but then I would have expected an error. I get, however, a valid Unit instance, and it never enters 'FAILED' state -- in fact, it remains 'PENDING' forever. I don't see any traces in the bigjob log, nor in the agent logs -- the agent working dir remains empty (local agent via ssh). After printing the jd dict, I see no trace of the CU whatsoever.

How can I find out what happens to the CU, w/o using a debugger or sifting through redis? What is the correct way to get submission errors / runtime errors for CUs?

andre-merzky commented 10 years ago

PS.: I actually don't want to list the CU description here -- again, my hunch is that it is incorrect. But I don't want to find this out by guessing -- that is not an option when using BigJob as a Troy backend, I need to be able to identify and report an error...

melrom commented 10 years ago

Andre, when you say it is in the pending state forever, is this the BigJob output to stdout or what is obtained from using CU.get_state()? There is not a lot of robust error checking in this area (if any) - but without the CUD, I really don't know what is causing the problem, so I can't really fix it. In the ideal scenario, the CU would enter Failed state so you get some feedback. Also, out of curiosity, if you do not bind this directly, i.e. use ComputeDataService, is the result the same? I haven't really done any digging in the code on this yet. Further, if you kill your running job and check the agent file, do you get the information about the CU that you are looking for (either agent level or subjob level)? I know this is not the solution you're looking for - but I am trying to see if there is any info in the output.

andre-merzky commented 10 years ago

Hi Melissa,

I am calling cu.get_state(), and see cu.get_state() returning New forever. The only message BJ is printing in DEBUG mode is

10/24/2013 10:32:55 PM - bigjob - DEBUG - Get subjob state: bigjob:bj-6720d45a-3ceb-11e3-98f0-00231582da34:localhost:jobs:sj-6dd844c2-3ceb-11e3-98f0-00231582da34

which only seems to confirm that a status check is dispatched.

The CUD is

submitting CU:
{'Executable': '/usr/bin/touch', 'dtype': 2, 'number_of_processes': 1, 'Arguments': '/tmp/sinon_bj_touch', 'Error': '/tmp/bjstderr.txt', 'Output': '/tmp/bjstdout.txt', 'SPMDVariation': 'single'}

But again, the purpose of the ticket is not really to debug the problem -- but I want to know what I can do programmatically in Troy to find out if a Unit is still alive or not...

I did not try to submit via the ComputeDataService, so not sure if that would look different (we don't use that one in Troy).
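On the question of detecting a stuck unit programmatically: one client-side workaround is to poll `get_state()` with a deadline, so a unit that never leaves its initial state is reported instead of blocking forever. The sketch below is only that, a sketch. It assumes nothing about BigJob beyond the `get_state()` call mentioned in this thread; the terminal state names, the `wait_for_unit` helper and the two stand-in classes are hypothetical.

```python
import time

# Terminal state names are an assumption for this sketch; the thread only
# mentions 'New', 'PENDING' and 'FAILED' / 'Failed'.
TERMINAL_STATES = {"done", "failed", "canceled"}


def wait_for_unit(cu, timeout=60.0, interval=1.0):
    """Poll cu.get_state() until a terminal state is seen or timeout expires.

    Returns (last_state, timed_out). The caller decides what a timeout
    means, e.g. report the unit as lost and cancel it.
    """
    deadline = time.time() + timeout
    state = str(cu.get_state())
    while state.lower() not in TERMINAL_STATES:
        if time.time() >= deadline:
            return state, True
        time.sleep(interval)
        state = str(cu.get_state())
    return state, False


class StuckUnit(object):
    """Hypothetical stand-in for a ComputeUnit that never progresses."""

    def get_state(self):
        return "New"


class FinishingUnit(object):
    """Hypothetical stand-in that reports 'Done' on the third poll."""

    def __init__(self):
        self.calls = 0

    def get_state(self):
        self.calls += 1
        return "Done" if self.calls >= 3 else "Running"
```

A timeout does not say why the unit is stuck; it only turns "PENDING forever" into something the caller can report.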

I now saw an error in the agent log (not sure why I did not see that before?). The complete stderr log is

Traceback (most recent call last):
  File "/home/merzky/.bigjob/bigjob-bootstrap.py", line 1967, in <module>
    main()
  File "/home/merzky/.bigjob/bigjob-bootstrap.py", line 817, in main
    never_download=options.never_download)
  File "/home/merzky/.bigjob/bigjob-bootstrap.py", line 908, in create_environment
    site_packages=site_packages, clear=clear))
  File "/home/merzky/.bigjob/bigjob-bootstrap.py", line 1117, in install_python
    shutil.copyfile(executable, py_executable)
  File "/usr/lib/python2.7/shutil.py", line 83, in copyfile
    with open(dst, 'wb') as fdst:
IOError: [Errno 26] Text file busy: '/home/merzky/.bigjob/python/bin/python'
/bin/sh: 1: aprun: not found
/bin/sh: 1: ibrun: not found
/bin/sh: 1: srun: not found
Traceback (most recent call last):
  File "/home/merzky/.bigjob/python/lib/python2.7/site-packages/BigJob-0.50-py2.7.egg/bigjob/bigjob_agent.py", line 410, in execute_job
    executable = job_dict["Executable"]
KeyError: 'Executable'
Traceback (most recent call last):
  File "/home/merzky/.bigjob/python/lib/python2.7/site-packages/BigJob-0.50-py2.7.egg/bigjob/bigjob_agent.py", line 410, in execute_job
    executable = job_dict["Executable"]
KeyError: 'Executable'

Not sure what the 'busy python' message means -- but it seems to dislike the Description indeed. Not sure why -- see above, 'Executable' is defined. I tried to change the agent code to print the description at this point -- but since the agent is pulled from pypi, it seems to have no effect, or at least the print seems to be ignored no matter where and how I install.

But again, I don't actually want to fix this -- I need to find out how to handle errors like this... I feel like I am too deep down the rabbit hole again anyways... :/
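If the agent's stderr can be read back by the client (an assumption; the thread does not say where or how Troy would fetch it), errors like the ones above could at least be summarized automatically. The following is a minimal sketch that extracts the final exception line of each Python traceback from a block of text. `summarize_tracebacks` is a hypothetical helper, not part of BigJob, and the sample text is an abridged excerpt of the log quoted above.

```python
import re

# An exception summary line looks like "KeyError: 'Executable'": a name
# ending in Error or Exception, a colon, then the message.
_EXC_LINE = re.compile(r"^(\w*(?:Error|Exception)): (.*)$")


def summarize_tracebacks(stderr_text):
    """Return a list of (exception_name, message) tuples found in the text."""
    found = []
    in_traceback = False
    for line in stderr_text.splitlines():
        if line.startswith("Traceback (most recent call last):"):
            in_traceback = True
            continue
        # Frame lines are indented; the first unindented line ends the traceback.
        if in_traceback and not line.startswith(" "):
            match = _EXC_LINE.match(line)
            if match:
                found.append((match.group(1), match.group(2)))
            in_traceback = False
    return found


# Abridged excerpt of the agent stderr quoted in this thread.
SAMPLE = """Traceback (most recent call last):
  File "/usr/lib/python2.7/shutil.py", line 83, in copyfile
    with open(dst, 'wb') as fdst:
IOError: [Errno 26] Text file busy: '/home/merzky/.bigjob/python/bin/python'
/bin/sh: 1: aprun: not found
Traceback (most recent call last):
  File "/home/merzky/.bigjob/python/lib/python2.7/site-packages/BigJob-0.50-py2.7.egg/bigjob/bigjob_agent.py", line 410, in execute_job
    executable = job_dict["Executable"]
KeyError: 'Executable'
"""

errors = summarize_tracebacks(SAMPLE)
```

This only recognizes exception names ending in `Error` or `Exception`, and it cannot tell which unit a traceback belongs to.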

melrom commented 10 years ago

Welp, while I understand your point, I think that is a point for @drelu to comment on, because I am not 100% sure how you can query it other than get_state()

The error on the other hand is that the dictionary keys are lowercase, and you're trying to use an uppercase key. PS is dtype something you added?

Old-school Dictionary Style:

        compute_unit_description = {
            "executable": "/bin/echo",
            "arguments": ["Hello", "$ENV1", "$ENV2"],
            "environment": {"ENV1": "env_arg1", "ENV2": "env_arg2"},
            "number_of_processes": 1,
            #"spmd_variation":"single",
            "output": "stdout.txt",
            "error": "stderr.txt",
        }

New-school Variable Style:

        task_desc = pilot.ComputeUnitDescription()
        task_desc.executable = '/bin/echo'
        task_desc.arguments = ['I am task number $TASK_NO', ]
        task_desc.environment = {'TASK_NO': i}
        task_desc.number_of_processes = 1
        task_desc.output = 'simple-ensemble-stdout.txt'
        task_desc.error = 'simple-ensemble-stderr.txt'

andre-merzky commented 10 years ago

Ah, so bigjob is using Executable internally, but executable on the API -- that makes sense. :D
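A key-case mismatch like this could be caught on the client before submission. The sketch below checks a dictionary-style description against the lowercase keys shown in this thread. The key set is taken from the example above and may be incomplete, treating `executable` as required is an assumption based on the agent log, and `check_description` is a hypothetical helper rather than a BigJob function.

```python
# Keys taken from the dictionary example in this thread; the real
# ComputeUnitDescription may accept more, so treat this set as an assumption.
KNOWN_KEYS = {
    "executable", "arguments", "environment", "number_of_processes",
    "spmd_variation", "output", "error",
}

# Map "spmdvariation" -> "spmd_variation" etc. to suggest the right spelling.
_NORMALIZED = dict((k.replace("_", ""), k) for k in KNOWN_KEYS)


def check_description(desc):
    """Return a list of human-readable problems found in a description dict."""
    problems = []
    for key in sorted(desc):
        if key in KNOWN_KEYS:
            continue
        suggestion = _NORMALIZED.get(key.lower().replace("_", ""))
        if suggestion is not None:
            problems.append("key %r should be spelled %r" % (key, suggestion))
        else:
            problems.append("unknown key %r" % key)
    # Assumed required: the agent log in this thread fails without it.
    if "executable" not in desc:
        problems.append("missing required key 'executable'")
    return problems


# The description that was submitted earlier in this thread.
bad = {'Executable': '/usr/bin/touch', 'dtype': 2, 'number_of_processes': 1,
       'Arguments': '/tmp/sinon_bj_touch', 'Error': '/tmp/bjstderr.txt',
       'Output': '/tmp/bjstdout.txt', 'SPMDVariation': 'single'}
problems = check_description(bad)
```

Raising an exception when `problems` is non-empty would give the caller the submission-time error that was expected at the start of this ticket.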

That notwithstanding, let's wait for AL to comment...

Thanks! A.

melrom commented 10 years ago

AndreM: What is the status of this? You marked this as documentation issue (meaning it belongs to me), but afaik, we were waiting for AndreL to comment.

andre-merzky commented 10 years ago

Well, I fixed the CU description, so things work -- however, I think I still don't understand how errors are to be handled. Not sure if I care anymore at this stage though ;) So, feel free to close the ticket.

Thanks, Andre.