pygridtools / drmaa-python

Python wrapper around the C DRMAA library.

issue with lsf-drmaa #14

Open ink1 opened 10 years ago

ink1 commented 10 years ago

Hello, could you think of a reason why drmaa-python and lsf-drmaa (https://github.com/PlatformLSF/lsf-drmaa) would segfault on RH 6.4 but not on openSUSE 12.1? Trying some of the examples, I can see that a job gets submitted (and run), but Python segfaults. Thanks

dan-blanchard commented 10 years ago

Nothing comes to mind, but you could always try using faulthandler to see if you can get more debugging information.
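For readers following along, here is a minimal sketch of enabling faulthandler at the top of a script. It assumes Python 3.3+, where faulthandler is in the standard library; on Python 2 it has to be installed as a separate backport package. The placement in the script is only a suggestion.

```python
import faulthandler

# Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL so that a
# Python-level traceback is dumped to stderr when the process crashes.
faulthandler.enable(all_threads=True)

# From here on, a crash inside a C library called through ctypes should print
# the Python stack that led to the call before the process dies.
print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without editing the script by running `python -X faulthandler script.py` or setting the `PYTHONFAULTHANDLER` environment variable.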

ink1 commented 10 years ago

Thanks, will try. I could only trace it to the c call in drmaa/helpers.py.

ink1 commented 10 years ago

I'm afraid I can't get any further with faulthandler than I already got with Python tracing:

login1 examples> ./test.sh 
Call to main on line 25 of example4my.py from line 48 of example4my.py
Call to __init__ on line 233 of /drmaa-0.7.6/drmaa/session.py from line 29 of example4my.py
Call to initialize on line 237 of /drmaa-0.7.6/drmaa/session.py from line 30 of example4my.py
Call to py_drmaa_init on line 70 of /drmaa-0.7.6/drmaa/wrappers.py from line 257 of /drmaa-0.7.6/drmaa/session.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 73 of /drmaa-0.7.6/drmaa/wrappers.py
Creating job template
Call to createJobTemplate on line 274 of /drmaa-0.7.6/drmaa/session.py from line 32 of example4my.py
Call to __init__ on line 156 of /drmaa-0.7.6/drmaa/session.py from line 284 of /drmaa-0.7.6/drmaa/session.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 163 of /drmaa-0.7.6/drmaa/session.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 299 of /drmaa-0.7.6/drmaa/helpers.py
Call to __set__ on line 147 of /drmaa-0.7.6/drmaa/helpers.py from line 33 of example4my.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 154 of /drmaa-0.7.6/drmaa/helpers.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 299 of /drmaa-0.7.6/drmaa/helpers.py
Call to __set__ on line 181 of /drmaa-0.7.6/drmaa/helpers.py from line 34 of example4my.py
Call to string_vector on line 302 of /drmaa-0.7.6/drmaa/helpers.py from line 183 of /drmaa-0.7.6/drmaa/helpers.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 183 of /drmaa-0.7.6/drmaa/helpers.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 299 of /drmaa-0.7.6/drmaa/helpers.py
Call to __set__ on line 147 of /drmaa-0.7.6/drmaa/helpers.py from line 35 of example4my.py
Call to to_drmaa on line 70 of /drmaa-0.7.6/drmaa/helpers.py from line 149 of /drmaa-0.7.6/drmaa/helpers.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 154 of /drmaa-0.7.6/drmaa/helpers.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 299 of /drmaa-0.7.6/drmaa/helpers.py
Call to runJob on line 301 of /drmaa-0.7.6/drmaa/session.py from line 37 of example4my.py
Call to create_string_buffer on line 52 of /usr/lib64/python2.6/ctypes/__init__.py from line 313 of /drmaa-0.7.6/drmaa/session.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 314 of /drmaa-0.7.6/drmaa/session.py
Job <124235> is submitted to queue <short>.
./test.sh: line 7: 29578 Segmentation fault      python example4my.py
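A call trace in the format shown above can be produced with `sys.settrace`. The following is a minimal sketch of that technique, not necessarily the script used in this thread; the functions `outer` and `inner` are hypothetical stand-ins for the DRMAA calls.

```python
import sys

calls = []  # collected trace lines, kept so they can be inspected afterwards


def trace_calls(frame, event, arg):
    """Global trace function: record every Python-level function call."""
    if event != "call":
        return None
    caller = frame.f_back
    if caller is None:
        return None
    line = "Call to %s on line %s of %s from line %s of %s" % (
        frame.f_code.co_name,
        frame.f_lineno,
        frame.f_code.co_filename,
        caller.f_lineno,
        caller.f_code.co_filename,
    )
    calls.append(line)
    print(line)
    return None  # no per-line tracing inside the called function


def inner():  # hypothetical stand-in for a wrapped C call
    return 1


def outer():  # hypothetical stand-in for runJob
    return inner()


sys.settrace(trace_calls)
outer()
sys.settrace(None)
```

Note that this only sees Python frames. Calls that cross into a C library through ctypes do not generate trace events, which is why the trace ends at the `c` helper.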

With faulthandler

login1 examples> ./test.sh 
Creating job template
Job <124254> is submitted to queue <short>.
Fatal Python error: Segmentation fault

Current thread 0x00002aca8d42dbe0:
  File "/drmaa-0.7.6/drmaa/helpers.py", line 299 in c
  File "/drmaa-0.7.6/drmaa/session.py", line 314 in runJob
  File "example4my.py", line 38 in main
  File "example4my.py", line 51 in <module>
./test.sh: line 14: 24698 Segmentation fault      python2.7 example4my.py

It is clear that runJob calls "c", which crashes when returning from a call to the API. It does not look like an API problem to me, since the c function is called a number of times prior to that. On the other hand, I can successfully use the DRMAA API using the C example provided with lsf-drmaa and yet another Java application. Puzzling.

dan-blanchard commented 10 years ago

The c function is just a little helper method for calling any of the C DRMAA functions, so the segfault is definitely when it's interacting with the C library.

I've never used LSF before (or even really heard much about it), but I think the problem is that the LSF DRMAA implementation seems to be expecting the job ID to be an integer, whereas we're passing it as a string. I'm basing this on the fact that the PyLSF library uses a long in their runJobRequest struct.

I just double-checked the DRMAA specifications, and they definitely say that job IDs should be strings, so this seems to be a mistake in the LSF DRMAA interface.

I would file an issue with the LSF people if you can, or try using PyLSF if you need Python bindings that work with LSF right away.

Although, did you say that this is working on openSUSE?

ink1 commented 10 years ago

Dan, thank you for looking into this. PyLSF is rather dated, while we are running LSF 9.1.2. IBM/PlatformLSF have recently renewed their support for DRMAA, which is validated for 9.1.2. They also released a Python API, but I'm trying to make Galaxy work and it needs Python DRMAA. I would not claim thorough testing but, yes, Python DRMAA seems to be working with LSF 9.1.2 on openSUSE 12.1 (Python 2.7).

LSF DRMAA specifies the job ID as a char array. For example, https://github.com/PlatformLSF/lsf-drmaa/blob/master/sample/sub.c#L17 sets the job_id length to

    #define MAX_LEN_JOBID 100

This should work similarly to your

    jid = create_string_buffer(128)
    c(drmaa_run_job, jid, sizeof(jid), jobTemplate)

in the runJob function.
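To illustrate the buffer convention discussed here, the sketch below shows how a ctypes string buffer and its size are handed to a function that fills in a job ID. `fake_run_job` is a hypothetical pure-Python stand-in for `drmaa_run_job`, so the example runs without any DRMAA library installed; the real function is implemented in C and also takes a job template.

```python
from ctypes import create_string_buffer, memmove, sizeof


def fake_run_job(job_id_buf, job_id_len):
    """Hypothetical stand-in for drmaa_run_job: write a job ID into the buffer.

    Returns 0 on success and 1 if the ID (plus its NUL terminator)
    does not fit into job_id_len bytes.
    """
    job_id = b"124235"  # example ID taken from the log above
    needed = len(job_id) + 1  # +1 for the terminating NUL byte
    if needed > job_id_len:
        return 1
    memmove(job_id_buf, job_id + b"\0", needed)
    return 0


# The caller owns the buffer and tells the callee how large it is.
jid = create_string_buffer(128)
rc = fake_run_job(jid, sizeof(jid))
print(rc, jid.value)
```

Because the caller passes the buffer size explicitly, a well-behaved callee should never write past it, whether the buffer is 100 or 128 bytes long.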

For reference, LSF drmaa.h is here https://github.com/PlatformLSF/lsf-drmaa/blob/master/drmaa_utils/drmaa_utils/drmaa.h

On 13 June 2014 18:23, Dan Blanchard notifications@github.com wrote:

The c function is just a little helper method for calling any of the C DRMAA functions, so the segfault is definitely when it's interacting with the C library.

I've never used LSF before (or even really heard much about it), but I think the problem is that the LSF DRMAA implementation seems to be expecting the job ID to be an integer, whereas we're passing it as a string. I'm basing this on the fact that the PyLSF library uses a long in their runJobRequest struct https://github.com/gmccance/pylsf/blob/master/pylsf/pylsf.pyx#L1401.

I just double-checked the DRMAA specifications http://www.ogf.org/documents/GFD.130.pdf, and they definitely say that job IDs should be strings, so this seems to be a mistake in the LSF DRMAA interface.

I would file an issue with the LSF people if you can, or try using PyLSF https://github.com/gmccance/pylsf/ if you need Python bindings that work with LSF right away.

Although, did you say that this is working on openSUSE?


dan-blanchard commented 10 years ago

I see that Galaxy has version 0.6 of our library set as what they require. Is that what you're using? I'm curious whether you see the same issue with 0.7.6.

I've also just submitted a PR to that project to update their version to 0.7.6 (or at least, I tried to but bitbucket is being very slow with the forking at the moment).

If LSF limits the JOBID length to 100, that might explain why you get a segfault (since we pass a string buffer of 128 characters). Although, that wouldn't explain why it works on openSUSE and not on Red Hat...

Maybe try modifying your locally installed copy of DRMAA Python to change the buffer there to be 100 characters and see if that works. If that works, I'll modify DRMAA Python to allow you to set an environment variable that controls how long the buffer can be.
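A rough sketch of what such an environment-variable override could look like is given below. The variable name `DRMAA_PYTHON_JOBID_BUFFER` and the helper `make_job_id_buffer` are hypothetical and are not part of drmaa-python; only the default of 128 comes from the thread.

```python
import os
from ctypes import create_string_buffer, sizeof

# Hypothetical environment variable name, chosen for illustration only.
JOBID_BUFFER_ENV = "DRMAA_PYTHON_JOBID_BUFFER"
DEFAULT_JOBID_BUFFER = 128  # the size runJob uses according to the thread


def make_job_id_buffer(environ=os.environ):
    """Create the job ID buffer, letting the environment override its size."""
    raw = environ.get(JOBID_BUFFER_ENV)
    size = DEFAULT_JOBID_BUFFER if raw is None else int(raw)
    if size < 2:
        raise ValueError("job ID buffer must hold at least one character")
    return create_string_buffer(size)


# Passing a dict instead of os.environ keeps the example side-effect free.
jid = make_job_id_buffer({JOBID_BUFFER_ENV: "100"})
print(sizeof(jid))
```

In a real run the variable would be exported in the shell before starting Python, and the resulting buffer would be passed to `drmaa_run_job` together with `sizeof(jid)`.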

ink1 commented 10 years ago

Yes, once I identified where the problem is with Galaxy I switched to debugging 0.7.6.

Indeed, LSF drmaa.h defines

    #define DRMAA_ERROR_STRING_BUFFER 4096
    #define DRMAA_JOBNAME_BUFFER 128
    #define DRMAA_SIGNAL_BUFFER 32

whereas drmaa/const.py sets

    ERROR_STRING_BUFFER = 1024
    JOBNAME_BUFFER = 1024
    SIGNAL_BUFFER = 32

so I've already tried changing the latter to

    ERROR_STRING_BUFFER = 4096
    JOBNAME_BUFFER = 128
    SIGNAL_BUFFER = 32

but no luck.

I've also tried reducing the string buffer in runJob in drmaa/session.py from 128 to 100

    jid = create_string_buffer(128)
    c(drmaa_run_job, jid, sizeof(jid), jobTemplate)

but that also did not help. I have re-built the LSF DRMAA library with --enable-debug in addition to Python tracing and observed

< cut >
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 314 of /drmaa-0.7.6/drmaa/session.py
t #261c [     0.00] -> drmaa_run_job(jt=0xc44e00)
t #261d [     0.01] -> fsd_job_set_get(job_id=124287)
t #261d [     0.01] <- fsd_job_set_get(job_id=124287) =NULL
t #261d [     0.01] -> fsd_job_set_get(job_id=124289)
t #261d [     0.01] <- fsd_job_set_get(job_id=124289) =NULL
t #261d [     0.01] -> fsd_job_set_get(job_id=124290)
t #261d [     0.01] <- fsd_job_set_get(job_id=124290) =NULL
t #261d [     0.01] -> fsd_job_set_get(job_id=124291)
t #261d [     0.01] <- fsd_job_set_get(job_id=124291) =NULL
t #261d [     0.01] <- lsfdrmaa_session_update_all_jobs_status
d #261c [     0.01]  |   command: "exec" "hostname"
d #261c [     0.01]  |   numProcessors: 0
d #261c [     0.01]  |   maxNumProcessors: 0
d #261c [     0.01]  |   errFile: /dev/null
d #261c [     0.01]  |   jsdlDoc: (null)
Job <124292> is submitted to queue <short>.
d #261c [     0.02]  * lsb_submit( 0xe030e0, 0x7fff5dd05390 ) = 124292[0]
t #261c [     0.02] -> fsd_job_new(124292)
t #261c [     0.02] <- fsd_job_new=0xe061e0: ref_cnt=1 [lock 124292]
t #261c [     0.02] -> fsd_job_set_add(job=0xe061e0, job_id=124292)
t #261c [     0.02] <- fsd_job_set_add: job->ref_cnt=2
t #261c [     0.02] -> fsd_job_release(0xe061e0={job_id=124292, ref_cnt=2}) [unlock 124292]
t #261c [     0.02] <- fsd_job_release
./test.sh: line 16:  9756 Segmentation fault      python2.7 example4my.py

Running the sample C code from LSF DRMAA produces the following

< cut >
Job <124296> is submitted to queue <normal>.
d #4357 [     0.02]  * lsb_submit( 0x1076c00, 0x7fffda923240 ) = 124296[0]
t #4357 [     0.02] -> fsd_job_new(124296)
t #4357 [     0.02] <- fsd_job_new=0x1079b80: ref_cnt=1 [lock 124296]
t #4357 [     0.02] -> fsd_job_set_add(job=0x1079b80, job_id=124296)
t #4357 [     0.02] <- fsd_job_set_add: job->ref_cnt=2
t #4357 [     0.02] -> fsd_job_release(0x1079b80={job_id=124296, ref_cnt=2}) [unlock 124296]
t #4357 [     0.02] <- fsd_job_release
t #4357 [     0.02] <- drmaa_run_job =0: job_id=124296
< cut >

So in the Python run, the line following fsd_job_release should have been the return from drmaa_run_job, as it is in the C run, but it never appears. This means it is likely that the segfault happens on the return to Python.

ink1 commented 10 years ago

Tested Python DRMAA on another cluster - SLES 11 SP1, python 2.6. Job submission works even without the changes to the constants. Still not clear what exactly makes the crucial difference.

jakirkham commented 6 years ago

Have been using a recent copy of lsf-drmaa and drmaa-python without issues. So maybe this was fixed at some point in one of them?

zihhuafang commented 4 years ago

I am having an issue with string formatting when using lsf-drmaa (1.1.1) and drmaa-python (0.7.9). I ran the following test script to see what might cause the issue:

    #!/usr/bin/env python

    import drmaa

    def main():
        """ Query the system. """
        with drmaa.Session() as s:
            print('A DRMAA object was created')
            print('Supported contact strings: %s' % s.contact)
            print('Supported DRM systems: %s' % s.drmsInfo)
            print('Supported DRMAA implementations: %s' % s.drmaaImplementation)
            print('Version %s' % s.version)

        print('Exiting')

    if __name__=='__main__':
        main()

I got the following error message:

    A DRMAA object was created
    Supported contact strings:
    Supported DRM systems: IBM Spectrum LSF 10.1
    Supported DRMAA implementations: FedStage DRMAA for LSF 1.1.1
    Traceback (most recent call last):
      File "./test.py", line 17, in <module>
        main()
      File "./test.py", line 12, in main
        print('Version %s' % s.version)
    TypeError: not all arguments converted during string formatting

Does anyone have an idea how to fix it?
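This error message is what Python raises when the right-hand side of `%` is a tuple with more elements than there are placeholders. The sketch below assumes, consistent with the traceback but not confirmed in this thread, that `s.version` is a tuple-like value such as a (major, minor) pair. `Version` here is a hypothetical stand-in and does not import drmaa.

```python
from collections import namedtuple

# Hypothetical stand-in for whatever s.version returns; the assumption is
# only that it behaves like a two-element tuple.
Version = namedtuple("Version", "major minor")
version = Version(1, 0)

# With a tuple on the right-hand side, % treats each element as a separate
# argument, so one %s placeholder and two elements do not match.
try:
    print('Version %s' % version)
    failed = False
except TypeError as exc:
    failed = True
    print('TypeError:', exc)

# Wrapping the value in a one-element tuple formats it as a single argument.
wrapped = 'Version %s' % (version,)

# Formatting the fields explicitly avoids the ambiguity altogether.
explicit = 'Version %s.%s' % (version.major, version.minor)
print(explicit)
```

If the assumption holds, changing the last print in the test script to `print('Version %s' % (s.version,))` or using `str.format` would be one way around the error.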