hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

GMTK jobs report "bad interpreter: Text file busy" #77

Closed: EricR86 closed this issue 7 years ago

EricR86 commented 8 years ago

Original report (BitBucket issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Sometime after commit cffef3e60aad4348017875682c6033232f3da48d, I have been consistently getting the error "bad interpreter: Text file busy".

Specifically, the error output from the submitted job reads: "/usr/bin/env: ... segway-wrapper.sh: ... gmtk-job-uuid.sh: bad interpreter: Text file busy".

Notably, all the jobs were running on the same machine. The issue seems somewhat difficult to reproduce.

It has been noted on the mailing list that someone has supposedly managed to reproduce it locally on a non-NFS drive. I cannot reproduce it on a non-NFS drive.

EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).


Job script file creation:

#!python

        with open(job_script_filename, "w") as job_script_file:
            print >>job_script_file, "#!/usr/bin/env bash"
            # this doesn't include use of segway-wrapper, which takes the
            # memory usage as an argument, and may be run multiple times
            self.log_cmdline(gmtk_cmdline, args, job_script_file)

        # set permissions for script to run
        chmod(job_script_filename, JOB_SCRIPT_FILE_PERMISSIONS)

With the with statement, the file resource should be closed automatically. The current theory is that, during job script file creation, chmod is slow and/or non-atomic. A possible alternative is creating the file as presented in this example?
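
The example linked there is not reproduced here, but one such alternative (an illustrative sketch, not Segway's code; the helper name is hypothetical) is to write the script to a temporary file, set its permissions, and atomically rename it into place:

#!python

import os

JOB_SCRIPT_FILE_PERMISSIONS = 0o755  # illustrative value


def write_job_script_atomically(job_script_filename, cmdline):
    """Write the job script to a temp file, then rename it into place."""
    tmp_filename = job_script_filename + ".tmp"
    with open(tmp_filename, "w") as tmp_file:
        tmp_file.write("#!/usr/bin/env bash\n")
        tmp_file.write(" ".join(cmdline) + "\n")
        tmp_file.flush()
        os.fsync(tmp_file.fileno())  # push file data out before the rename
    os.chmod(tmp_filename, JOB_SCRIPT_FILE_PERMISSIONS)
    # rename() is atomic within a filesystem, so anything executing the script
    # only ever sees a fully written, fully chmod'ed file
    os.rename(tmp_filename, job_script_filename)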

I am currently still unable to reproduce this, but will keep trying.

EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).


Can now confirm that I was able to reproduce this issue.

Have tried using fchmod and fsync on the file handle and am still unable to resolve the issue.
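
For reference, the attempted change was roughly along these lines (a sketch rather than the exact patch; job_script_filename and JOB_SCRIPT_FILE_PERMISSIONS are the names from the snippet above):

#!python

import os

with open(job_script_filename, "w") as job_script_file:
    job_script_file.write("#!/usr/bin/env bash\n")
    # ... write the GMTK command line here, as in the original snippet ...
    os.fchmod(job_script_file.fileno(), JOB_SCRIPT_FILE_PERMISSIONS)  # set mode on the open handle
    job_script_file.flush()
    os.fsync(job_script_file.fileno())  # ask for data/metadata to be committed before close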

EDIT: possibly related? https://github.com/PacificBiosciences/FALCON/issues/269 describes issues with NFS caching (sync vs. async).

EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).


The issue is most likely caused by NFS 3's default asynchronous behavior:

> The only thing I can think of is the fact that we change the permissions on the file and writing out the metadata doesn't actually complete by the time the separate process starts. When digging through NFS docs I found that by default this is asynchronous behaviour:
>
> "This default permits the server to reply to client requests as soon as it has processed the request and handed it off to the local file system, without waiting for the data to be written to stable storage."
>
> We've tried to fsync the file after changing permissions and before using it but we also found the following:
>
> "Finally, note that, for NFS version 3 protocol requests, a subsequent commit request from the NFS client at file close time, or at fsync() time, will force the server to write any previously unwritten data/metadata to the disk, and the server will not reply to the client until this has been completed, as long as sync behavior is followed. If async is used, the commit is essentially a no-op, since the server once again lies to the client, telling the client that the data has been sent to stable storage".
>
> If our scripts are indeed being stored on NFS then I can see why we can't really work around this issue.

mordor is indeed running on NFS 3. Apparently there are plans to eventually migrate to NFS 4.

Some things to note: we might not get the error when running on the full cluster because of the time required to queue a job (rather than rapid-fire submission on a single node). One suggested solution is to write only a single 'template' shell script and fill it in as needed for each job, i.e. queue the template script plus the arguments to fill in.
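
A rough sketch of that template idea (names and the submit() call are hypothetical, not Segway's actual API): the wrapper script is written and chmod'ed exactly once, so per-job script creation, and the race around it, goes away.

#!python

import os

# A single generic script that simply runs whatever command line it is given.
TEMPLATE_SCRIPT = """#!/usr/bin/env bash
exec "$@"
"""


def write_template_once(template_filename):
    # Written and chmod'ed once at startup, well before any job runs,
    # so no freshly written executable is ever handed to the scheduler.
    with open(template_filename, "w") as template_file:
        template_file.write(TEMPLATE_SCRIPT)
    os.chmod(template_filename, 0o755)

# Each job would then be queued as the template plus its arguments, e.g.
# submit([template_filename] + gmtk_cmdline)   # submit() is hypothetical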

Users running locally will not get this error. It is difficult to reproduce even on a single node.

EricR86 commented 8 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


Suggestion 1: open with os.O_SYNC. Something like:

#!python

JOB_SCRIPT_FILE_OPEN_FLAGS = os.O_WRONLY | os.O_CREAT | os.O_SYNC | os.O_TRUNC

job_script_fd = os.open(job_script_filename, JOB_SCRIPT_FILE_OPEN_FLAGS, JOB_SCRIPT_FILE_PERMISSIONS)
with os.fdopen(job_script_fd, "w") as job_script_file:
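    # ...then write the bash shebang line and the GMTK command line to
    # job_script_file, exactly as in the earlier snippet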

This way you can also set the mode at open time, instead of calling fchmod() on the file while it is open, which is a bit odd.

Suggestion 2: segway-wrapper can copy the file to its temporary directory (which we should document must be on local storage) and then run it from there.
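
In Python terms (the real segway-wrapper is a shell script, so this only sketches the copy-then-run idea; the function name and tmpdir handling are assumptions):

#!python

import os
import shutil
import subprocess
import tempfile


def run_job_script_locally(job_script_filename, args):
    # Copy the script from (possibly NFS-backed) storage to a local temporary
    # directory, then execute the local copy so the NFS copy is never the
    # "busy" text file being executed.
    local_dir = tempfile.mkdtemp()  # assumed to live on local storage
    try:
        local_script = os.path.join(local_dir, os.path.basename(job_script_filename))
        shutil.copy(job_script_filename, local_script)
        os.chmod(local_script, 0o755)
        return subprocess.call([local_script] + list(args))
    finally:
        shutil.rmtree(local_dir)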

EricR86 commented 8 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


Use standard coding practices with regard to imports if you implement something like that.

EricR86 commented 8 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Another option is to avoid opening the file containing the job command for writing if the file already exists. There is already a mechanism in Segway for resubmitting jobs after odd machine/environment issues, and I feel this is a good case for it. Avoiding a rewrite of the job file might avoid the "Text file busy" error.
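
A minimal sketch of that guard, reusing the names from the earlier snippet (the resubmission machinery itself is Segway's existing mechanism and is not shown):

#!python

import os

# Only write (and chmod) the job script if it does not already exist;
# on resubmission the previously written script is reused untouched,
# so the executable file is never rewritten while it may still be "busy".
if not os.path.exists(job_script_filename):
    with open(job_script_filename, "w") as job_script_file:
        job_script_file.write("#!/usr/bin/env bash\n")
        # ... write the GMTK command line as before ...
    os.chmod(job_script_filename, JOB_SCRIPT_FILE_PERMISSIONS)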

EricR86 commented 8 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


EricR86 commented 8 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


Suggestion 3: Potentially run the command in segway-wrapper using "bash" in front of the file to be run.
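
The intent of Suggestion 3 is that bash reads the script as ordinary input instead of the kernel exec'ing the script file, so the "Text file busy" (ETXTBSY) check on that file should not apply. A sketch of the invocation (Python for consistency with the other snippets; the path is illustrative):

#!python

import subprocess

job_script_filename = "gmtk-job-uuid.sh"  # illustrative path

# Executing the script directly requires exec'ing the file, which is what
# can fail with "Text file busy":
# subprocess.call([job_script_filename])

# Prefixing with bash makes the script an argument that bash merely reads:
subprocess.call(["bash", job_script_filename])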

Also recommended to open the file with flags instead of writing it out after opening. -@ericr86

EricR86 commented 7 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


Another backup idea: find a way to poll whether the file is busy within segway-wrapper. Try "Suggestion 1" and "Suggestion 3" above first.
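
One hedged sketch of such a poll (illustrative only; it keys off the observed error text rather than any Segway API, and the retry parameters are arbitrary):

#!python

import subprocess
import time


def call_with_busy_retry(cmd, max_tries=10, delay=1.0):
    """Re-run cmd while its stderr reports the 'Text file busy' condition."""
    for _ in range(max_tries):
        proc = subprocess.Popen(cmd, stderr=subprocess.PIPE)
        _, stderr = proc.communicate()
        if proc.returncode == 0 or b"Text file busy" not in stderr:
            break
        time.sleep(delay)  # wait for the NFS metadata to settle, then retry
    return proc.returncode, stderr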

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


It's worth noting that subprocess.Popen is not thread safe and may be the source of all of these issues. Notably there's a chance that multiple threads can end up having the same process ID.

See: http://pythonhosted.org/psutil/#popen-class and http://stackoverflow.com/questions/21194380/is-subprocess-popen-not-thread-safe

It might be worth investigating using a more thread-safe option such as psutil or a backported Python 3 implementation of subprocess (https://pypi.python.org/pypi/subprocess32/).
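
If thread safety is indeed the culprit, one low-risk pattern is to prefer the backport when it is installed (a sketch; subprocess32 is the package linked above):

#!python

# Prefer the Python 3 backport of subprocess on Python 2, falling back to the
# standard library module if the backport is not installed.
try:
    import subprocess32 as subprocess
except ImportError:
    import subprocess

# Call sites stay unchanged, e.g.:
# subprocess.Popen(cmd, stdout=log_file, stderr=log_file)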

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Also from my investigation of this issue:

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


"Suggestion 1" seems to fix the problem locally - I cannot reproduce.

"Suggestion 1" does not seem to fix the problem running on NFS.

EricR86 commented 7 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


And Suggestion 3? [edited to make clearer distinction from Suggestion 1]

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Fixed in PR #71
