Closed EricR86 closed 7 years ago
Original comment by Rachel Chan (Bitbucket: rcwchan).
Job script file creation:
#!python
with open(job_script_filename, "w") as job_script_file:
    print >>job_script_file, "#!/usr/bin/env bash"

    # this doesn't include use of segway-wrapper, which takes the
    # memory usage as an argument, and may be run multiple times
    self.log_cmdline(gmtk_cmdline, args, job_script_file)

# set permissions for script to run
chmod(job_script_filename, JOB_SCRIPT_FILE_PERMISSIONS)
With the `with` statement, the file resource should be closed automatically, so the current theory is that during job script file creation, `chmod` is slow and/or non-atomic. A possible alternative is to create the file as presented in this example.
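One hedged sketch of such an alternative, assuming the script contents are available as a string (the function name and the permission constant's value here are illustrative, not Segway's actual code): write the script to a temporary file in the same directory, commit it to stable storage, and then `rename()` it into place, since POSIX `rename()` is atomic within a single filesystem.

```python
import os
import tempfile

JOB_SCRIPT_FILE_PERMISSIONS = 0o755  # assumed value for illustration

def write_job_script(job_script_filename, contents):
    """Write the job script to a temporary file in the same directory,
    sync it, then atomically rename() it into place, so the script
    never exists at its final path in a partially written state."""
    dirname = os.path.dirname(os.path.abspath(job_script_filename))
    tmp_fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(tmp_fd, "w") as tmp_file:
            tmp_file.write(contents)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push data out before renaming
        os.chmod(tmp_path, JOB_SCRIPT_FILE_PERMISSIONS)
        os.rename(tmp_path, job_script_filename)  # atomic within one fs
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Whether this helps against NFS metadata caching is uncertain, but it at least removes the window where the file exists with incomplete contents or permissions.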
I am currently still unable to reproduce this, but will keep trying.
Original comment by Rachel Chan (Bitbucket: rcwchan).
Can now confirm that I was able to reproduce this issue.
I have tried using `fchmod` and `fsync` on the file handle and am still unable to resolve the issue.
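For reference, that attempted workaround looked roughly like this (a sketch; the exact Segway code differs, and the permission constant's value is an assumed one):

```python
import os

JOB_SCRIPT_FILE_PERMISSIONS = 0o755  # assumed value for illustration

def write_job_script_synced(job_script_filename):
    """Write the job script, setting permissions on the open file
    descriptor (fchmod) and forcing the kernel to commit data and
    metadata (fsync) before the file is closed."""
    with open(job_script_filename, "w") as job_script_file:
        job_script_file.write("#!/usr/bin/env bash\n")
        # set permissions on the open descriptor instead of the path
        os.fchmod(job_script_file.fileno(), JOB_SCRIPT_FILE_PERMISSIONS)
        # flush Python's buffer, then ask the kernel to write through
        job_script_file.flush()
        os.fsync(job_script_file.fileno())
```

On NFS 3 with `async` exports, the commit triggered by `fsync` is essentially a no-op (see the quoted docs below in this thread), which would explain why this did not help.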
EDIT: possibly related? https://github.com/PacificBiosciences/FALCON/issues/269 (issues with NFS caching, sync vs. async)
Original comment by Rachel Chan (Bitbucket: rcwchan).
The issue is most likely caused by NFS 3's default asynchronous behavior:
> The only thing I can think of is the fact that we change the
> permissions on the file and writing out the metadata doesn't actually
> complete by the time the separate process starts. When digging through
> NFS docs I found that by default this is asynchronous behaviour:
>
> " This default permits the server to reply to client requests as soon
> as it has processed the request and handed it off to the local file
> system, without waiting for the data to be written to stable storage."
>
> We've tried to fsync the file after changing permissions and before
> using it but we also found the following:
>
> "Finally, note that, for NFS version 3 protocol requests, a
> subsequent commit request from the NFS client at file close time, or
> at fsync() time, will force the server to write any previously
> unwritten data/metadata to the disk, and the server will not reply to
> the client until this has been completed, as long as sync behavior is
> followed. If async is used, the commit is essentially a no-op, since
> the server once again lies to the client, telling the client that the
> data has been sent to stable storage".
>
> If our scripts are indeed being stored on NFS then I can see why we
> can't really work around this issue.
mordor is indeed running on NFS 3. Apparently there are plans to eventually migrate to NFS 4.
Some things to note: we might not get the error when running on the full cluster because of the time required to queue a job (as opposed to rapid-fire submission on a single node). One suggested solution: write only a single 'template' shell script and fill in the template as necessary for each job, i.e. queue the template script plus the arguments to fill it in.
Users running locally will not get this error. It is difficult to reproduce even on a single node.
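A minimal sketch of that template idea (hypothetical; not Segway code): write one generic wrapper script exactly once, then queue it with per-job arguments, so no per-job script file is ever rewritten.

```python
import os

# a single generic template: "$@" carries the per-job command line,
# so the script file itself never needs to be rewritten per job
JOB_TEMPLATE = """#!/usr/bin/env bash
exec "$@"
"""

def write_job_template(template_filename):
    """Write the one-time template script and make it executable
    (0o755 is an assumed permission value)."""
    with open(template_filename, "w") as template_file:
        template_file.write(JOB_TEMPLATE)
    os.chmod(template_filename, 0o755)

# each job would then be queued as:
#     template.sh <gmtk command> <gmtk args...>
```

Because the template is written once, long before any job runs, the race between writing the file and exec()ing it disappears.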
Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).
Suggestion 1: open with `os.O_SYNC`. Something like:
#!python
JOB_SCRIPT_FILE_OPEN_FLAGS = os.O_WRONLY | os.O_CREAT | os.O_SYNC | os.O_TRUNC

job_script_fd = os.open(job_script_filename, JOB_SCRIPT_FILE_OPEN_FLAGS,
                        JOB_SCRIPT_FILE_PERMISSIONS)
with os.fdopen(job_script_fd, "w") as job_script_file:
    ...  # write out the job script as before
This way you can also set the mode at open time, instead of calling `fchmod()` while the file is open, which is a bit odd.
Suggestion 2: `segway-wrapper` can copy the file to its temporary directory (which we should document must be on local storage) and then run it from there.
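Suggestion 2 could be sketched as follows (function name and interface are hypothetical, not segway-wrapper's actual code): the copy is executed instead of the NFS-hosted original, so the exec never depends on NFS metadata having settled.

```python
import os
import shutil
import subprocess
import tempfile

def run_script_from_local_copy(script_path, args=()):
    """Copy the (possibly NFS-hosted) job script into a fresh local
    temporary directory and execute the copy, so the kernel never
    exec()s the NFS-cached original."""
    tmp_dir = tempfile.mkdtemp()  # assumed to live on local storage
    try:
        local_copy = os.path.join(tmp_dir, os.path.basename(script_path))
        shutil.copy(script_path, local_copy)  # preserves permission bits
        return subprocess.call([local_copy] + list(args))
    finally:
        shutil.rmtree(tmp_dir)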
Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).
Use standard coding practices with regard to imports if you implement something like that.
Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
Another option is to avoid opening the file containing the job command for writing if the file already exists. There is already a mechanism in Segway for job resubmission for odd machine/environment issues and I feel like this is a good case for it. Potentially avoiding a rewrite of the job file might avoid a "text file busy".
Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).
Suggestion 3: Potentially run the command in segway-wrapper using "bash" in front of the file to be run.
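The "bash in front" idea of Suggestion 3 can be sketched like this (an illustrative helper, not segway-wrapper's real interface): because bash only *reads* the script file, the kernel never maps it for execution, so a lingering write-open on it cannot trigger "Text file busy" (ETXTBSY).

```python
import subprocess

def run_job_script(job_script_filename, args=()):
    """Invoke the job script via an explicit bash invocation rather
    than exec()ing the script file directly; bash treats the file as
    input data, so the ETXTBSY check never applies to it."""
    return subprocess.call(["bash", job_script_filename] + list(args))
```

A side effect worth noting: the script does not even need the execute bit set, so the `chmod` timing problem goes away entirely.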
Also recommended to open the file with flags instead of writing it out after opening. -@ericr86
Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).
Another backup idea: find a way within `segway-wrapper` to poll whether the file is busy. Try "Suggestion 1" and "Suggestion 3" above first.
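That polling fallback might look like the following (an illustrative sketch, not Segway code): retry the direct execution while the kernel reports ETXTBSY, giving the NFS server time to commit the script's data and metadata.

```python
import errno
import subprocess
import time

def call_with_etxtbsy_retry(argv, max_tries=10, delay=0.5):
    """Run argv, retrying while the exec fails with ETXTBSY
    ("Text file busy"); re-raise any other error immediately."""
    for attempt in range(max_tries):
        try:
            return subprocess.call(argv)
        except OSError as err:
            # give up on anything that isn't ETXTBSY, or after the
            # final attempt
            if err.errno != errno.ETXTBSY or attempt == max_tries - 1:
                raise
            time.sleep(delay)
```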
Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
It's worth noting that `subprocess.Popen` is not thread-safe and may be the source of all of these issues. Notably, there's a chance that multiple threads can end up with the same process ID.
See: http://pythonhosted.org/psutil/#popen-class http://stackoverflow.com/questions/21194380/is-subprocess-popen-not-thread-safe
It might be worth investigating using a more thread-safe option such as psutil or a backported Python 3 implementation of subprocess (https://pypi.python.org/pypi/subprocess32/).
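The subprocess32 mitigation could be wired in with a guarded import like this (a sketch; subprocess32 is a separate PyPI package that may not be installed):

```python
# prefer the backported Python 3 subprocess implementation, which
# fixes known thread-safety problems in Popen; fall back to the
# standard library where the backport isn't installed
try:
    import subprocess32 as subprocess
except ImportError:
    import subprocess

retcode = subprocess.call(["true"])
```

Since the backport keeps the standard `subprocess` API, the rest of the code would not need to change.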
Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
Also from my investigation of this issue:
- Not re-writing the file on resubmission doesn't seem to fix the issue
- The files being submitted as a job work on their own when resubmitted after Segway is finished
- Can't reproduce locally or as a submitted job (on NFS) when running everything as a single thread and forcing SEGWAY_NUM_LOCAL_JOBS to 1
- Can reproduce locally with 10 threads and a maximum of 2 local jobs (SEGWAY_NUM_LOCAL_JOBS=2)
Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).
And Suggestion 3? [edited to make clearer distinction from Suggestion 1]
Original report (BitBucket issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
Sometime after commit cffef3e60aad4348017875682c6033232f3da48d, I have been consistently getting the error "bad interpreter: Text file busy"
Specifically it will read: "/usr/bin/env: ... segway-wrapper.sh: ... gmtk-job-uuid.sh: bad interpreter: Text file busy", output as an error from the submitted job.
Notably all the jobs were running on the same machine. It seems a little difficult to reproduce.
It has been noted on the mailing list that someone has supposedly managed to reproduce it locally on a non-NFS drive. I cannot reproduce it on a non-NFS drive.