Open certik opened 11 years ago
This is Python's shutil.rmtree failing to remove a directory. Either a bug in shutil, or something to do with permissions (the build removes write permissions for a file?), or the network filesystem acting weirdly. My guess is the latter... and if so, it sounds like something we can't fix in Hashdist.
Is it reproducible? Could you try to shutil.rmtree the directory? What about rm -r (no -f)? What about rm -rf?
I think NFS sometimes leaves files around temporarily. So we should try to use rm -rf or something similarly robust.
Sent from my mobile phone. (Reply by email to the comment by Dag Sverre Seljebotn of Sep 29, 2013, 4:11 AM.)
So this error seems to be a known bug/feature of shutil.rmtree, see here:
http://code.activestate.com/lists/python-list/159050/
where they have exactly the same problem (note the comment that the problem started to occur after they switched to NFS).
The error is randomly reproducible, i.e. it happens once in a while. It just happened after a long build of a package and I have to build it again, so this bug is extremely annoying. I'll see if I can fix it using the Linux rm -rf. Specifically, I am testing the following patch:
diff --git a/hashdist/core/build_store.py b/hashdist/core/build_store.py
index 522e7fb..a945a1d 100644
--- a/hashdist/core/build_store.py
+++ b/hashdist/core/build_store.py
@@ -492,7 +492,7 @@ class BuildStore(object):
     def remove_build_dir(self, build_dir):
         self.logger.debug('Removing build dir: %s' % build_dir)
-        shutil.rmtree(build_dir)
+        os.system("rm -rf %s" % build_dir)
 
 
 class ArtifactBuilder(object):
     def __init__(self, build_store, build_spec, virtuals):
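A side note on the patch above: os.system passes the string through a shell, so a build directory containing spaces or shell metacharacters would be split or interpreted. A minimal sketch of the same idea without a shell, assuming an external rm is available on the PATH (the helper name rm_rf is hypothetical, not part of Hashdist):

```python
import subprocess


def rm_rf(path):
    """Remove `path` recursively with the external `rm`; return rm's exit status.

    Hypothetical helper for illustration, not part of Hashdist.
    """
    # An argument list bypasses the shell, so spaces and shell
    # metacharacters in the path are passed to rm unchanged.
    # "--" ends option parsing in case the path starts with "-".
    return subprocess.call(["rm", "-rf", "--", path])
```

This is still UNIX-only, and a zero exit status only means rm reported no error.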
Also relevant: http://stackoverflow.com/questions/11228079/python-remove-directory-error-file-exists . They say that this feature of NFS can't be easily fixed. So the conclusion is that we can't fail like this in hashdist.
It says it can fail due to us holding a file descriptor open. Perhaps you could poke around and see if there are any file descriptors we should close (lsof could help, or reading the code...)
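As a rough complement to lsof, the current process can inspect its own descriptors. The sketch below is Linux-only and assumes /proc is mounted; the function name open_files_under is hypothetical. It only sees descriptors held by the Python process itself, not by other processes such as a tail -f in another terminal, which lsof would show.

```python
import os


def open_files_under(directory, fd_dir="/proc/self/fd"):
    """Return paths this process currently holds open below `directory`.

    Hypothetical helper; relies on the Linux /proc filesystem and
    returns an empty list when `fd_dir` does not exist.
    """
    root = os.path.realpath(directory)
    found = []
    if not os.path.isdir(fd_dir):
        return found
    for fd in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to whatever the descriptor refers to.
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target == root or target.startswith(root + os.sep):
            found.append(target)
    return found
```

Calling open_files_under(build_dir) just before the removal would show whether something like a still-open log file is a candidate culprit.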
If we don't have any descriptors open, we could attempt sleeping and re-running shutil.rmtree a couple of times in a loop (shrug).
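A minimal sketch of that sleep-and-retry idea. The name rmtree_with_retries and the attempts/delay defaults are made up for illustration, and this is not the code that ended up in Hashdist:

```python
import os
import shutil
import time


def rmtree_with_retries(path, attempts=5, delay=0.5, rmtree=shutil.rmtree):
    """Call rmtree(path), retrying on OSError up to `attempts` times.

    Hypothetical helper; `rmtree` is injectable so the retry logic
    can be exercised without a flaky filesystem.
    """
    for attempt in range(attempts):
        try:
            rmtree(path)
            return
        except OSError:
            # A failed attempt may still have removed everything.
            if not os.path.exists(path):
                return
            # Out of retries: let the last error propagate.
            if attempt == attempts - 1:
                raise
            # Give the filesystem time to drop lingering entries.
            time.sleep(delay)
```

Whether a fixed delay is long enough depends on the filesystem, so the numbers above are guesses, not recommendations.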
Looks like the easybuild folks did the loop thing. https://github.com/hpcugent/easybuild-framework/pull/353
I'm thinking about what our robust options are. Using rm -rf is not portable to non-UNIX systems, so I'd prefer to handle this within Python if possible. I'll post a PR to try out in a few minutes.
The OpenStack folks also do a loop, but also explicitly check for stale NFS files. Let's fix with a sleep-loop for now and come back to this later if it's still a problem.
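For reference, a check for stale NFS files could look roughly like the sketch below. It is not the OpenStack code. It assumes the NFS client uses the usual "silly rename" convention, where a file that is deleted while still open shows up as a placeholder whose name starts with .nfs. The function name find_nfs_placeholders is hypothetical.

```python
import os


def find_nfs_placeholders(directory):
    """Return sorted paths under `directory` whose file name starts with '.nfs'.

    Hypothetical helper; assumes the NFS client's '.nfsXXXX' naming
    convention for files deleted while still open.
    """
    leftovers = []
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            if name.startswith(".nfs"):
                leftovers.append(os.path.join(dirpath, name))
    return sorted(leftovers)
```

A non-empty result after a failed removal would suggest that some process still has a file in the build directory open.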
+1
@certik - Please check if this solves this.
So I got hit by this again:
certik@ml-fey2:~/repos/hashstack(moonlight)$ hit build -j8
/yellow/users/certik/repos/hashdist/hashdist/formats/marked_yaml.py:72: DeprecationWarning: object.__init__() takes no parameters
cls.__init__(self, x)
[cmake] Building cmake/7j6vg4fc4ohx, follow log with:
[cmake] tail -f /panfs/scratch/avol8/certik/h/tmp/cmake-7j6vg4fc4ohx-1/build.log
[CRITICAL] Uncaught exception:
[CRITICAL] Traceback (most recent call last):
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/cli/main.py", line 202, in help_on_exceptions
[CRITICAL] return func(*args, **kw)
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/cli/main.py", line 174, in command_line_entry_point
[CRITICAL] retcode = args.subcommand_handler(ctx, args)
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/cli/frontend_cli.py", line 51, in run
[CRITICAL] self.profile_builder_action()
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/cli/frontend_cli.py", line 108, in profile_builder_action
[CRITICAL] self.args.k, self.args.debug)
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/spec/builder.py", line 150, in build
[CRITICAL] keep_build=keep_build, debug=debug)
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/core/build_store.py", line 379, in ensure_present
[CRITICAL] artifact_dir = builder.build(config, keep_build)
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/core/build_store.py", line 554, in build
[CRITICAL] self.build_to(artifact_dir, config, keep_build)
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/core/build_store.py", line 586, in build_to
[CRITICAL] self.build_store.remove_build_dir(build_dir)
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/core/build_store.py", line 427, in remove_build_dir
[CRITICAL] robust_rmtree(build_dir, self.logger)
[CRITICAL] File "/yellow/users/certik/repos/hashdist/hashdist/core/fileutils.py", line 89, in robust_rmtree
[CRITICAL] shutil.rmtree(path)
[CRITICAL] File "/var/lib/perceus/vnfs/asc-fe/rootfs/usr/lib64/python2.6/shutil.py", line 221, in rmtree
[CRITICAL] onerror(os.rmdir, path, sys.exc_info())
[CRITICAL] File "/var/lib/perceus/vnfs/asc-fe/rootfs/usr/lib64/python2.6/shutil.py", line 219, in rmtree
[CRITICAL] os.rmdir(path)
[CRITICAL] OSError: [Errno 39] Directory not empty: '/panfs/scratch/avol8/certik/h/tmp/cmake-7j6vg4fc4ohx-1'
[CRITICAL] This exception has not been translated to a human-friendly error message,
[CRITICAL] please file an issue at https://github.com/hashdist/hashdist/issues pasting
[CRITICAL] this stack trace.
I bet it has something to do with the logger: https://github.com/hashdist/hashdist/blob/a1ee86476a7c3c533e47f40cf0a39e516cd9ed6c/hashdist/core/fileutils.py#L82, as I didn't see any message warning me that I need to turn off tail -f in the separate terminal; otherwise it will fail to install a perfectly fine package (that BTW took forever to install, thanks to slow NFS). Thanks, Hashdist.