hashdist / hashdist

The HashDist environment management system
https://hashdist.github.io/

OSError: [Errno 39] Directory not empty #113

Open certik opened 11 years ago

certik commented 11 years ago
[ERROR] Uncaught exception:
Traceback (most recent call last):
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/_hashdist/hashdist/cli/main.py", line 136, in help_on_exceptions
    return func(*args, **kw)
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/builder/builder.py", line 311, in main
    profile_aid, imports = ctx.build_all(packages, 'profile')
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/builder/builder.py", line 64, in build_all
    dep_spec, imports = visit(root_name)
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/builder/builder.py", line 56, in visit
    spec, _ = visit(dep)
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/builder/builder.py", line 58, in visit
    artifact_id = build_package(self, pkg, imports)
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/builder/builder.py", line 121, in build_package
    artifact_id, dir = ctx.build_store.ensure_present(buildspec, ctx.config, keep_build='error')
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/_hashdist/hashdist/core/build_store.py", line 413, in ensure_present
    artifact_dir = builder.build(config, keep_build)
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/_hashdist/hashdist/core/build_store.py", line 510, in build
    self.build_to(artifact_dir, config, keep_build)
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/_hashdist/hashdist/core/build_store.py", line 542, in build_to
    self.build_store.remove_build_dir(build_dir)
  File "/auto/nest/nest/u/ondrej/repos/python-hpcmp2/_hashdist/hashdist/core/build_store.py", line 495, in remove_build_dir
    shutil.rmtree(build_dir)
  File "/usr/lib/python2.7/shutil.py", line 254, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python2.7/shutil.py", line 252, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/auto/netscratch/ondrej/bld/libtool-n-6thc'

This exception has not been translated to a human-friendly error message,
please file an issue at https://github.com/hashdist/hashdist/issues pasting
this stack trace.
dagss commented 11 years ago

This is Python's shutil.rmtree failing to remove a directory. It is either a bug in shutil, something to do with permissions (does the build remove write permissions for a file?), or the network filesystem acting weirdly. My guess is the last of these, and if so, it sounds like something we can't fix in Hashdist.

Is it reproducible? Could you try to shutil.rmtree the directory? What about rm -r (no -f)? What about rm -rf?

certik commented 11 years ago

I think NFS sometimes leaves files around temporarily. So we should try to use rm -rf or something similarly robust.

certik commented 11 years ago

So this error seems to be a known bug/feature of shutil.rmtree, see here:

http://code.activestate.com/lists/python-list/159050/

where they have exactly the same problem (note the comment that the problem started to occur after they switched to NFS).

The error is intermittent: it happens once in a while. It just happened after a long build of a package, and now I have to build it again, so this bug is extremely annoying. I'll see if I can fix it using the Linux rm -rf; specifically, I am testing the following patch:

diff --git a/hashdist/core/build_store.py b/hashdist/core/build_store.py
index 522e7fb..a945a1d 100644
--- a/hashdist/core/build_store.py
+++ b/hashdist/core/build_store.py
@@ -492,7 +492,7 @@ class BuildStore(object):

     def remove_build_dir(self, build_dir):
         self.logger.debug('Removing build dir: %s' % build_dir)
-        shutil.rmtree(build_dir)
+        os.system("rm -rf %s" % build_dir)

 class ArtifactBuilder(object):
     def __init__(self, build_store, build_spec, virtuals):
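
A note on the patch above: interpolating build_dir into a shell string breaks on paths containing spaces or shell metacharacters. A minimal sketch of the same idea without a shell follows; the helper name `rm_rf` is hypothetical and not part of Hashdist, and it assumes a UNIX-like system with `rm` on the PATH.

```python
import subprocess


def rm_rf(path):
    """Remove `path` recursively using the system `rm`, without a shell.

    Hypothetical helper (not Hashdist code). Returns rm's exit status.
    """
    # An argument list is passed straight to the program, so spaces or
    # shell metacharacters in `path` are never interpreted by a shell.
    # "--" ends option parsing, so a path starting with "-" is not
    # mistaken for a flag.
    return subprocess.call(["rm", "-rf", "--", path])
```

Because of -f, rm reports success for a path that does not exist, so a zero exit status only means that nothing is left at `path`.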
certik commented 11 years ago

Also relevant: http://stackoverflow.com/questions/11228079/python-remove-directory-error-file-exists. They say that this NFS behavior can't easily be fixed. So the conclusion is that Hashdist must not fail like this.

dagss commented 11 years ago

It says it can fail due to us holding a file descriptor open. Perhaps you could poke around and see if there are any file descriptors we should close (lsof could help, or reading the code...)
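
One way to do that check from inside the process, as a rough Linux-only sketch: list the descriptors under /proc/self/fd and keep those that point below the build directory. The function name `open_files_under` is made up for illustration and is not a Hashdist API.

```python
import os


def open_files_under(directory):
    """Return paths this process currently holds open below `directory`.

    Linux-only sketch: relies on /proc/self/fd and returns [] where
    that directory does not exist.
    """
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return []
    # Resolve symlinks so the prefix matches what /proc reports.
    prefix = os.path.realpath(directory) + os.sep
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith(prefix):
            found.append(target)
    return found
```

This only sees descriptors held by the calling process. A file kept open by another process (for example a tail -f in a second terminal) would not show up here; lsof is the tool for that case.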

If we don't have any descriptors open, we could attempt sleeping and re-running shutil.rmtree a couple of times in a loop (shrug).
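
The sleep-and-retry idea could look roughly like this. It is a minimal sketch, not Hashdist's actual code; the name `rmtree_with_retries` and the default `attempts` and `delay` values are assumptions picked for illustration.

```python
import shutil
import time


def rmtree_with_retries(path, attempts=5, delay=0.5):
    """Call shutil.rmtree(path), retrying a few times on OSError.

    Hypothetical helper: sleeps `delay` seconds between attempts and
    re-raises the last error once `attempts` tries have failed.
    """
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError:
            if attempt == attempts - 1:
                # Out of retries: let the caller see the real error.
                raise
            # Give the filesystem time to drop lingering entries.
            time.sleep(delay)
```

Retrying only helps if the leftover entries disappear on their own. If something keeps a file open for the whole time, every attempt fails and the original OSError still reaches the caller.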

ahmadia commented 11 years ago

Looks like the easybuild folks did the loop thing. https://github.com/hpcugent/easybuild-framework/pull/353

I'm thinking about what our robust options are. Using rm -rf is not portable to non-UNIX systems, so I'd prefer to handle this within Python if possible. I'll post a PR to try out in a few minutes.

ahmadia commented 11 years ago

The OpenStack folks also do a loop, but also explicitly check for stale NFS files. Let's fix it with a sleep-loop for now and come back to this later if it's still a problem.
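
For context on the stale-file check: when a file is deleted while still open, NFS clients rename it to a placeholder whose name starts with `.nfs`, and that placeholder is what keeps the directory non-empty. A minimal sketch of such a check is below. How OpenStack actually implements it is not shown in this thread, and `find_nfs_placeholders` is a hypothetical name.

```python
import os


def find_nfs_placeholders(directory):
    """Return paths of `.nfs*` placeholder files below `directory`.

    Hypothetical helper: assumes the NFS client names its placeholders
    for deleted-but-still-open files with a ".nfs" prefix.
    """
    found = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            if name.startswith(".nfs"):
                found.append(os.path.join(root, name))
    return found
```

A non-empty result after a failed rmtree suggests that some process still has a file in the build directory open, so waiting and retrying, or closing that file, is more likely to help than forcing the removal.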

certik commented 11 years ago

+1

ahmadia commented 11 years ago

@certik - Please check if this solves this.

certik commented 9 years ago

So I got hit by this again:

certik@ml-fey2:~/repos/hashstack(moonlight)$ hit build -j8  
/yellow/users/certik/repos/hashdist/hashdist/formats/marked_yaml.py:72: DeprecationWarning: object.__init__() takes no parameters
  cls.__init__(self, x)
[cmake] Building cmake/7j6vg4fc4ohx, follow log with:
[cmake]   tail -f /panfs/scratch/avol8/certik/h/tmp/cmake-7j6vg4fc4ohx-1/build.log
[CRITICAL] Uncaught exception:
[CRITICAL] Traceback (most recent call last):
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/cli/main.py", line 202, in help_on_exceptions
[CRITICAL]     return func(*args, **kw)
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/cli/main.py", line 174, in command_line_entry_point
[CRITICAL]     retcode = args.subcommand_handler(ctx, args)
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/cli/frontend_cli.py", line 51, in run
[CRITICAL]     self.profile_builder_action()
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/cli/frontend_cli.py", line 108, in profile_builder_action
[CRITICAL]     self.args.k, self.args.debug)
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/spec/builder.py", line 150, in build
[CRITICAL]     keep_build=keep_build, debug=debug)
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/core/build_store.py", line 379, in ensure_present
[CRITICAL]     artifact_dir = builder.build(config, keep_build)
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/core/build_store.py", line 554, in build
[CRITICAL]     self.build_to(artifact_dir, config, keep_build)
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/core/build_store.py", line 586, in build_to
[CRITICAL]     self.build_store.remove_build_dir(build_dir)
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/core/build_store.py", line 427, in remove_build_dir
[CRITICAL]     robust_rmtree(build_dir, self.logger)
[CRITICAL]   File "/yellow/users/certik/repos/hashdist/hashdist/core/fileutils.py", line 89, in robust_rmtree
[CRITICAL]     shutil.rmtree(path)
[CRITICAL]   File "/var/lib/perceus/vnfs/asc-fe/rootfs/usr/lib64/python2.6/shutil.py", line 221, in rmtree
[CRITICAL]     onerror(os.rmdir, path, sys.exc_info())
[CRITICAL]   File "/var/lib/perceus/vnfs/asc-fe/rootfs/usr/lib64/python2.6/shutil.py", line 219, in rmtree
[CRITICAL]     os.rmdir(path)
[CRITICAL] OSError: [Errno 39] Directory not empty: '/panfs/scratch/avol8/certik/h/tmp/cmake-7j6vg4fc4ohx-1'
[CRITICAL] This exception has not been translated to a human-friendly error message,
[CRITICAL] please file an issue at https://github.com/hashdist/hashdist/issues pasting
[CRITICAL] this stack trace.
certik commented 9 years ago

I bet it has something to do with the logger: https://github.com/hashdist/hashdist/blob/a1ee86476a7c3c533e47f40cf0a39e516cd9ed6c/hashdist/core/fileutils.py#L82, as I didn't see any warning message. I need to turn off tail -f in the separate terminal, otherwise it will fail to install a perfectly fine package (which, BTW, took forever to install thanks to slow NFS). Thanks, Hashdist.