ltnetcase opened 8 years ago
Our storage expert says that there may exist some process in the 'bpipe' workflow that deletes and recreates a single file, which leads to the "stale file handle" error on our NFS storage server.
@ssadedin would you please point me to the process in your workflow that deletes and recreates files, or would you kindly fix this issue to make 'bpipe' more robust?
I've been trying to think of any operation Bpipe performs that deletes / rewrites files, and I can't think of anything. What is happening at the error point above is that Bpipe has just created the job directory (on the login node) and then tries to write a file into the new directory (also from the login node). The directory itself should be uniquely named, so there's really no opportunity for any funky delete / rewrite scenario. Perhaps we could make it more robust if Bpipe first wrote the file into an existing directory and then moved it to the desired directory.
This kind of problem strikes me as being similar to other mysterious problems I've had when working with distributed file systems that don't achieve perfect coherence. That is, the file system is load balanced across multiple nodes, the directory creation happens on one, but the file write happens on a different node. If you are unlucky, the write happens before the two nodes synchronise. This is all complete guesswork, but if that's the case, simply introducing a delay or a retry could solve the problem.
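To make that concrete, here is a minimal Groovy sketch of the "write into an existing directory, then move it" idea combined with a short retry loop. This is not Bpipe's actual code; the directory and file names below are made up purely for illustration.

```groovy
import java.nio.file.*

// Hypothetical paths, for illustration only.
File jobDir  = new File("jobs/job_0001")        // freshly created job directory
File cmdFile = new File(jobDir, "cmd.sh")       // file we ultimately want in it

// Write the file into a directory that already exists on this NFS client...
File staging = new File("jobs/.staging/cmd_0001.sh")
staging.parentFile.mkdirs()
staging.text = "#!/bin/bash\necho hello\n"

// ...then move it into the new job directory, retrying with a short delay
// in case the new directory hasn't propagated to this client yet.
jobDir.mkdirs()
int attempts = 0
while (true) {
    try {
        Files.move(staging.toPath(), cmdFile.toPath(), StandardCopyOption.REPLACE_EXISTING)
        break
    } catch (IOException e) {   // a stale handle typically surfaces as an IOException
        if (++attempts >= 5) throw e
        Thread.sleep(2000)      // wait a couple of seconds before retrying
    }
}
```

The move step matters because a rename within the same filesystem is atomic, so downstream readers never see a half-written file; the retry just covers the window before this client sees the new directory.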
Thanks @ssadedin, you're right, maybe that's the case. Directory creation on NFS using Groovy's methods may sometimes not propagate fast enough to allow a new file to be created under it immediately, or it may depend on parallel execution, which increases the NFS client load. However, we tried creating those tmp dirs before bpipe started, and recent tests showed that this worked. We haven't tested whether adding a delay between directory creation and file creation, maybe 2 seconds, also solves the problem. It might be a good idea to make bpipe more robust here.
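Roughly, the pre-creation workaround is something like the following Groovy sketch, run once before bpipe is launched (just an illustration; the paths and job count are made up, not our real layout):

```groovy
// Hypothetical: pre-create every job directory so the NFS client has already
// seen them by the time bpipe tries to write files into them.
int nJobs = 8                                   // assumed number of parallel jobs
(1..nJobs).each { i ->
    new File("jobs/job_${String.format('%04d', i)}").mkdirs()
}
```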
I have no issues with bpipe running on NFS v4.1 filesystems for what it's worth.
@ssadedin I use the bpipe-0.9.9 tar ball, and have changed line 2 of the file 'templates/executor/sge-command.template.sh' from '\$ -wd ${bpipe.Runner.canonicalRunDirectory}' to '#\$ -cwd', because the former did not work correctly.
And now things work as expected in our Oracle Grid Engine environment, except for rare stale file handle errors. We use bpipe on a shared NFS disk mounted on all compute nodes.
Is it possible that bpipe writes or operates on some temp files from different compute nodes? I haven't gotten around to looking into the code, so this is just a guess, because the same pipeline works correctly all the time without the grid engine environment. What do you think?
Thanks. Here's the error log of the problem job: