grisu48 / gridengine

Son of Grid engine
https://arc.liv.ac.uk/trac/SGE

Getting failure to stat directory for stdout #5

Open toomanycats opened 3 weeks ago

toomanycats commented 3 weeks ago

I'm tracking down a very obscure error where about 30% of submitted jobs go into the Eqw state. The error is always the same:

error reason          1:      08/19/2024 14:10:31 [1730373583:9092]: can't stat() "/grid_test" as stdout_path: Permission denied KRB5CCNAME=none uid=xxx gid=xxx 101 600  xxx  xxx xxx

We thought this was due to using a brand new storage appliance. However, when permissions are set wide open, there's no change in behavior. I've captured the NFS traffic and analyzed it in Wireshark, and I don't see any FSSTAT calls failing.

I'm wondering whether the SGE daemon creates the stdout and stderr files in the SGE root directory and the client then copies them out?

Any ideas are appreciated.

grisu48 commented 3 weeks ago

My first guess is that this is caused by a MAC (mandatory access control) solution. Can you check whether AppArmor or SELinux is the culprit, e.g. by disabling whichever one is active and seeing whether the error disappears?

toomanycats commented 3 weeks ago

That's a good idea, but it didn't help. I set SELinux to permissive mode, rebooted, and received the same error. This new storage is a cluster, so I was hoping that might work.

What do you think about this function: sge_filecmp in source/libs/uti/sge_io.c, line 166?

/****** uti/io/sge_filecmp() **************************************************
*  NAME
*     sge_filecmp() -- Compare two files
*
*  SYNOPSIS
*     int sge_filecmp(const char *name0, const char *name1)
*
*  FUNCTION
*     Compare two files. They are equal if:
*        - both of them have the same name
*        - if a stat() succeeds for both files and
*          i-node/device-id are equal
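
For reference, the comparison that header comment describes could be sketched roughly as below. This is not the SGE source: the name `same_file_sketch`, the return convention (1 for "same file", 0 otherwise), and the reading that either condition alone is sufficient are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper, NOT the SGE implementation of sge_filecmp().
 * Returns 1 if both paths are considered the same file, 0 otherwise. */
int same_file_sketch(const char *name0, const char *name1)
{
    struct stat st0, st1;

    assert(name0 != NULL && name1 != NULL);

    /* Condition 1: identical names count as the same file. */
    if (strcmp(name0, name1) == 0) {
        return 1;
    }

    /* Condition 2: stat() has to succeed for both paths ... */
    if (stat(name0, &st0) != 0 || stat(name1, &st1) != 0) {
        return 0;
    }

    /* ... and inode number and device id have to match. */
    return st0.st_ino == st1.st_ino && st0.st_dev == st1.st_dev;
}
```

In this sketch a failing stat() only makes the two paths compare as different; any errno it sets, such as EACCES, is not reported to the caller.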

grisu48 commented 3 weeks ago

Not sure, but given that the error message explicitly says Permission denied, I would assume the error is somewhere in the file system permissions.