hautreux / auks

Kerberos credential support for batch environments

error: Failed to get current user environment variables #44

Closed PopiBrossard closed 2 years ago

PopiBrossard commented 4 years ago

Hi,

I'm trying to use auks with Slurm, but I can't make it work in one specific case. User home directories on my cluster are mounted with Kerberos security, which is why auks is needed here. For simple use, like srun ls ~, everything is fine: I can access my home using my Kerberos ticket.

But when using "--get-user-env" with sbatch, "_run_prolog" doesn't have access to the home. In other words, the command su - my_username on the node running the job doesn't work in the context of _run_prolog when it tries to execute the .bash_profile in my home. Some Slurm users need to load their environment this way. It feels like auks loads credentials only during the real job and not during "_run_prolog", even though they are required there.

I can see in slurm's logs:

_run_prolog: run job script took usec=31976
_run_prolog: prolog with lock for job 978965 ran for 0 seconds
error: Failed to get current user environment variables
error: _get_user_env: Unable to get user's local environment, running only with passed environment

With a pstree -p I can see something like this:

           |-slurmd(3437)-+-su(21166)---bash(21167)
           |              `-{slurmd}(21163)

Maybe it's because the su isn't launched by slurmstepd?

Do you know how to solve this issue? Is there a configuration parameter I missed?

Thanks

hautreux commented 4 years ago

Hello,

Interesting. A few years ago I had to propose a patch to Slurm to make the slurmstepd internal logic work properly with secured file systems. At that time things were not called in the right order, and the spank stack was called after the first user file accesses. The patch was accepted, and the slurmstepd logic now works properly with auks for traditional jobs.

I never had to deal with the get-user-env option of sbatch, so I have never encountered this issue. There is a way to insert some spank code in the prolog logic, but it seems (Google search only, no code review at this point) to be called after the prolog itself, so most probably after the get-user-env logic too. I am pretty sure that only adding code to the auks spank plugin to grab the ticket in the prolog will not be sufficient, and that's pretty much all I could do from that side. You should open a bug on SchedMD's Bugzilla and ask for their view on that.

In the meantime, I would recommend looking for a way to get the ticket using the auks CLI during the su phase. It should be possible using pam_exec and a script that grabs the ticket for the targeted user using the CLI. That's what I would try in order to work around the issue (or ask the users to load their environment variables themselves in their batch script when they need to :))

HTH
Matthieu
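
For readers who want to try the pam_exec idea, here is a minimal, untested sketch of what such a helper script might look like. Several things in it are assumptions rather than anything confirmed in this thread: the auks CLI flags ("-g" to get a credential, "-u" for the target uid, "-C" for the credential cache file) should be checked against the auks(1) man page, the cache path /tmp/krb5cc_<uid> assumes the default FILE cache location, and the function names are made up for illustration. The only part taken from pam_exec itself is that it exports PAM_USER and PAM_TYPE to the script.

```python
#!/usr/bin/env python3
"""Hypothetical pam_exec helper: fetch a user's ticket from auks when su opens a session."""
import os
import pwd
import subprocess
import sys


def build_auks_command(uid, ccache):
    # ASSUMPTION: "-g" gets a credential, "-u" selects the uid and "-C" the
    # credential cache file. Verify against the auks(1) man page before use.
    return ["auks", "-g", "-u", str(uid), "-C", ccache]


def main(environ):
    # pam_exec exports PAM_TYPE and PAM_USER to the program it runs.
    if environ.get("PAM_TYPE") != "open_session":
        return 0
    user = environ.get("PAM_USER")
    if not user:
        return 0
    try:
        entry = pwd.getpwnam(user)
    except KeyError:
        return 0
    if entry.pw_uid == 0:
        return 0  # root needs no ticket from auks
    # ASSUMPTION: the default credential cache is FILE:/tmp/krb5cc_<uid>.
    ccache = f"/tmp/krb5cc_{entry.pw_uid}"
    try:
        result = subprocess.run(build_auks_command(entry.pw_uid, ccache), check=False)
        if result.returncode == 0:
            # The script runs as root, so hand the cache file to the user.
            os.chown(ccache, entry.pw_uid, entry.pw_gid)
    except OSError:
        pass  # auks missing or chown failed: never block the session
    return 0


# Only act when started by pam_exec, which always sets PAM_TYPE.
if __name__ == "__main__" and "PAM_TYPE" in os.environ:
    sys.exit(main(os.environ))
```

It would be hooked into the PAM service used by su - with a line such as "session optional pam_exec.so /usr/local/sbin/auks_su_ticket.py" (hypothetical path; the service file name varies between distributions). The script always returns 0 so that a failure to fetch the ticket never prevents the session from opening.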

PopiBrossard commented 4 years ago

I've created a bug report here: https://bugs.schedmd.com/show_bug.cgi?id=9400

In the meantime, I'm gonna ask users to avoid using "--get-user-env".

If you're okay with it, I'm gonna leave this issue open until Slurm's dev team gives me a solution, so I can share it here in case anybody faces the same issue.

Thanks for your help and your advice.

PopiBrossard commented 4 years ago

I contacted Slurm's support team. In response to your second paragraph, they said "spank handler is called before prolog script and also before get-user-env logic".

I'm using Slurm 17.04 and auks-0.4.4. Slurm's team says that upgrading to Slurm 20.02 won't change anything about my issue or about Slurm's behaviour.

hautreux commented 2 years ago

OK, I did not remember the logic this way, but I have not looked at slurmd's internal action pipeline for a while.

Closing this, as I suppose you worked around it with your users.