simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Non-interactive nodes have no HOME #4

Closed ickc closed 5 months ago

ickc commented 1 year ago

Currently, an interactive node has a HOME, pointing to the scratch directory. But non-interactive nodes, such as those used by jobs sent to the vanilla or parallel universe, have no HOME defined. Defining one manually seems to be overridden inside subprocesses (perhaps some system-level rc scripts unset it?).

This breaks some scripts that assume the presence of HOME, e.g. mamba.

This in turn makes it difficult to submit jobs that work as CI (Continuous Integration) or perform other routine compilation work (recall that the login node cannot be used to compile things, as it has no access to modules and no gcc compiler, for example).

Having an equivalent of export HOME=$_CONDOR_SCRATCH_DIR would be good enough for our purposes. It should not cause (any more) confusion, as the interactive nodes already behave this way.
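
As a sketch of what I mean, a job could wrap its payload in something like the following (the wrapper name and payload are hypothetical; $_CONDOR_SCRATCH_DIR is the per-job scratch directory HTCondor already exports):

```sh
#!/bin/bash
# with_home.sh (hypothetical wrapper): point HOME at the per-job scratch
# directory before running the real payload, mirroring what interactive
# jobs already see. Falls back to the current directory if the variable
# is somehow unset.
export HOME="${_CONDOR_SCRATCH_DIR:-$PWD}"
exec "$@"
```

Then e.g. ./with_home.sh mamba install ... would behave as it does on the interactive node.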

rwf14f commented 1 year ago

Do you have more information on your test jobs? I can't reproduce this. When I submit a simple non-interactive "echo $HOME" script to either universe, it prints the location of my actual home directory (not the HTCondor scratch directory, as it does for interactive jobs).

ickc commented 1 year ago

@rwf14f, I'm guessing you have getenv = true in your ClassAd, which passes the environment of the submission node to the job?
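
For reference, a hypothetical minimal submit description for such a test (file names are illustrative):

```
# Hypothetical minimal vanilla-universe test submitting a one-line
# "echo $HOME" script. If getenv = true is set, HTCondor copies the
# submit node's environment (including HOME) into the job, which would
# mask the missing-HOME behaviour described above.
universe   = vanilla
executable = echo_home.sh
getenv     = true
output     = echo_home.out
error      = echo_home.err
log        = echo_home.log
queue
```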

ickc commented 11 months ago

@rwf14f, did you get a chance to look into this? Thanks.

ickc commented 6 months ago

@rwf14f, has there been a recent change related to this? I'm a bit confused about what I get today:

On wn3805341.tier2.hep.manchester.ac.uk, it seems I can access /home/$USER and even have persistent storage there (i.e. files persist across jobs). It seems some sort of mdraid is set up, where md2 maps to /home.
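
(A quick way to verify this kind of layout from inside a job would be something like the following; these are just standard commands, nothing site-specific:)

```sh
# Inspect the software RAID layout and the /home mount from inside a job.
cat /proc/mdstat        # lists md devices, e.g. md2
df -h /home             # size and usage of /home
findmnt /home           # which block device backs /home
```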

rwf14f commented 6 months ago

No change there, this has always been the case. The pool user accounts for the grid jobs are created under /scratch, but yours are in /home because we currently use the same account creation mechanism as for our local admin accounts. Files in there persist across jobs, but space is limited (/home has less than 100G). This is not a shared file system though; it's local to the machine, so files you put there are only available on that machine and not on any of the others. This will go away when we set up a different authorisation/authentication mechanism from what we currently use, so don't rely on it.

ickc commented 6 months ago

Thanks. That's a bit surprising. I think the main problem is unpredictability. Either of these would be predictable behaviors: HOME always pointing to the per-job scratch directory (non-persistent), or HOME always pointing to a persistent location such as /home/$USER.

But the current situation is that, once a job is submitted (say with no particular constraint on hostname), what can be expected from HOME is effectively undefined, which means extra logic may be needed whenever HOME is used.
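
For instance, a minimal defensive sketch (my own illustration, not something the site currently recommends):

```sh
# Only trust HOME if it is set and writable; otherwise fall back to the
# per-job scratch directory so that tools like mamba have somewhere to write.
if [ -z "${HOME:-}" ] || [ ! -w "$HOME" ]; then
    export HOME="${_CONDOR_SCRATCH_DIR:-$PWD}"
fi
```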

This behavior (HOME being persistent per node) is not what I observed in the past either, which is possibly related to when those compute nodes were updated, or to which nodes they are. What I observed, including on at least a subset of the nodes currently (possibly related to the interactive flag), is that HOME is the same as the scratch directory, which is (I guess?) guaranteed to be non-persistent across jobs.