mamba-org / mamba

The Fast Cross-Platform Package Manager
https://mamba.readthedocs.io
BSD 3-Clause "New" or "Revised" License
Package cache questions #2466

Open katringoogoo opened 1 year ago

katringoogoo commented 1 year ago

Hi! I'm working at an HPC center and we are currently trying to set up micromamba in a shared multi-user environment.

Since we have a lot of users with conda installations, we want to share as much as possible and were thinking about using a global package cache. Users creating a new environment would then a) not need to download the conda packages again and b) not need to install them into their home directories, since they could use links instead.

What works at the moment is to use the conda pkgs_dirs configuration variable and point all users to a non-writable package cache as the first entry. However, since users don't have the rights to write to the package cache location, the cache only fills if we first do administrative installs of, e.g., commonly used packages. So if a user installs a new package that we haven't downloaded previously, it ends up in their home directory again.

Is there anyone out there with a similar problem/setup who has an idea of how this could be solved?
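To make the setup above concrete, here is a minimal sketch in Python. It renders a `pkgs_dirs` list as it might appear in a `.condarc`, and reports which entry the current user could actually write to. The paths `/shared/conda/pkgs` and `~/.conda/pkgs` and both helper names are hypothetical, and the "first writable entry receives new downloads" rule is an assumption that matches the behaviour described in this post, not a statement about micromamba internals.

```python
import os
from pathlib import Path


def render_condarc(pkgs_dirs):
    """Render a pkgs_dirs list as .condarc-style YAML text."""
    lines = ["pkgs_dirs:"] + [f"  - {d}" for d in pkgs_dirs]
    return "\n".join(lines) + "\n"


def first_writable(pkgs_dirs):
    """Return the first existing directory the current user can write to, or None."""
    for d in pkgs_dirs:
        p = Path(d)
        if p.is_dir() and os.access(p, os.W_OK | os.X_OK):
            return p
    return None


if __name__ == "__main__":
    # Hypothetical layout: shared read-only cache first, per-user cache second.
    dirs = ["/shared/conda/pkgs", os.path.expanduser("~/.conda/pkgs")]
    print(render_condarc(dirs))
    print("new downloads would land in:", first_writable(dirs))
```

For an ordinary user, the shared directory is skipped by `first_writable`, which is the situation described above: anything not already in the shared cache goes to the per-user location.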

jonashaag commented 1 year ago

Did you research similar issues in the issue tracker?

jonashaag commented 1 year ago

I wonder what’s a safe way to fill that cache because you’d have to lock everyone out of access during filling of the cache.

I’ve seen examples of successful use of a proxy that does caching, so you’d not use Mamba’s caching feature but the one from an HTTP proxy specifically made to proxy Conda channels.

katringoogoo commented 1 year ago

I'm not that concerned about locking people out for a bit (environment setups don't happen that often), but ideally it would just use the lock mechanism in a central storage location, download the packages, update the cache, etc.

Thanks for the proxy suggestion. However, I also want to leverage the central storage location of the packages, so that user environments just need to link the packages in from the common location. Or do you think that would be possible with a proxy like this?

jonashaag commented 1 year ago

Do you mean hardlink/softlink?

katringoogoo commented 1 year ago

Yes exactly.
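For readers unfamiliar with the distinction, the following sketch shows the general "hard link first, symlink as fallback" idea in Python. It is only an illustration and does not claim to mirror how mamba links packages; `link_into_env` and the paths in the usage note are hypothetical. Hard links only work within a single filesystem, and some Linux configurations restrict hard-linking files the user does not own, so a fallback is needed either way.

```python
import os
from pathlib import Path


def link_into_env(cached_file, env_file):
    """Place cached_file at env_file via hard link, else symlink.

    Returns the method that was used ("hardlink" or "symlink").
    """
    cached_file, env_file = Path(cached_file), Path(env_file)
    env_file.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Fails with OSError across filesystems or when not permitted.
        os.link(cached_file, env_file)
        return "hardlink"
    except OSError:
        # A symlink stores the absolute path of the cached file instead.
        os.symlink(cached_file.resolve(), env_file)
        return "symlink"
```

Calling `link_into_env("/shared/conda/pkgs/foo/lib/libfoo.so", "/home/alice/envs/demo/lib/libfoo.so")` would show which method a given cache/home combination permits.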

jonashaag commented 1 year ago

Is that even a good idea on a shared filesystem?

Does the filesystem offer locks in the first place? (With the right permissions)

katringoogoo commented 1 year ago

I'm going to talk to our filesystem specialist again, but I don't expect it to be much of a problem. All of our users' data lives on the same high-performance storage system (IBM Spectrum Scale) as well. So yes, locking is also possible.

In general I'm really open to different solutions as well. Maybe somebody also has some experience with such a solution?
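Whether advisory locking works at all on a given mount can be probed with a few lines of Python. This is a rough sketch using the standard `fcntl` module (Unix only); the helper name `can_lock` is hypothetical. It only shows that a lock can be taken from one host, and says nothing about lock coherence across cluster nodes or about the lock mechanism mamba itself uses.

```python
import fcntl
import os
import tempfile


def can_lock(directory):
    """Try to take and release an exclusive advisory lock on a scratch file."""
    fd, path = tempfile.mkstemp(dir=directory, prefix=".lockprobe-")
    try:
        try:
            # Non-blocking, so a lock that cannot be granted fails immediately.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always remove the scratch file again.
        os.close(fd)
        os.unlink(path)
```

Running `can_lock("/shared/conda/pkgs")` (a hypothetical path) as a user with write access would be a quick first check before discussing details with the storage team.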

jonashaag commented 1 year ago

Generally I think your request and proposal to have a read-only location is sensible. Do you know what Conda's behaviour is here?

katringoogoo commented 1 year ago

Regarding conda, I don't know tbh. I still have to test this.

read only central cache

I'm still thinking about a solution, and at the moment the one that will definitely work with micromamba (in the soon-to-be-released version) is

When a user now creates a new conda environment, it will use the envs path in the home directory and consider the existing packages in the central package cache when solving/downloading, so this should work as expected for me. However, this means that we will need to update that central cache once in a while, or else users will re-download new packages to their home folders.

updatable central cache

Apart from that, I also thought about using the suid/sgid bits so that users can automatically update the central cache every time they install something:

variant a:

This has the effect that a user can download packages to the central cache. However, the owner of the packages and the cache files will be the executing user, while the group on the created files/folders only has read rights. This creates a problem for the next user from the 'conda' group, who can no longer change those files.

variant b:

If I use variant b, the pkgs and cache files would have the right owner and permissions, but the conda env in the user's home dir would then be owned by the user 'conda' (if the permissions even allow it to write there), and the user would no longer have the rights to change it (only via the tool), which is also not very nice.

Variant a could easily be solved if micromamba had an option to also set group write permissions on the contents of the central 'pkgs' folder (packages & cache), so the next user would be able to write/update the cache files as well...

is there such a thing?
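Absent such an option, the group-write idea from variant a could be approximated outside micromamba. The sketch below is a workaround under assumptions, not a micromamba feature: a hypothetical helper `make_group_writable` walks the cache tree, adds group read/write to every entry, and adds the setgid bit to directories so new entries inherit the group. Only the owner of a file (or root) can change its mode, so each installing user would have to run it after an install, or an admin job would run it periodically. Note also that changing the mode of a cached file changes it for every hard link to that file.

```python
import os
import stat
from pathlib import Path


def make_group_writable(root):
    """Add group rw under root; directories also get group x and setgid.

    Returns the number of entries whose mode was changed.
    """
    root = Path(root)
    changed = 0
    for path in [root, *root.rglob("*")]:
        if path.is_symlink():
            continue  # never chmod through a symlink
        mode = stat.S_IMODE(path.lstat().st_mode)
        wanted = mode | stat.S_IRGRP | stat.S_IWGRP
        if path.is_dir():
            # setgid makes new entries inherit the directory's group.
            wanted |= stat.S_IXGRP | stat.S_ISGID
        if wanted != mode:
            os.chmod(path, wanted)
            changed += 1
    return changed
```

For example, `make_group_writable("/shared/conda/pkgs")` (path hypothetical) would be run by whoever owns the newly created files.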

jonashaag commented 1 year ago

I don't know, I feel like actually implementing this properly will be tough and make the code very complex.

Which makes me come back to this:

I’ve seen examples of successful use of a proxy that does caching, so you’d not use Mamba’s caching feature but the one from an HTTP proxy specifically made to proxy Conda channels.

I might actually be able to share some code that implements this :)

katringoogoo commented 1 year ago

I’ve seen examples of successful use of a proxy that does caching, so you’d not use Mamba’s caching feature but the one from an HTTP proxy specifically made to proxy Conda channels.

I might actually be able to share some code that implements this :)

Thanks for your answer! Yes, in general it sounds interesting, but this won't allow us to use symlinks to reduce file sizes in users' envs, or does it?

jonashaag commented 1 year ago

Yeah, unfortunately. Can you buy more disk space? (This might seem like super unhelpful advice, but some people never consider this. Disk space is very cheap.)