Related to #327 and #266, but unfortunately this can't be handled the same way, that is, if we want to be able to execute codes on arbitrary supercomputers.
NERSC is the current example, but no supercomputers would allow Sirepo (even on-premise) to have supervisory control of the execution of containers. In a Sirepo Docker environment, the Job Supervisor has privileges outside of the Docker containers. This allows it to select the Docker image to be run, pull it, and then start the Job Agent inside that container. The Job Agent runs only with the privileges of the authenticated Sirepo user.
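For concreteness, here is a minimal sketch of that flow, assuming the plain docker CLI; the function name and the agent entry point are illustrative, not Sirepo's actual supervisor code:

```python
import subprocess


def start_agent_in_container(image, run_dir, uid):
    # The supervisor runs outside the container, so it can choose and pull any image
    subprocess.check_call(["docker", "pull", image])
    # The agent is then started inside that image with only the user's privileges
    subprocess.check_call([
        "docker", "run", "--detach",
        "--user", str(uid),
        "--volume", f"{run_dir}:{run_dir}",
        image,
        "start-job-agent",  # placeholder for the actual agent entry point
    ])
```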
On NERSC, we do not have that control. Although we can download images from private repos, the NERSC end-user (proxied via the authenticated Sirepo user) would have privileges to all images in a private repo, so that wouldn't let us allow user X to run code P but not code Q. The control needs to be at the image level, not the repo level. This potentially could be solved similarly to how we provide access to library files, but it would require quite a bit of work to make SHIFTER behave properly.
Since we build our images with RPMs, we can dynamically install an RPM into the running container. This would allow access to the code as long as the container was running, but not when it wasn't. We also build our RPMs so that an ordinary user can install them, because they do not need access to root file systems.
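As an illustration (not our actual install tooling), a user-level install of a relocatable RPM could look roughly like this; the prefix and the --nodeps choice are assumptions on my part:

```python
import os
import subprocess


def install_code_rpm(rpm_path, prefix=os.path.expanduser("~/.local/sirepo-codes")):
    # Unprivileged install: per-user rpm database and a relocated prefix,
    # so no access to root file systems is needed.
    os.makedirs(prefix, exist_ok=True)
    subprocess.check_call([
        "rpm", "--install",
        "--dbpath", os.path.join(prefix, "rpmdb"),  # writable, per-user db
        "--prefix", prefix,  # requires a relocatable package
        "--nodeps",  # the per-user db has nothing to resolve dependencies against
        rpm_path,
    ])
```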
Unfortunately, there's still an issue with SHIFTER, because all container file systems are read-only except the user's home and scratch directories. This means the codes need to be relocatable (which most are or can be made to be, certainly the proprietary ones we've seen). The reason SHIFTER makes the container's file systems read-only is (probably) to reduce confusion in multi-node execution contexts, that is, running on a supercomputer. In this environment, anything that gets written has to be written to a shared file system so that it is visible to all nodes.
#226 comes into play, because now we are keeping track of codes individually instead of collectively in the image. This is a relatively easy problem to solve.
However, given that the codes have to be written to a shared file system, RPMs may not be the best deployment vehicle. It's likely that the job agent will do the installing before starting the job cmd. The job agent runs natively on the supercomputer, as opposed to inside the Docker container as it does on Sirepo. On the supercomputer, the rpm program may not be available, and cpio might not be either. This needs more thought, which is why I'm writing this.
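One possible shape for that, purely as a sketch: unpack the RPM payload onto the shared file system with rpm2cpio and cpio, with no rpm database needed. Again, neither tool is guaranteed to exist on the target machine, and the paths here are made up:

```python
import os
import subprocess


def unpack_rpm(rpm_path, dest):
    # Extract the payload under dest (e.g. a directory on $SCRATCH) so it is
    # visible to all nodes; no rpm database or root access involved.
    os.makedirs(dest, exist_ok=True)
    rpm2cpio = subprocess.Popen(["rpm2cpio", rpm_path], stdout=subprocess.PIPE)
    subprocess.check_call(
        ["cpio", "--extract", "--make-directories"],
        stdin=rpm2cpio.stdout,
        cwd=dest,
    )
    rpm2cpio.stdout.close()
    if rpm2cpio.wait() != 0:
        raise RuntimeError("rpm2cpio failed")
```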
The flip side is that we can get Docker going relatively quickly, because we control the installation, and all the files are currently written to /home/vagrant, so they would work with the existing shell init files and user permissions inside the container. We could implement this first.