radiasoft / sirepo

Sirepo is a framework for scientific cloud computing. Try it out!
https://sirepo.com
Apache License 2.0

Process disk use and run-time monitor #1312

Open robnagler opened 6 years ago

robnagler commented 6 years ago

#1234 will allow us to constrain docker containers in many ways, but not by run-time or disk use.

Docker allows us to constrain write speed, which may be important: we shouldn't allow processes to write GBs of data as fast as they can.

We can build a disk-use monitor inside the container that checks the report directory periodically (every few seconds) to determine whether a report is getting "too large" (we still need to determine this limit).
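A minimal sketch of what such a monitor could look like; the report directory, limit, and poll interval are placeholders, not existing Sirepo settings:

```python
import os
import time

REPORT_DIR = "/sirepo/run/report"   # hypothetical report directory
MAX_BYTES = 1 * 1024**3             # example limit: 1 GiB
POLL_SECONDS = 5

def dir_size(path):
    """Return total size in bytes of regular files under path."""
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            try:
                total += os.path.getsize(os.path.join(root, f))
            except OSError:
                # a file may disappear between listing and stat
                pass
    return total

def monitor():
    while True:
        if dir_size(REPORT_DIR) > MAX_BYTES:
            # a real monitor would signal the runner to stop the report
            raise RuntimeError("report exceeded disk limit")
        time.sleep(POLL_SECONDS)
```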

robnagler commented 6 years ago

--storage-opt allows you to control the size of the docker file system (not mounted volumes). You have to combine this with quotas on the file system inside the container.

This option would be useful if we only wanted to allow small (<1G) reports. There would have to be a transfer at the end of the run. If it were large (10G), that might take too long, especially since the network transfer would happen all at once at the end.

Yet another option would be to mount a fixed-size tmpfs volume in which to run the reports. Again, small sizes would be required.
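A sketch of how either constraint might be wired into the docker invocation (Sirepo's runner builds docker command lines in Python). The flags are standard docker options, but note that --storage-opt size= only works with storage drivers that support quotas (e.g. overlay2 on xfs with pquota); paths and sizes here are illustrative only:

```python
import subprocess

def run_constrained(image, cmd):
    argv = [
        "docker", "run", "--rm",
        # cap the writable container filesystem (does not apply to mounted volumes)
        "--storage-opt", "size=1G",
        # alternative: a fixed-size tmpfs for the report output
        "--tmpfs", "/report:size=1g",
        image,
    ] + cmd
    return subprocess.run(argv, check=True)
```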

bruhwiler commented 6 years ago

We do not want to slow down I/O for HPC applications. However, it might make sense on a shared resource like the Sirepo beta server.

Constraining disk usage for a simulation is different -- every HPC system will have some limit on disk usage.

robnagler commented 6 years ago

When we show an animation, it's because there's data being written frequently. Sirepo could have a batch mode (and that's on the schedule anyway), which should reduce the disk writing frequency to something "reasonable".

This means we need to understand the output structures for the codes we support. This is tricky, but it's necessary. There must be ways of checking parameters to guess how much data will be written, whether interactive or batch; a rough sketch follows.
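Purely hypothetical heuristic to illustrate the idea; the parameter names and formula are made up, not taken from any supported code:

```python
def estimated_output_bytes(n_particles, n_steps, save_every, fields_per_particle=6):
    """Rough upper bound: one float64 per field per particle per saved step."""
    saved_steps = n_steps // save_every + 1
    return n_particles * saved_steps * fields_per_particle * 8

# e.g. 1e6 particles, 10,000 steps, saving every 100 steps ~= 4.8 GB
# estimated_output_bytes(1_000_000, 10_000, 100)
```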

One of the problems we have is that we don't have per-user Unix accounts the way NERSC does. We don't have to be "webscale", but to get real quotas we would have to implement a disk quota system backed by Unix users, which would be awkward at best.

The architecture we have now doesn't support constraints. It also requires the data to be sent over the wire, unnecessarily. What I would ultimately like to see is a container per user and all operations would happen in the container, even the graph returns. #1234 goes a long way towards that. However, the rest of the server is very much file-oriented and would require a refactoring to decouple all the code operations. I think this would be a good idea, but it's a lot of work.

For supercomputers, we will be forced to update the architecture. The remote agent will have to do all the rendering and communicate via messages back to the server, which would forward them on to the GUI.

robnagler commented 6 years ago

I was not clear about the proposed "new" architecture. With a message-based system, we would have agents monitoring codes. With an sbatch system, that would be an agent watching files and the queue. With a docker-based (#1234) cluster system, we would start the containers in a constrained way, e.g. with 10GB of disk allocated to the user on the remote machine. Results in all cases would be processed by the agent and returned to the main server. The full data files would never go over the network unless the user downloaded them directly. We could also have the agent compress result files after the simulation was complete. (This could be done at the HDF5 or file-system level, depending on how we thought it would work best.)
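A sketch of the agent compressing result files once the simulation completes, assuming plain gzip at the file level (repacking inside the HDF5 files would be the other option); the paths and extensions are placeholders:

```python
import gzip
import pathlib
import shutil

def compress_results(run_dir, extensions=(".h5", ".dat", ".txt")):
    """Gzip each result file in run_dir and remove the original."""
    for f in pathlib.Path(run_dir).iterdir():
        if f.is_file() and f.suffix in extensions:
            with open(f, "rb") as src, gzip.open(str(f) + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            f.unlink()  # keep only the compressed copy
```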

robnagler commented 3 months ago

I think we should just have the agent poll the run directory. If it crosses a threshold, kill the simulation and show an appropriate user alert. Free users would have a lower limit than premium users. du runs pretty fast on a single directory.
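A sketch of that agent-side poll: run du on the run directory, compare against a per-tier limit, and stop the simulation if it is exceeded. The tier limits, termination method, and alert hook are assumptions:

```python
import subprocess
import time

LIMIT_BYTES = {
    "free": 1 * 1024**3,      # example: 1 GiB
    "premium": 10 * 1024**3,  # example: 10 GiB
}

def du_bytes(path):
    # "du -sb" totals a single directory quickly, in bytes (GNU du)
    out = subprocess.run(
        ["du", "-sb", path], capture_output=True, text=True, check=True
    )
    return int(out.stdout.split()[0])

def poll(run_dir, tier, proc, interval=5):
    """Watch a running simulation process and kill it if it exceeds its quota."""
    while proc.poll() is None:
        if du_bytes(run_dir) > LIMIT_BYTES[tier]:
            proc.terminate()  # a real agent would also record a user-visible alert
            return "killed: disk limit exceeded"
        time.sleep(interval)
    return "completed"
```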