radiasoft / sirepo

Sirepo is a framework for scientific cloud computing. Try it out!
https://sirepo.com
Apache License 2.0

getApplicationData via job_api #2201

Closed moellep closed 2 years ago

moellep commented 4 years ago

When calling into an app using getApplicationData or running a fixup, any code-specific calls need to run outside of the main webserver. For example, all srwlib calls in template/srw.py should be moved into a separate module that can be executed in a subprocess. Consolidate srw fixups into one routine for clarity, and possibly always run it as a subprocess.
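A minimal sketch of the pattern described above: serializing a request, running the code-specific work in a subprocess, and reading back a JSON result, so the main webserver process never imports srwlib. The function name `compute_undulator_length` and the worker wiring are illustrative assumptions, not Sirepo's actual API.

```python
# Hedged sketch: run a code-specific call in a subprocess so it never
# executes inside the main web server process. All names are illustrative.
import json
import subprocess
import sys

# Minimal worker source; in Sirepo this is where srwlib would be
# imported and called. Here it just echoes the parameters back.
WORKER_SRC = """
import json, sys
req = json.load(sys.stdin)
json.dump({"func": req["func"], "result": req["params"]}, sys.stdout)
"""


def run_in_subprocess(func_name, params):
    """Serialize the request, run the worker, return the parsed result."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_SRC],
        input=json.dumps({"func": func_name, "params": params}),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    out = run_in_subprocess("compute_undulator_length", {"period": 0.02})
    print(out["result"])
```

The same request/response shape could later be reused unchanged when the subprocess is replaced by pkcli.job_cmd running in a separate container.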

robnagler commented 4 years ago

To be clear, it needs to run in pkcli.job_cmd, which runs in a separate docker container so that there are no privilege issues and so that the main web server doesn't block a thread for an extended period of time.

robnagler commented 4 years ago

I think in the end, we will need to keep fastcgi, which is "pretty fast". Application data is just another OP_ANALYSIS, so it will run over the same channel. Perhaps we might even have simulations started this way, because you can't run a sequential simulation at the same time as a simulation frame. They all run in the same docker container, which is assigned one core, so there would be no point in overloading it.

e-carlin commented 3 years ago

There has been some email and in-person discussion about this issue. Putting the notes from those conversations here.

We've decided to try to categorize the types of get_application_data calls and to use new APIs to incrementally move over.

I did an initial categorization of all of the existing get_application_data calls.

@robnagler responded with:

I think the calls need to be categorized as follows according to "level":

A. statelessCompute - Purely functional/stateless but requiring access to the code. An agent has to be running, but it does not go to NERSC/sbatch, because it is just getting access to the code, not to something about the data/compute.
B. simulation_db - reads the simulation directory that contains sirepo-data.json or the lib directory, but not the runDir.
C. sim_db - access to the runDir and/or runDirs of other codes. This may require hitting NERSC/sbatch agents.
D. analysisJob - something that reads/writes from an existing runDir(s).
E. computeJob - something that rewrites the runDir.

(A) pass parameters and get results. No reference to any external data.

(B) needs to run in Flask, because it does not need access to a code, and all the files reside there. Hopefully this request is fast, but if it isn't, we would need to optimize by using a better database.

(C) can also get information from (B), but it doesn't write anything, just read. It might get access to all runDirs, e.g. jspec's getting twiss parameters. I make the distinction on writing, but maybe that's not a good distinction. It might be that (C) and (D) are the same.
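The five proposed levels above can be written down as data to make the distinctions concrete. This is an illustrative sketch only: the category names come from the thread, but the routing table is an assumption about where each level might execute, not a description of Sirepo's actual dispatcher.

```python
# Illustrative only: the five proposed "levels" as an enum, plus a
# hypothetical routing table (where each level might execute).
import enum


class CallLevel(enum.Enum):
    STATELESS_COMPUTE = "statelessCompute"  # (A) purely functional, needs code only
    SIMULATION_DB = "simulation_db"         # (B) reads sirepo-data.json / lib dir
    SIM_DB = "sim_db"                       # (C) reads runDir(s), may hit sbatch
    ANALYSIS_JOB = "analysisJob"            # (D) reads/writes existing runDir(s)
    COMPUTE_JOB = "computeJob"              # (E) rewrites the runDir


# Assumed routing: (A) always goes to a local/docker agent, (B) stays in
# the web server, (C)-(E) may need the agent that owns the data.
ROUTES = {
    CallLevel.STATELESS_COMPUTE: "local-agent",
    CallLevel.SIMULATION_DB: "web-server",
    CallLevel.SIM_DB: "data-agent",
    CallLevel.ANALYSIS_JOB: "data-agent",
    CallLevel.COMPUTE_JOB: "data-agent",
}


def route_for(level):
    """Return the assumed execution target for a call level."""
    return ROUTES[level]
```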

e-carlin commented 3 years ago

@robnagler Mostly makes sense. Some questions:

> A. statelessCompute - Purely functional/stateless but requiring access to the code. An agent has to be running, but it does not go to NERSC/sbatch, because it is just getting access to the code, not to something about the data/compute.

To be clear, it doesn’t go out to a NERSC/sbatch agent, but it does go to a docker/local agent, correct?

> I make the distinction on writing, but maybe that's not a good distinction. It might be that (C) and (D) are the same.

I don’t know that the distinction is necessary. Can you elaborate more on why it may be necessary?

What category do the calls to epics that write out to a runDir fall into? (D)? How about a call to epics that just reads data from epics but doesn’t touch the runDir?

Likewise, what about calls to the crystal server? Nothing is read/written from the db, so maybe it is functional, but it won't be fast.

robnagler commented 3 years ago

> To be clear, it doesn’t go out to a NERSC/sbatch agent, but it does go to a docker/local agent, correct?

Yes.

> I don’t know that the distinction is necessary. Can you elaborate more on why it may be necessary?

I was thinking that (D) analysisJob writes files (sometimes) and (C) sim_db doesn't. Probably simpler to merge the two cases.

Here's an update to the categories:

A. statelessCompute
B. simulation_db
C. analysisJob
D. computeJob

(D) creates a runDir or needs a tmpDir. That distinction is pretty arbitrary, but it's certainly something quite distinct, and easier to reason about. (C) reads one or more runDirs and anything (B) can read. (B) only reads from "not yet run" simulation data. (A) all data is contained in the request or is already in the running container (e.g. flash).

Flash compilation is (D) now, I think, since it needs a runDir or tmpDir.

There's also the question of sequential or parallel, but I think that's already "solved". I think computeJob is the only thing that should run in parallel (as another distinction); that is, (A)-(C) are only given one core.
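The revised four-category scheme can be sketched as a small table of properties, which makes the distinctions in this comment explicit: only computeJob creates a runDir and may run in parallel. The field names are assumptions for illustration, not Sirepo identifiers.

```python
# Sketch of the revised scheme as data. Field names are assumptions;
# the property values follow the distinctions described in the thread.
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    name: str
    creates_run_dir: bool  # only computeJob needs a fresh runDir/tmpDir
    reads_run_dirs: bool   # analysisJob and computeJob touch existing runDirs
    parallel: bool         # only computeJob is given more than one core


CATEGORIES = {
    "statelessCompute": Category("statelessCompute", False, False, False),
    "simulation_db": Category("simulation_db", False, False, False),
    "analysisJob": Category("analysisJob", False, True, False),
    "computeJob": Category("computeJob", True, True, True),
}
```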

> What category do the calls to epics that write out to a runDir fall into? (D)? How about a call to epics that just reads data from epics but doesn’t touch the runDir?

I don't understand the difference exactly, but does it need a new runDir or does it reuse data from another runDir?

> Likewise, what about calls to the crystal server? Nothing is read/written from the db so maybe it is functional but it won't be fast.

Speed is not the issue. That's handled through spinners and buttons in the GUI. The GUI should always assume that something the server does takes some time. The server might be down or just starting or busy.

e-carlin commented 3 years ago

Makes sense, thanks.

> What category do the calls to epics that write out to a runDir fall into? (D)? How about a call to epics that just reads data from epics but doesn’t touch the runDir?

> I don't understand the difference exactly, but does it need a new runDir or does it reuse data from another runDir?

In the first case (webcon.update_kicker) we create a runDir and save some data to it. In the second case (webcon.read_kickers) we don't interact with the runDir at all, just call out to the epics server. So, I think the first case is (D) and the second case is (A). Agree?

e-carlin commented 3 years ago

I've made an architecture doc outlining what we've discussed. I'll update it as I work on this.

robnagler commented 3 years ago

> So, I think the first case is (D) and the second case is (A). Agree?

Yes, a pure call to epics is like the crystal example.

I think though, there's a slight problem here. If there's a "fake" epics server, then it will be running in an existing container. If it is a real epics server, then access will need to be authorized, and that implies some state checking (simulation X has the rights to epics variables Y or somesuch). For the crystal example, the server is public on the Internet so that's why it is truly stateless.

e-carlin commented 3 years ago

Yes, that's interesting. I don't exactly understand how epics works, but from the look of the current app the user just supplies the address of the epics server, so I suppose they are public too.

Any preference on how I begin working on these? I think working down the list from A-D will probably be easiest.

robnagler commented 3 years ago

EPICS has no security afaik. I think we should provide some, but maybe this is a red herring case. In the case of our test environment, we can generate random addresses or whatever to ensure people can't guess. Not important.

Yes, work on (A) first. New API that is purely stateless. I think it should work in the analysis slot, i.e. you don't get a new agent. The trick, of course, is that even if the simulation is configured for sbatch, it should still go to the local docker agent.
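A hedged sketch of what the purely stateless (A) API could look like: every input arrives in the request body and the reply depends on nothing else, which is what makes routing it to a local docker agent safe. The method name, dispatch table, and response shape are all illustrative assumptions, not Sirepo's actual statelessCompute API.

```python
# Hypothetical sketch of a statelessCompute handler: no simulation_db,
# no runDir, no external state. All names here are illustrative.
def api_stateless_compute(body):
    """Dispatch a named, side-effect-free computation on request data."""
    handlers = {
        # e.g. a code-specific calculation that only needs the code itself
        "scale": lambda p: {"value": p["x"] * p["factor"]},
    }
    method = handlers.get(body.get("method"))
    if method is None:
        return {"state": "error", "error": "unknown method"}
    return {"state": "completed", "result": method(body["params"])}


if __name__ == "__main__":
    print(api_stateless_compute(
        {"method": "scale", "params": {"x": 2.0, "factor": 3.0}}
    ))
```

Because the handler is a pure function of its input, the same body can be replayed against any agent, which is the property that lets (A) calls share the analysis slot instead of spawning a new agent.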