radiasoft / sirepo

Sirepo is a framework for scientific cloud computing. Try it out!
https://sirepo.com
Apache License 2.0

run bluesky simulation without agent #5471

Open moellep opened 1 year ago

moellep commented 1 year ago

Bluesky would be much faster if it avoided the agent and returned the result data file directly from the call.

robnagler commented 1 year ago

I think you mean using sirepo.lib, correct?

moellep commented 1 year ago

Because sirepo is usually run in a container in this case, I think adding a special runSimulationBluesky API would work well. It would avoid the agent, polling, and return the result file directly from the call. I'll have a draft PR with some of the ideas very soon.
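To make the difference concrete, here is a rough sketch of the two call patterns. It is a toy model, not sirepo code: `FakeJobQueue`, `run_with_polling`, and `simulate_and_reply` are hypothetical names, and the queue only counts requests instead of talking to a real agent.

```python
import time


class FakeJobQueue:
    """Hypothetical stand-in for an agent-backed job queue (not sirepo's API)."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done
        self.requests = 0  # counts every round trip the client makes

    def start(self):
        self.requests += 1

    def status(self):
        self.requests += 1
        self._remaining -= 1
        return "completed" if self._remaining <= 0 else "running"

    def get_data_file(self):
        self.requests += 1
        return b"result"


def run_with_polling(queue, poll_secs=0.0):
    """Start the job, poll until it completes, then fetch the file separately."""
    queue.start()
    while queue.status() != "completed":
        time.sleep(poll_secs)
    return queue.get_data_file()


def simulate_and_reply(simulate):
    """Run the simulation in the request itself and return the file directly."""
    return simulate()
```

In this model the polling path costs one start request, one request per poll, and one more to fetch the data file, while the direct path is a single call.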

robnagler commented 1 year ago

Not happy about this, but understand the use case.

I don't think it should be called runSimulationBluesky. I think even our "bluesky" auth is confusing. I would prefer simulateAndReply or something that is active and contains the context of what is actually happening.

The API should be in a separate module that can be added to api_modules. It should assert "bluesky" auth login mode. IOW, it should be very restricted.
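Roughly what I mean by restricted, as a sketch only. The request shape, the `auth_method` key, and the `Forbidden` exception are made up for illustration and are not sirepo's actual API or auth interface.

```python
class Forbidden(Exception):
    """Raised when the caller is not logged in with the required auth method."""


def simulate_and_reply(request, run_simulation):
    """Hypothetical handler: refuse anything that is not a bluesky login.

    `request` is assumed to be a dict with "auth_method" and "data" keys;
    `run_simulation` is any callable that returns the result file bytes.
    """
    if request.get("auth_method") != "bluesky":
        raise Forbidden("simulateAndReply requires bluesky auth")
    return run_simulation(request["data"])
```

The check comes first, so nothing runs for a caller that logged in any other way.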

robnagler commented 1 year ago

Can't let this go. There has to be something wrong with the configuration. When I go to alpha, I can run simulations repeatedly with no start up lag.

I suggest the following parameter updates:

export SIREPO_JOB_DRIVER_LOCAL_SLOTS_PARALLEL=8
export SIREPO_JOB_DRIVER_LOCAL_SLOTS_SEQUENTIAL=8
export SIREPO_JOB_DRIVER_IDLE_CHECK_SECS=1d

I am assuming that this particular configuration is with local drivers. If that's wrong, we can just bump the number of docker instances. You can oversubscribe a machine as long as the users are not all active at the same time. The agent processes don't consume resources. job_cmd consumes the resources, and it runs in a separate process.

By setting idle_secs to one day, we avoid shutting down agents every half hour (default). I am assuming there are a handful of users who are running simulations sporadically. If there are many users running lots of simulations all the time, these numbers might be wrong.

Please try this. I'm happy to help debug it. Agent start time, especially local agent start time, should not be a hindrance, or sirepo.com wouldn't work as well as it does.

moellep commented 1 year ago

The JOB_DRIVER settings didn't appear to help performance. I've added a draft PR with the changes, which run bluesky sims outside the agent. Performance is 3x faster for short sims like the Shadow BeamStatisticsReport. These simulations run in around 1.5 seconds. The old interface has over 4 seconds of overhead, including a separate call to getDataFile.
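As a back-of-the-envelope check on those numbers: the sketch below treats "over 4 seconds" as exactly 4.0 and "3x" as exactly 3.0, which are simplifying assumptions, so the result is only a rough estimate of the remaining overhead.

```python
# Figures quoted above; overhead and speedup are rounded assumptions.
sim_secs = 1.5           # approximate run time of the simulation itself
old_overhead_secs = 4.0  # lower bound on the old interface's overhead
speedup = 3.0            # approximate observed speedup

old_total_secs = sim_secs + old_overhead_secs  # about 5.5 s per call
new_total_secs = old_total_secs / speedup      # about 1.8 s per call
new_overhead_secs = new_total_secs - sim_secs  # about 0.3 s left over

print(f"old total: {old_total_secs:.2f} s")
print(f"new total: {new_total_secs:.2f} s")
print(f"remaining overhead: {new_overhead_secs:.2f} s")
```

Under those assumptions, almost all of the remaining time is the simulation itself.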