radiasoft / sirepo

Sirepo is a framework for scientific cloud computing. Try it out!
https://sirepo.com
Apache License 2.0

start agent when visiting the simulations list #7319

Open moellep opened 1 week ago

moellep commented 1 week ago

This is a regression of #4230: the agent is no longer prestarted when someone visits the simulation list. As a result, opening a simulation incurs a 10-second delay while the agent initializes. See also #7318.

e-carlin commented 1 week ago

I'll handle this

e-carlin commented 1 week ago

@moellep Can you confirm that you only see this after the agent has been idle for too long and has been killed? If I restart the server and visit the sim list, I see that an agent is started. If I then wait for the agent to be killed for being idle too long and refresh the sim list, no agent is started.

e-carlin commented 1 week ago

Looking at the code, I now realize that a new agent isn't started because of the interval for sending beginSession requests. In the default config that interval is 5m and idle_check_secs is 1800s. So with the default config (which we are running in prod), I think those two numbers should play well together. For testing, I reduced idle_check_secs to 10s and _REFRESH_SESSION to 5s, started an agent by visiting the sim list, waited for the agent to be killed, refreshed, and a new agent was started.

@moellep can you help with steps to reproduce the problem?
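
To illustrate the timing argument, here is a minimal sketch (not Sirepo code). It assumes that beginSession is sent at most once per refresh interval, that idle_check_secs acts as the idle timeout, and that the last beginSession was also the agent's last activity. `SessionTiming` and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SessionTiming:
    """Hypothetical model of the two settings discussed above."""

    refresh_session_secs: float  # stand-in for _REFRESH_SESSION
    idle_check_secs: float  # treated here as the agent idle timeout

    def begin_session_due(self, secs_since_last_begin: float) -> bool:
        # The client only re-sends beginSession once the interval has elapsed.
        return secs_since_last_begin >= self.refresh_session_secs

    def agent_killed(self, secs_since_last_activity: float) -> bool:
        # The agent is killed once it has been idle for the full timeout.
        return secs_since_last_activity >= self.idle_check_secs

    def visit_restarts_agent(self, secs_idle: float) -> bool:
        # Assumes the last beginSession was also the last agent activity.
        return self.agent_killed(secs_idle) and self.begin_session_due(secs_idle)

    def has_gap(self) -> bool:
        # True if an agent can be killed while beginSession is still throttled.
        return self.idle_check_secs < self.refresh_session_secs


default_cfg = SessionTiming(refresh_session_secs=300, idle_check_secs=1800)
test_cfg = SessionTiming(refresh_session_secs=5, idle_check_secs=10)
bad_cfg = SessionTiming(refresh_session_secs=300, idle_check_secs=10)
```

Under these assumptions, `default_cfg` and `test_cfg` have no gap, so a visit after the agent is killed always triggers a new beginSession. A config like `bad_cfg`, where the idle timeout is shorter than the refresh interval, could leave a window in which the agent is dead but no beginSession is sent.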

moellep commented 1 week ago

Ah, I think I'm seeing something different. Yes, I see log messages that the agent has started, and when I visit an SRW simulation it is immediately available, so this is not a regression.

I'm seeing an initial delay in apps like openmc, where the initial page (geometry in this case) waits for the serverStatus reply before it renders.

The initial call to serverStatus (after a dev restart, or after the agent has been stopped on prod) takes around 10 seconds, but only on the first visit. If I then visit a different sim, it replies very quickly.

e-carlin commented 1 week ago

Ok, that makes sense. On the first visit we have no option but to wait for the agent to start again if it has been killed (idle timeout or server restart).

I think this can be closed but lmk if there is still something to be solved.

moellep commented 1 week ago

I was hoping we could somehow move the 10-second serverStatus delay up front. Something is being initialized there that isn't related to starting the agent (maybe the fastCGI?). I only see the delay when visiting the first sim, so maybe that initialization could be included with beginSession.

robnagler commented 1 week ago

I have a fix for this in 6784-fastcgi-improve. There are multiple problems.

e-carlin commented 1 week ago

Thanks Rob. Let's talk about this when you're back and I can look into finishing the work on your branch.

From my read of the problem: as @moellep suspected, we are waiting on fastcgi. openmc does a statefulCompute before initSimulation. That statefulCompute depends on fastcgi, which must be started and must reply before we can get the status back. All of the round trips plus starting fastcgi take time. I think we could prestart fastcgi as part of beginSession, but I need to think about that a little harder before I'm convinced. It probably makes sense to fold it into Rob's work.
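
A rough sketch of the prestart idea, using a toy model rather than Sirepo's actual agent code. `FakeAgent`, its methods, and the `prestart_fastcgi` flag are all hypothetical; the `asyncio.sleep(0)` calls are placeholders for the real startup cost.

```python
import asyncio


class FakeAgent:
    """Toy stand-in for an agent; records the order of events."""

    def __init__(self):
        self.started = False
        self.fastcgi_started = False
        self.log = []

    async def start(self):
        if not self.started:
            await asyncio.sleep(0)  # placeholder for agent startup time
            self.started = True
            self.log.append("agent_start")

    async def start_fastcgi(self):
        await self.start()
        if not self.fastcgi_started:
            await asyncio.sleep(0)  # placeholder for fastcgi startup time
            self.fastcgi_started = True
            self.log.append("fastcgi_start")

    async def begin_session(self, prestart_fastcgi=False):
        await self.start()
        if prestart_fastcgi:
            # Hypothetical option: pay the fastcgi cost on the sim list page.
            await self.start_fastcgi()
        self.log.append("begin_session")

    async def stateful_compute(self):
        # Returns True if this call had to wait for a cold fastcgi start.
        cold = not self.fastcgi_started
        await self.start_fastcgi()
        self.log.append("stateful_compute")
        return cold


def first_visit(prestart_fastcgi):
    """Simulate visiting the sim list, then opening the first simulation."""

    async def _run():
        agent = FakeAgent()
        await agent.begin_session(prestart_fastcgi=prestart_fastcgi)
        cold = await agent.stateful_compute()
        return cold, agent.log

    return asyncio.run(_run())
```

In this model, `first_visit(False)` has the first statefulCompute wait on a cold fastcgi start, while `first_visit(True)` moves that start into beginSession. Whether that trade-off is right for real agents (for example, starting fastcgi for users who never open a simulation) is the open question raised above.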