ibpsa / project1-boptest

Building Optimization Performance Tests

TestID/Instance Remaining After Client Abruptly Shuts Down #715

Closed sn3kyJ3di closed 16 hours ago

sn3kyJ3di commented 1 day ago

I noticed something today... I am not sure if it's a bug or not:

1) Clone the repo
2) Run "docker compose up web worker provision"
3) Have a client make a connection and start advancing the test case
4) Shut the client down
5) Start the client back up from the beginning (POST the same test case and get a new testid)

The result is no connection to the container. If I stop the containers with "docker compose down" and bring them back with "docker compose up web worker provision", I see in the console what look like old connections that have been abandoned. The only way I can get it working again is to wipe everything on the service, reboot, and start over. If I run "docker compose up web worker provision --scale worker=X", I can make multiple connections from the same client; that helps, but I might eventually run into the same problem.

Do these connections time out eventually? Would it make sense to have a POST command to flush all connections, which I could run before my client sets a test case and gets a new testid? I see this call:

PUT stop/{testid} - it seems to me this is useful only for an orderly shutdown. If my client stopped abruptly and I do not have the testid, this call is not as useful. Am I missing something? I am happy to demonstrate this all live if it helps.

-Aaron

dhblum commented 1 day ago

Thanks @sn3kyJ3di for reporting. A couple comments/questions:

  1. The PUT stop/{testid} indeed needs to be invoked between your steps 4 and 5 if your number of running test cases is equal to X in --scale worker=X (or 1 if you don't use that option). Doing that should free up a worker so your step 5 works OK. Is this true for you?
  2. Restarting the service with "docker compose down" followed by "docker compose up web worker provision" between your steps 4 and 5 should also let you execute step 5. The first command stops and removes the containers and the network; the second starts everything up again. Effectively, this should serve as a reboot. Is this true for you? I'm not quite sure what you mean by "The only way I can get it to work again is to wipe everything on the service, reboot and start over."
  3. A running test case does time out eventually if it isn't called for some time. I think the timeout is 15 minutes (@kbenne, is this correct?). The timeout effectively invokes PUT stop/{testid} for that test case, so it should free up the worker again to start a new test case.
  4. To your point, "Would it make sense to have a POST command to flush all connections I could run prior to my client setting a test case and getting a new testid?", it is indeed quite possible to lose a testid if a client shuts down abruptly, and probably in other cases too. @kbenne, does it make sense to add such a request? One caveat is that if multiple users are using the service, we don't want one user to be able to stop a test case being run by another user, so there would probably need to be some authorization aspect to this.
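In the meantime, the lost-testid scenario can be mitigated on the client side by persisting the testid to disk, so a restarted client can stop its own stale test. A minimal sketch: the PUT stop/{testid} call is the one discussed above, while BOPTEST_URL, TESTID_FILE, the select endpoint, and the response key name are assumptions for illustration:

```shell
#!/bin/sh
# Sketch: persist the testid so a client that crashed can clean up its
# previous test on the next start. BOPTEST_URL and TESTID_FILE are
# assumed defaults; adjust for your deployment.
BOPTEST_URL="${BOPTEST_URL:-http://127.0.0.1:80}"
TESTID_FILE="${TESTID_FILE:-./boptest_testid}"

# On startup: if a testid from a previous run was left behind, try to
# stop it. The PUT fails harmlessly if the test already timed out.
if [ -f "$TESTID_FILE" ]; then
  curl -s --max-time 5 -X PUT "$BOPTEST_URL/stop/$(cat "$TESTID_FILE")" || true
  rm -f "$TESTID_FILE"
fi

# After selecting a test case, record the new testid immediately, e.g.
# (endpoint path and JSON key are assumptions):
#   testid=$(curl -s -X POST "$BOPTEST_URL/testcases/<name>/select" | jq -r '.testid')
#   printf '%s' "$testid" > "$TESTID_FILE"
```

With this pattern, an abrupt shutdown leaves the testid file behind, and the next startup frees the worker before requesting a new test.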

sn3kyJ3di commented 1 day ago

"The only way I can get it to work again is to wipe everything on the service, reboot and start over." = docker compose down, delete the images, reboot the host, run docker compose up web worker provision again. Allow me to try that again in the morning; my head is mush at the moment... but based on your response, "docker compose down" should wipe out those previous sessions, no?

dhblum commented 1 day ago

Ok let me know the result. Yes, "docker compose down" should wipe out the previous sessions.

kbenne commented 17 hours ago

Indeed the default timeout is 15 minutes. You can change this by setting the environment variable BOPTEST_TIMEOUT to the number of seconds you want the timeout to be.

For example, export BOPTEST_TIMEOUT=60 gives a one-minute timeout.

This is an idle timeout, so it won't interfere even if your overall test runs longer than the timeout; it only applies when no requests are made for that long.
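Setting the variable might look like the following for a single session (assuming, per the description above, that docker compose passes BOPTEST_TIMEOUT from the environment through to the worker):

```shell
# Compute a 15-minute idle timeout in seconds and export it so that
# docker compose can forward it to the worker service.
BOPTEST_TIMEOUT=$((15 * 60))
export BOPTEST_TIMEOUT
echo "$BOPTEST_TIMEOUT"   # 900
# docker compose up web worker provision   # then start the service
```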

I don't think there's much downside to "overprovisioning" the number of workers using the --scale option that @dhblum mentioned. You can have extra workers ready to go, and if they aren't running simulations they won't tax your system much.
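Concretely, the overprovisioning approach might look like this (the worker count of 4 is an arbitrary example, not a recommendation):

```shell
# Start the service with four workers so up to four test cases can run
# concurrently; idle workers consume little while waiting.
docker compose up --scale worker=4 web worker provision
```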

sn3kyJ3di commented 16 hours ago

I re-ran everything, and "docker compose down" does flush everything. It seems my issues were due to a misunderstanding on my part. Thank you for the info on the timeout.

dhblum commented 16 hours ago

Great, thanks for letting us know @sn3kyJ3di and indicating this issue is resolved.

dhblum commented 16 hours ago

Actually, @kbenne, this makes me think: are there other options like BOPTEST_TIMEOUT that would be helpful for a user to know about and have access to? Perhaps we should document them, e.g. in the README or user guide?

kbenne commented 16 hours ago

I think it would be good to call out BOPTEST_TIMEOUT in the README. Looking at the other variables in the .env file, I'm not sure there is more that should be elevated to the README. Perhaps the default user credentials, but I think we might want to hold off on that until we have more to say about users / accounts in general.