[Change Request] Investigate slow server and nodes going offline

bartvanb commented 1 month ago

Description

Running nodes seem to consume a lot of CPU on the workshop server (unlike on other v6 servers)
Nodes go offline without apparent reason after ~24 hours (unlike on other v6 servers)
The server is slow

Maybe these are related

frankcorneliusmartin commented 1 month ago

During the workshop we had 2 instances on the same App Service as our other servers.

I moved the workshop server to its own App Service so we can scale it independently. I could not find an explanation for the high CPU load, as I no longer observe it. It might be that a lot of nodes had central tasks polling (although these requests are very lightweight).

What I did

I started with a DEV App Service plan with 1 instance, which gave me a similarly poor experience as during the workshop. Starting the nodes took several minutes and during this time the UI was unusable. Sending tasks worked, but was very laggy.
I then upgraded to DEV App Service plan with 1 instances (I rebooted all the nodes to make sure they were assigned to different instances). The experience improved somewhat but was still poor.
Then I upgraded to Premium v3 P3V3 with 3 instances, which was a significant improvement on my end. It only got laggy when I started about 10 tasks.
Then I upgraded to Premium v3 P3V3 with 6 instances, which worked very well. I sent batches of 10 (infinite central) tasks; the UI slowed down a little but still felt OK-ish.

Of course we have ~30 participants, which means about 30 nodes + 30 open UIs. That would still be a challenge, but we can scale to 30 instances. We should also consider the costs at this point (30*450 / 31 = 435 euros per 24h); if we are conservative we can probably use it for about 18h, which comes to ~326 euros.
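A quick sketch of that cost estimate, in case we want to try other instance counts or durations. It assumes the 450 in the formula above is the price in euros per instance per month and that 31 is the number of days in the billing month; both readings are assumptions, not confirmed Azure pricing.

```python
# Rough scaling-cost estimate for the workshop App Service plan.
# ASSUMPTION: 450 is the monthly price in euros of one instance,
# billed pro rata over a 31-day month.
MONTHLY_PRICE_PER_INSTANCE_EUR = 450
DAYS_IN_MONTH = 31


def cost_eur(instances: int, hours: float) -> float:
    """Return the pro-rata cost in euros of running `instances` for `hours`."""
    hourly_price = MONTHLY_PRICE_PER_INSTANCE_EUR / (DAYS_IN_MONTH * 24)
    return instances * hourly_price * hours


if __name__ == "__main__":
    for hours in (24, 18):
        print(f"30 instances for {hours}h: ~{cost_eur(30, hours):.0f} euros")
```

Because this does not round the daily figure before scaling to 18h, the 18h result can differ by about a euro from the estimate above.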

Recommendations

We need to remove tasks after each lesson to keep things clean. When we reboot nodes, all subtasks are started again, which wastes the server's request capacity. Central tasks that poll indefinitely should also be avoided and killed. We can use [client.task.delete(task.get("id")) for task in client.task.list(per_page=999)["data"]] to do so.
We should consider not letting all participants send tasks at the same time (maybe we can use the groups?)
We probably want to start the nodes in batches.
Maybe lower the poll frequency of central tasks
Scale up right before the workshop (then start the nodes) and scale down at the end of the first day to save costs

frankcorneliusmartin commented 1 month ago

The only information we have about the nodes shutting down is the following logs. The node seems to receive a normal kill signal:

2024-09-20 20:57:00 - node           - INFO     - Node is interrupted, shutting down...
2024-09-20 20:57:00 - socket         - INFO     - Disconnected from the server
2024-09-20 20:57:01 - socket         - INFO     - Oak_Date left room collaboration_104
2024-09-20 20:57:01 - network_man..  - DEBUG    - Disconnecting vantage6-Oak_Apple-user from network'vantage6-Oak_Apple-user-net'

I also see that the node frees up memory at this time, but it is definitely not running out of memory.

The machine has no swap memory at all, so that might be the cause.

frankcorneliusmartin commented 1 month ago

OK, I think I found the cause. At the time the nodes shut down, Azure applies updates to our VM. I am not sure why this kills our containers.

It seems we are not able to control this process, as it is an 'Azure Managed - Safe Deployment'.

I suggest we take the risk. Worst case we would need to reboot the nodes.

frankcorneliusmartin commented 1 month ago

We are going with this plan.

vantage6 / vantage6-workshop

[Change Request] Investigate slow server and nodes going offline #137