vantage6 / vantage6-workshop

https://vantage6.github.io/vantage6-workshop/

[Change Request] Investigate slow server and nodes going offline #137

Closed by bartvanb 1 month ago

bartvanb commented 1 month ago

Description

Maybe these are related

frankcorneliusmartin commented 1 month ago

During the workshop we had 2 instances on the same App Service as our other servers.

I moved the workshop server to its own App Service so we can scale it independently. I could not find an explanation for the high CPU load, as I no longer observe it. It might be that a lot of nodes had central tasks polling (although these requests are very lightweight)?

What I did

  1. I started with a DEV App Service plan with 1 instance, which gave me a similarly crappy experience to the one during the workshop. Starting the nodes took several minutes and during this time the UI was unusable. Sending tasks worked, but was very laggy.
  2. I then upgraded to DEV App Service plan with 1 instances, which was an upgrade (I rebooted all the nodes to make sure they were assigned to different instances). The experience improved somewhat but was still crap.
  3. Then I upgraded to Premium v3 P3V3 with 3 instances, which was a significant improvement from my end. It only got laggy when I started about 10 tasks.
  4. Then I upgraded to Premium v3 P3V3 with 6 instances, which worked very well. I sent batches of 10 (infinite central) tasks and the UI slowed down a little, but it still felt OK-ish.

Of course we have ~30 participants, which means about 30 nodes + 30 open UIs. That would still be a challenge. We can scale to 30 instances, but we should also consider the costs at this point (30*450 / 31 = 435 euros per 24h). If we are conservative we can probably use it for about 18h, which comes to about 326 euros.
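The cost estimate above can be reproduced with a small sketch. It assumes that 450 is the price in euros of one instance per month and that the month has 31 days, which is how I read the formula; the function name and parameters are made up for illustration.

```python
def app_service_cost(instances, monthly_price_eur, hours, days_in_month=31):
    """Estimate the cost in euros of running `instances` for `hours` hours.

    Assumption: billing is proportional to time, so the monthly price
    is simply divided by the number of hours in the month.
    """
    daily_cost = instances * monthly_price_eur / days_in_month
    return daily_cost * hours / 24


# Compare a full day with the more conservative 18-hour window.
for hours in (24, 18):
    print(f"{hours}h: {app_service_cost(30, 450, hours):.2f} EUR")
```

Changing `instances` or `hours` gives a quick way to compare scaling options before committing to one.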

Recommendations

frankcorneliusmartin commented 1 month ago

The only thing we have for the nodes shutting down is the following logs. The node seems to receive a normal kill signal:

2024-09-20 20:57:00 - node           - INFO     - Node is interrupted, shutting down...
2024-09-20 20:57:00 - socket         - INFO     - Disconnected from the server
2024-09-20 20:57:01 - socket         - INFO     - Oak_Date left room collaboration_104
2024-09-20 20:57:01 - network_man..  - DEBUG    - Disconnecting vantage6-Oak_Apple-user from network'vantage6-Oak_Apple-user-net'
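To line up node shutdowns with other events on the machine, the shutdown timestamps can be pulled out of logs like the ones above. This is a minimal sketch: the regular expression is only inferred from the four log lines shown here, and `shutdown_times` is a hypothetical helper, not part of vantage6.

```python
import re
from datetime import datetime

# Layout inferred from the log lines above:
# "<date> <time> - <logger> - <LEVEL> - <message>"
LOG_LINE = re.compile(r"^(\S+ \S+) - (\S+)\s+- (\w+)\s+- (.*)$")


def shutdown_times(lines):
    """Return the timestamps of all log lines that report a node shutdown."""
    times = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "shutting down" in match.group(4):
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


sample = [
    "2024-09-20 20:57:00 - node           - INFO     - Node is interrupted, shutting down...",
    "2024-09-20 20:57:00 - socket         - INFO     - Disconnected from the server",
]
print(shutdown_times(sample))
```

The resulting timestamps can then be compared by hand with whatever the hosting platform reports for the same period.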

I also see that the node frees up memory at this time, but it is definitely not out of memory: [Image]

The machine has no swap memory at all, so that might be the cause.
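Whether a Linux machine has swap can be checked from `/proc/meminfo`. This is a rough sketch that assumes the node machine runs Linux; the helper names are made up for illustration.

```python
def swap_total_kib(meminfo_text):
    """Return the SwapTotal value in KiB from /proc/meminfo content, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like "SwapTotal:       0 kB".
            return int(line.split()[1])
    return None


def read_swap_total_kib(path="/proc/meminfo"):
    """Read the swap size of the current machine (Linux only)."""
    with open(path) as handle:
        return swap_total_kib(handle.read())
```

A result of `0` from `read_swap_total_kib()` means no swap is configured.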

frankcorneliusmartin commented 1 month ago

OK, I think I found the cause. At the time the nodes shut down, Azure was applying updates to our VM. I am not sure why that kills our containers.

It seems we cannot control this process, as it is an 'Azure Managed - Safe Deployment'.

I suggest we take the risk. Worst case we would need to reboot the nodes.

frankcorneliusmartin commented 1 month ago

We are going with this plan.