threefoldtech / test_feedback


Workloads not surviving node restarts. #329

Closed Parkers145 closed 1 year ago

Parkers145 commented 1 year ago

Yesterday I was asked to restart the devnet farm. When I did, the VM for publicuptimetfcloud.us/uptimeCheck did not recover after the restart, and the website has been down since. I have also documented this happening on testnet: https://github.com/threefoldtech/zos/issues/1823

Parkers145 commented 1 year ago

User “Teis” on Telegram and the forums is responsible for the VM that went down at reboot.

8091 is the contract for the public IP related to Teis' workload.

Parkers145 commented 1 year ago

Looks like user flow wolf on Telegram also lost a deployment that was on devnet node 30 or 31.

Parkers145 commented 1 year ago

These are all of the IP addresses on devnet farm 49 that I show contracts assigned to and that are not in an up state at the gateway. The public interfaces of all deployments related to these contracts are in a down state.

Public IP        Contract
162.205.240.245  8091
162.205.240.246  12972
162.205.240.248  13084
162.205.240.249  10815
162.205.240.240  13298
162.205.240.243  13596
162.205.240.235  13844
162.205.240.236  14241
162.205.240.237  14171
162.205.240.239  9940
162.205.240.231  14205
162.205.240.232  14014

despiegk commented 1 year ago

I had issues on devnet yesterday too. There may still be some serious issues on devnet. I know we are changing a lot in this 3.8 build; it might be too early a build for the average tester.

Parkers145 commented 1 year ago

Got home and was able to get hands on 49. I had restarted it remotely, and it had booted to a blank black screen. I did another reboot: still a black screen. I pulled it out, reseated everything, and swapped to a new bootstrap USB, and it's back up now. It looks like most of the contracts are showing up links now as well. I'm not sure what happened here. I saw it online in the explorer after the restart on Friday night, but I've never seen that black screen unless it fails to boot.

After fixing 49 being down, these contracts show public IPs that aren't deployed on the network:

Public IP        Contract
162.205.240.232  14014
162.205.240.239  9940
162.205.240.237  14171
162.205.240.235  13844
162.205.240.248  13084
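For anyone who wants to compare the two lists in this thread without doing it by eye, here is a minimal Python sketch. It only parses the IP/contract pairs posted above and diffs them; the function name `parse_ip_contracts` is made up for illustration, and treating "missing from the second list" as "recovered" is an assumption based on the comment that most contracts came back after 49 was fixed.

```python
import ipaddress


def parse_ip_contracts(text: str) -> dict[str, int]:
    """Parse whitespace-separated 'IP contract' pairs, ignoring '-' separators."""
    tokens = [t for t in text.split() if t != "-"]
    if len(tokens) % 2:
        raise ValueError("expected an even number of tokens (IP, contract ID)")
    pairs = {}
    for ip, contract in zip(tokens[0::2], tokens[1::2]):
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        pairs[ip] = int(contract)
    return pairs


# First list: IPs reported as not up at the gateway (copied from the thread).
before = parse_ip_contracts("""
162.205.240.245 8091 162.205.240.246 12972 162.205.240.248 13084
162.205.240.249 10815 162.205.240.240 13298 162.205.240.243 13596
162.205.240.235 13844 162.205.240.236 14241 162.205.240.237 14171
162.205.240.239 9940 162.205.240.231 14205 162.205.240.232 14014
""")

# Second list: IPs still reported after 49 was brought back up.
after = parse_ip_contracts("""
162.205.240.232 14014 - 162.205.240.239 9940 - 162.205.240.237 14171 -
162.205.240.235 13844 - 162.205.240.248 13084 -
""")

# Assumption: an IP that dropped off the second list has recovered.
recovered = sorted(set(before) - set(after), key=ipaddress.ip_address)
still_listed = sorted(set(before) & set(after), key=ipaddress.ip_address)

for ip in still_listed:
    print(f"{ip} (contract {before[ip]}) is still listed")
```

Sorting with `ipaddress.ip_address` as the key orders the addresses numerically rather than as strings, which keeps the output readable when the last octet varies in length.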

muhamadazmy commented 1 year ago

Here are my findings:

muhamadazmy commented 1 year ago

We will still run some tests to make sure workloads are fully restored after a node reboot.

mohamedamer453 commented 1 year ago

Verified on devnet.

I deployed multiple different workloads (Full VM, Micro VM, K8s cluster, Mastodon instance), with unique data on each workload, on nodes 33 and 34. The nodes were then restarted and the workloads were not affected. I checked the resources and data of each workload and everything was fine.
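For reference, one way a "unique data survives the restart" check like this could be scripted is sketched below. This is not the procedure used in the test run linked in this comment; it is a generic Python example, and the names `build_manifest` and `diff_manifests` are hypothetical. The idea is to record a SHA-256 checksum for every file under a directory before the restart, do the same afterwards, and compare the two results.

```python
import hashlib
from pathlib import Path


def build_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    base = Path(root)
    manifest = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(base).as_posix()] = digest
    return manifest


def diff_manifests(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Report files that went missing, appeared, or changed content."""
    shared = set(before) & set(after)
    return {
        "missing": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }
```

In use, `build_manifest` would be run inside each workload before and after the node restart, with the first result saved somewhere off the node. An empty result in all three lists from `diff_manifests` means the data is byte-for-byte identical. Note that `read_bytes` loads each file fully into memory, which is fine for small test data but not for large disks.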

Full details before and after the restart can be seen in this test run: https://app.testlodge.com/a/26076/projects/40893/runs/668672

The only odd thing I noticed was that the created timestamp changed after the restart, so I created a separate issue for it: #337