[Bug 🐞]: Cannot deploy VM in Australia, but can in Europe

sony87 commented 4 months ago

What happened?

I'm in Europe and can deploy VMs on European farms without issue. If I try to deploy to any Australian farm/node, it consistently fails after 10 minutes.

What did you expect?

To be able to deploy everywhere regardless of my location.

What browsers are you seeing the problem on?

No response

ZOS info

No response

Dashboard info

No response

weblets info

No response

Relevant log output

No response

xmonader commented 4 months ago

Can you please add the farm ID or the node ID that you tried to deploy on?

xmonader commented 4 months ago

Checked nodes 4985 and 2594; couldn't deploy on either.

It kept showing "Waiting for deployment with contract_id: 236293 to be ready" and "Waiting for deployment with contract_id: 236290 to be ready".

sony87 commented 4 months ago

Nodes: 4349, 4350. On the farm "Mango Farm", most of the nodes do not work: 2595, 2596, 2636, etc.

sabrinasadik commented 4 months ago

The problem might be caused by latency to the hub. This in turn could cause the deployment to time out while it's fetching data from the hub (probably when copying a disk image from 0-fs to the local disk). If this is indeed the problem, it can be verified as follows:

- Check on metrics.grid.tf: you should see network usage at the time of the deployment that lasts for more than 10 minutes and can be considered slow.
- If you verify it yourself:
  - Start a VM with a disk image that is not on the node you're deploying on.
  - Check the metrics; you will see relatively consistent network usage.
  - After a while (about 10 minutes) the deployment will time out.
  - Network usage will still be the same.
  - After some more time, the network usage will drop again (this means the disk image has finished downloading).
  - If you now deploy the same disk image again, it should work.
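
The check on the metrics described above can be approximated in code. This is only a minimal sketch in Python, assuming the node's network throughput has been exported as a list of (timestamp, bytes per second) samples. The function name, the threshold, and the sample layout are hypothetical and are not part of metrics.grid.tf or ZOS.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample layout: (unix timestamp in seconds, throughput in bytes/s)
Sample = Tuple[float, float]


def download_finished_at(
    samples: Sequence[Sample],
    busy_threshold: float,
    quiet_samples: int = 3,
) -> Optional[float]:
    """Return the timestamp at which throughput first stays below
    busy_threshold for quiet_samples consecutive samples, after having
    been at or above it at least once. Return None if that never happens."""
    seen_busy = False   # True once we have seen sustained traffic
    quiet_run = 0       # number of consecutive quiet samples so far
    quiet_start = None  # timestamp of the first sample in the quiet run
    for ts, rate in samples:
        if rate >= busy_threshold:
            seen_busy = True
            quiet_run = 0
            quiet_start = None
        elif seen_busy:
            if quiet_run == 0:
                quiet_start = ts
            quiet_run += 1
            if quiet_run >= quiet_samples:
                return quiet_start
    return None
```

If the function returns a timestamp, that is roughly when the image download appears to have finished, and a redeploy after that point would be expected to find the flist already cached. A short dip in traffic does not count, since the quiet run is reset by any busy sample.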

I'm assuming the disk copy keeps running after the deployment time-out. If that is not the case, you'll have to redeploy a couple of times possibly, until the disk image is in the 0-fs cache completely.

If this is indeed the case, then either a workaround is needed in ZOS, or the actual solution is to make sure that the hub is present in multiple geographic regions so latency is consistently low (a distributed hub or some kind of CDN).

PeterNashaat commented 4 months ago

Deploying a VM on TheBatcave (farm ID 2252, node ID 4985)

A VM with the Ubuntu 22 flist, which was already downloaded on the node, works fine.
- ZOS logs:

 [+] flistd: 2024-02-27T09:25:36Z info flist already in on the filesystem url=https://hub.grid.tf/tf-official-vms/ubuntu-22.04.flist

Deploying with the NixOS flist, which had not been used on that node before:

ZOS logs:


2024-02-27 14:34:00 | [+] flistd: 2024-02-27T13:34:00Z info request to mount flist: {ReadOnly:true Limit:0 Storage: PersistedVolume:} name=cloud-container:c65ef166512f3d5fe7c61fc3d8dd3c89 storage= url=https://hub.grid.tf/tf-autobuilder/cloud-container-8730b6f.flist
2024-02-27 14:33:57 | [+] identityd: 2024-02-27T13:33:57Z info checking for update after milliseconds wait=4440000
2024-02-27 14:33:57 | [+] identityd: 2024-02-27T13:33:57Z info checking if update is required current=3.9.0 latest=3.9.0
2024-02-27 14:33:56 | [+] flistd: 2024-02-27T13:33:56Z info starting g8ufs daemon args=["--cache","/var/cache/modules/flistd/cache","--meta","/var/cache/modules/flistd/flist/fa05b43ad1c5362453cb70de7cea9664","--daemon","--log","/var/cache/modules/flistd/log/fa05b43ad1c5362453cb70de7cea9664.log"] storage= url=https://hub.grid.tf/tf-official-vms/nixos-22.11.flist
2024-02-27 14:33:54 | [+] flistd: 2024-02-27T13:33:54Z info request to mount flist storage= url=https://hub.grid.tf/tf-official-vms/nixos-22.11.flist
2024-02-27 14:33:54 | [+] flistd: 2024-02-27T13:33:54Z info request to mount flist: {ReadOnly:true Limit:0 Storage: PersistedVolume:} name=604-240316-thebatcavetest2 storage= url=https://hub.grid.tf/tf-official-vms/nixos-22.11.flist

<img width="1318" alt="image" src="https://github.com/threefoldtech/test_feedback/assets/13523434/190e3e58-bc44-4e21-9370-273d95cc3247">

  - Node network traffic was at its peak and still rising each minute, as these two screenshots show:

<img width="659" alt="image" src="https://github.com/threefoldtech/test_feedback/assets/13523434/d065441c-a6fb-4cf9-af89-ccbf0ce279e9">
<img width="710" alt="image" src="https://github.com/threefoldtech/test_feedback/assets/13523434/07224d90-2785-431c-bedc-b2f27e6548fd">

- On the dashboard, it was first waiting for the VM to be ready:

Waiting for deployment with contract_id: 240316 to be ready

   - Then it returned this error:

Failed to send request to twinId 7688 with command: zos.deployment.get, payload: {"contract_id":240316} Didn't get a response after 20 seconds


- Then the contracts got cancelled
   - ZOS logs:

<img width="1135" alt="image" src="https://github.com/threefoldtech/test_feedback/assets/13523434/bf18687f-c1a0-4c49-b452-a93c1ad3f52e">
   - Network traffic still rising:
<img width="701" alt="image" src="https://github.com/threefoldtech/test_feedback/assets/13523434/249983a6-ddc3-4962-927b-e334013babcb">

- Tried deploying NixOS again after network traffic decreased
   - ZOS logs:

[+] flistd: 2024-02-27T14:05:57Z info flist already in on the filesystem url=https://hub.grid.tf/tf-official-vms/nixos-22.11.flist

  - Network traffic:
<img width="697" alt="image" src="https://github.com/threefoldtech/test_feedback/assets/13523434/fd505eee-9c59-48b6-a557-5736d313726c">

- The VM was deployed successfully
<img width="769" alt="image" src="https://github.com/threefoldtech/test_feedback/assets/13523434/295e3fc8-623d-4ba1-aba0-f9b5ada3e733">

- Did a quick speed test on the VM

root@thebatcavetest:~# speedtest-cli
Retrieving speedtest.net configuration...
Testing from Aussie Broadband (159.196.171.188)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Superloop Australia Pty Ltd (Sydney) [0.09 km]: 16.961 ms
Testing download speed................................................................................
Download: 269.32 Mbit/s
Testing upload speed......................................................................................................
Upload: 23.58 Mbit/s



@sabrinasadik Confirmed: the flist download from the hub takes a long time, which causes a timeout on the dashboard side and then cancels the contracts. The flist download continues, however, and deploying again works once the download is done.
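
Until there is a fix, the redeploy-until-cached behaviour confirmed above could be scripted. This is a rough sketch in Python and not an official client: `deploy` is a hypothetical callable supplied by the caller that attempts one deployment and returns True on success, and the attempt count and wait time are arbitrary assumptions. Since the contract of a timed-out attempt is cancelled, each retry is a fresh deployment.

```python
import time
from typing import Callable


def deploy_with_retries(
    deploy: Callable[[], bool],
    max_attempts: int = 5,
    wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call deploy() until it returns True and return the number of the
    attempt that succeeded. Raise RuntimeError if every attempt fails.

    deploy is a hypothetical caller-supplied function; the wait between
    attempts gives the node time to finish downloading the flist.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        if deploy():
            return attempt
        # Do not wait after the final failed attempt.
        if attempt < max_attempts:
            sleep(wait_seconds)
    raise RuntimeError(f"deployment still failing after {max_attempts} attempts")
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use the default `time.sleep` applies.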

sony87 commented 4 months ago

So what you are saying is that I need to keep redeploying on the same machine until it completes?

sabrinasadik commented 4 months ago

Until we have a workaround or fix the issue, yes. @xmonader let's discuss further to have a solution for this.
