threefoldtech / zos

Autonomous operating system
https://threefold.io/host/
Apache License 2.0

Vm deployment with GPU crashes #2349

Closed: maayarosama closed this issue 3 weeks ago

maayarosama commented 1 month ago

Describe the bug

I rented node 4771 and tried deploying a VM with a GPU on it, but I couldn't SSH into it. After some investigation and fetching the changes of the contract, I noticed that the status of the contract is deleted:


[
  {
    version: 0,
    name: 'vmgpuDisk',
    type: 'zmount',
    data: { size: 107374182400 },
    metadata: '',
    description: 'test deploying VM with GPU via ts grid3 client',
    result: { created: 1717314878, state: 'init', message: '', data: null }
  },
  {
    version: 0,
    name: 'vmgpu',
    type: 'zmachine',
    data: {
      flist: 'https://hub.grid.tf/tf-official-vms/ubuntu-22.04.flist',
      network: [Object],
      size: 0,
      compute_capacity: [Object],
      mounts: [Array],
      entrypoint: '/',
      env: [Object],
      corex: false,
      gpu: [Array]
    },
    metadata: '',
    description: 'test deploying VM with GPU via ts grid3 client',
    result: { created: 1717314878, state: 'init', message: '', data: null }
  },
  {
    version: 0,
    name: 'vmgpuDisk',
    type: 'zmount',
    data: { size: 107374182400 },
    metadata: '',
    description: 'test deploying VM with GPU via ts grid3 client',
    result: { created: 1717314880, state: 'ok', message: '', data: [Object] }
  },
  {
    version: 0,
    name: 'vmgpu',
    type: 'zmachine',
    data: {
      flist: 'https://hub.grid.tf/tf-official-vms/ubuntu-22.04.flist',
      network: [Object],
      size: 0,
      compute_capacity: [Object],
      mounts: [Array],
      entrypoint: '/',
      env: [Object],
      corex: false,
      gpu: [Array]
    },
    metadata: '',
    description: 'test deploying VM with GPU via ts grid3 client',
    result: { created: 1717314887, state: 'ok', message: '', data: [Object] }
  },
  {
    version: 0,
    name: 'vmgpu',
    type: 'zmachine',
    data: {
      flist: 'https://hub.grid.tf/tf-official-vms/ubuntu-22.04.flist',
      network: [Object],
      size: 0,
      compute_capacity: [Object],
      mounts: [Array],
      entrypoint: '/',
      env: [Object],
      corex: false,
      gpu: [Array]
    },
    metadata: '',
    description: 'test deploying VM with GPU via ts grid3 client',
    result: {
      created: 1717314925,
      state: 'deleted',
      message: 'workload decommissioned by system, reason: deleting vm due to so many crashes',
      data: null
    }
  }
]
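The change list above can be reduced to the latest state of each workload, which makes the failure easier to spot. This is only a minimal sketch: the `WorkloadChange` and `WorkloadResult` types and the `latestByName` helper are hypothetical names modelled on the fields visible in the dump, not part of the grid3 client API, and the sample data is a trimmed copy of the changes above.

```typescript
// Field names mirror the dump above; the types themselves are assumptions for illustration.
interface WorkloadResult {
  created: number; // Unix timestamp in seconds
  state: string;   // 'init', 'ok' or 'deleted' in the dump above
  message: string;
}

interface WorkloadChange {
  name: string;
  type: string;
  result: WorkloadResult;
}

// Hypothetical helper: keep only the most recent change for each workload name.
function latestByName(changes: WorkloadChange[]): Record<string, WorkloadChange> {
  const latest: Record<string, WorkloadChange> = {};
  for (const change of changes) {
    const seen = latest[change.name];
    // ">=" so that a later entry with the same timestamp wins.
    if (seen === undefined || change.result.created >= seen.result.created) {
      latest[change.name] = change;
    }
  }
  return latest;
}

// Trimmed copy of the changes listed above.
const changes: WorkloadChange[] = [
  { name: 'vmgpuDisk', type: 'zmount', result: { created: 1717314878, state: 'init', message: '' } },
  { name: 'vmgpu', type: 'zmachine', result: { created: 1717314878, state: 'init', message: '' } },
  { name: 'vmgpuDisk', type: 'zmount', result: { created: 1717314880, state: 'ok', message: '' } },
  { name: 'vmgpu', type: 'zmachine', result: { created: 1717314887, state: 'ok', message: '' } },
  {
    name: 'vmgpu',
    type: 'zmachine',
    result: {
      created: 1717314925,
      state: 'deleted',
      message: 'workload decommissioned by system, reason: deleting vm due to so many crashes',
    },
  },
];

const latest = latestByName(changes);
for (const name of Object.keys(latest)) {
  const { state, message } = latest[name].result;
  console.log(`${name}: ${state}${message ? ` (${message})` : ''}`);
}
```

Read this way, the disk stays in state `ok` while the VM goes from `ok` at 1717314887 to `deleted` at 1717314925, i.e. 38 seconds later.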

I also tried deploying a VM on the same node without a GPU, and everything was fine.

To Reproduce

Steps to reproduce the behavior:

1. Rent node `4771`.
2. Deploy a VM with a GPU on it.
3. Try to SSH into it.

ashraffouda commented 1 month ago

which env?

maayarosama commented 1 month ago

> which env?

Mainnet

ashraffouda commented 1 month ago

There seems to be an issue with this node, since it worked on other nodes. Here the contract is created: [Screenshot from 2024-06-02 17-27-35]. And here, about 1 hour later, the node received a cancel contract event: [Screenshot from 2024-06-02 17-30-40].

ashraffouda commented 4 weeks ago

There seems to be an issue with this specific node, because cloud container is giving the error "deleting vm due to so many crashes". I verified that the node has the correct zos version. Also, the node is not in freefarm, so we cannot SSH to it to check on the node itself.

ashraffouda commented 4 weeks ago

Also, for some reason grid proxy seems to be giving wrong info about this specific node: it now says the node has a GPU, but the node doesn't show up as a rentable node even though it is not rented and doesn't have network contracts either. [image]

ashraffouda commented 4 weeks ago

Let's follow up on the grid proxy first, here: https://github.com/threefoldtech/tfgrid-sdk-go/issues/1061

ashraffouda commented 3 weeks ago

So this looks like a node issue, maybe a network issue. I tried using grid-cli: it sometimes fails and sometimes succeeds, something like the following:

[Screenshot from 2024-06-05 16-59-49]

[Screenshot from 2024-06-05 17-00-15]

ashraffouda commented 3 weeks ago

Now it works every time. Maybe it was a network issue and it has been fixed.