threefoldtech / terraform-provider-grid

QSFS: failed to update deployment #246

Open mohamedamer453 opened 2 years ago

mohamedamer453 commented 2 years ago

I was following the flow described in TC242 to test the reliability and stability of QSFS.

First, I started with a (16+4+4) setup:

locals {
  metas = ["meta1", "meta2", "meta3", "meta4"]
  datas = ["data1", "data2", "data3", "data4",
  "data5", "data6", "data7", "data8",
  "data9", "data10", "data11", "data12",
  "data13", "data14", "data15", "data16",
  "data17", "data18", "data19", "data20",
  "data21", "data22", "data23", "data24"]
}

minimal_shards = 16
expected_shards = 20
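
For context, this is roughly how the (16+4+4) shorthand maps onto the qsfs block in the full config at the end of this report (the count comments are annotations, not part of the file):

qsfs {
  minimal_shards  = 16 # the "16": minimum number of shards needed to recover the data
  expected_shards = 20 # the first "+4": 16 + 4 shards written for each file
  # the second "+4": length(local.datas) = 24, i.e. 4 spare data backends
  # beyond expected_shards
}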

After deploying with this setup, I was able to SSH into the machine and write a 1 GB file. Then I changed the setup as described in the test case by removing 4 ZDBs.

New setup (16+4):

locals {
  metas = ["meta1", "meta2", "meta3", "meta4"]
  datas = ["data1", "data2", "data3", "data4",
  "data5", "data6", "data7", "data8",
  "data9", "data10", "data11", "data12",
  "data13", "data14", "data15", "data16",
  "data17", "data18", "data19", "data20"]
}

After updating the deployment I was still able to SSH into the machine and access the old files. I then created a 300 MB file and changed the setup again by removing another 4 ZDBs.

New setup:

locals {
  metas = ["meta1", "meta2", "meta3", "meta4"]
  datas = ["data1", "data2", "data3", "data4",
  "data5", "data6", "data7", "data8",
  "data9", "data10", "data11", "data12",
  "data13", "data14", "data15", "data16"]
}
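
For reference, the backend counts at this step (annotation only, not part of the config):

# length(local.datas) = 16
# minimal_shards      = 16 (unchanged in the qsfs block)
# expected_shards     = 20 (unchanged) -> now more expected shards than data backends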

But when I tried to update the deployment this time, I got the following error:

╷
│ Error: failed to deploy deployments: error waiting deployment: workload 1 failed within deployment 5211 with error failed to update qsfs mount: failed to restart zstor process: non-zero exit code: 1; failed to revert deployments: error waiting deployment: workload 0 failed within deployment 5211 with error failed to update qsfs mount: failed to restart zstor process: non-zero exit code: 1; try again
│ 
│   with grid_deployment.qsfs,
│   on main.tf line 51, in resource "grid_deployment" "qsfs":
│   51: resource "grid_deployment" "qsfs" {
│ 
╵

provider "grid" { }

locals {
  metas = ["meta1", "meta2", "meta3", "meta4"]
  datas = ["data1", "data2", "data3", "data4",
  "data5", "data6", "data7", "data8",
  "data9", "data10", "data11", "data12",
  "data13", "data14", "data15", "data16",
  "data17", "data18", "data19", "data20",
  "data21", "data22", "data23", "data24"]
}

resource "grid_network" "net1" { nodes = [7] ip_range = "10.1.0.0/16" name = "network" description = "newer network" }

resource "grid_deployment" "d1" { node = 7 dynamic "zdbs" { for_each = local.metas content { name = zdbs.value description = "description" password = "password" size = 10 mode = "user" } } dynamic "zdbs" { for_each = local.datas content { name = zdbs.value description = "description" password = "password" size = 10 mode = "seq" } } }

resource "grid_deployment" "qsfs" { node = 7 network_name = grid_network.net1.name ip_range = lookup(grid_network.net1.nodes_ip_range, 7, "") qsfs { name = "qsfs" description = "description6" cache = 10240 # 10 GB minimal_shards = 16 expected_shards = 20 redundant_groups = 0 redundant_nodes = 0 max_zdb_data_dir_size = 512 # 512 MB encryption_algorithm = "AES" encryption_key = "4d778ba3216e4da4231540c92a55f06157cabba802f9b68fb0f78375d2e825af" compression_algorithm = "snappy" metadata { type = "zdb" prefix = "hamada" encryption_algorithm = "AES" encryption_key = "4d778ba3216e4da4231540c92a55f06157cabba802f9b68fb0f78375d2e825af" dynamic "backends" { for_each = [for zdb in grid_deployment.d1.zdbs : zdb if zdb.mode != "seq"] content { address = format("[%s]:%d", backends.value.ips[1], backends.value.port) namespace = backends.value.namespace password = backends.value.password } } } groups { dynamic "backends" { for_each = [for zdb in grid_deployment.d1.zdbs : zdb if zdb.mode == "seq"] content { address = format("[%s]:%d", backends.value.ips[1], backends.value.port) namespace = backends.value.namespace password = backends.value.password } } } } vms { name = "vm" flist = "https://hub.grid.tf/tf-official-apps/base:latest.flist" cpu = 2 memory = 1024 entrypoint = "/sbin/zinit init" planetary = true env_vars = { SSH_KEY = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC533B35CELELtgg2d7Tsi5KelLxR0FYUlrcTmRRQuTNP9arP01JYD8iHKqh6naMbbzR8+M0gdPEeRK4oVqQtEcH1C47vLyRI/4DqahAE2nTW08wtJM5uiIvcQ9H2HMzZ3MXYWWlgyHMgW2QXQxzrRS0NXvsY+4wxe97MMZs9MDs+d+X15DfG6JffjMHydi+4tHB50WmHe5tFscBFxLbgDBUxNGiwi3BQc1nWIuYwMMV1GFwT3ndyLAp19KPkEa/dffiqLdzkgs2qpXtfBhTZ/lFeQRc60DHCMWExr9ySDbavIMuBFylf/ZQeJXm9dFXJN7bBTbflZIIuUMjmrI7cU5eSuZqAj5l+Yb1mLN8ljmKSIM3/tkKbzXNH5AUtRVKTn+aEPvJAEYtserAxAP5pjy6nmegn0UerEE3DWEV2kqDig3aPSNhi9WSCykvG2tz7DIr0UP6qEIWYMC/5OisnSGj8w8dAjyxS9B0Jlx7DEmqPDNBqp8UcwV75Cot8vtIac= root@mohamed-Inspiron-3576" } mounts { disk_name = "qsfs" mount_point = "/qsfs" } } } output "metrics" { value = grid_deployment.qsfs.qsfs[0].metrics_endpoint } output "ygg_ip" { value = grid_deployment.qsfs.vms[0].ygg_ip }

mohamedamer453 commented 2 years ago

I encountered this issue again in another scenario as described in TC354 & TC355.

The initial setup was (16+4+4)

locals {
  metas = ["meta1", "meta2", "meta3", "meta4"]
  datas = ["data1", "data2", "data3", "data4",
  "data5", "data6", "data7", "data8",
  "data9", "data10", "data11", "data12",
  "data13", "data14", "data15", "data16",
  "data17", "data18", "data19", "data20",
  "data21", "data22", "data23", "data24"]
}

minimal_shards = 16
expected_shards = 20

Then I wrote some small/mid-sized data files and updated the setup to (16+0+0):

locals {
  metas = ["meta1", "meta2", "meta3", "meta4"]
  datas = ["data1", "data2", "data3", "data4",
  "data5", "data6", "data7", "data8",
  "data9", "data10", "data11", "data12",
  "data13", "data14", "data15", "data16"]
}

After killing 8 ZDBs the storage was still working and I was able to access the files I had created. I then re-added 4 ZDBs, but the update failed; a sketch of the restored list and the errors I initially got follow.
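
For reference, the restored data list at that point should have been back to these 20 entries (the error output below names data17-data20 as the backends being added):

locals {
  metas = ["meta1", "meta2", "meta3", "meta4"]
  datas = ["data1", "data2", "data3", "data4",
  "data5", "data6", "data7", "data8",
  "data9", "data10", "data11", "data12",
  "data13", "data14", "data15", "data16",
  "data17", "data18", "data19", "data20"]
}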

╷
│ Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for grid_deployment.qsfs to include new values learned so far during apply, provider
│ "registry.terraform.io/threefoldtech/grid" produced an invalid new value for
│ .qsfs[0].groups[0].backends[16].namespace: was cty.StringVal(""), but now cty.StringVal("451-5297-data17").
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
╷
│ Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for grid_deployment.qsfs to include new values learned so far during apply, provider
│ "registry.terraform.io/threefoldtech/grid" produced an invalid new value for
│ .qsfs[0].groups[0].backends[17].namespace: was cty.StringVal(""), but now cty.StringVal("451-5297-data18").
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
╷
│ Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for grid_deployment.qsfs to include new values learned so far during apply, provider
│ "registry.terraform.io/threefoldtech/grid" produced an invalid new value for
│ .qsfs[0].groups[0].backends[18].namespace: was cty.StringVal(""), but now cty.StringVal("451-5297-data19").
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
╷
│ Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for grid_deployment.qsfs to include new values learned so far during apply, provider
│ "registry.terraform.io/threefoldtech/grid" produced an invalid new value for
│ .qsfs[0].groups[0].backends[19].namespace: was cty.StringVal(""), but now cty.StringVal("451-5297-data20").
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

I then re-applied, and it finally produced this error:

╷
│ Error: failed to deploy deployments: error waiting deployment: workload 0 failed within deployment 5299 with error failed to update qsfs mount: failed to restart zstor process: non-zero exit code: 1; failed to revert deployments: error waiting deployment: workload 0 failed within deployment 5299 with error failed to update qsfs mount: failed to restart zstor process: non-zero exit code: 1; try again
│ 
│   with grid_deployment.qsfs,
│   on main.tf line 52, in resource "grid_deployment" "qsfs":
│   52: resource "grid_deployment" "qsfs" {
│ 
╵
mohamedamer453 commented 2 years ago

The same error occurred with the setup in TC356.

ad-astra-industries commented 2 years ago

Is there any update for this issue?