nanovms / nanos

A kernel designed to run one and only one application in a virtualized environment
https://nanos.org
Apache License 2.0

question(gcp/network/dhcp): sporadic failures to obtain internal IPv4 from dhcp on GCP #2029

Open rinor opened 2 months ago

rinor commented 2 months ago

While deploying https://github.com/nanovms/nanos/pull/2024, I've seen some instances sporadically fail to obtain an internal IPv4 address via DHCP on GCP.

Note: I'm deploying to single-vCPU f1-micro instances (I'm not testing anything SMP-related in this case).

SeaBIOS (version 1.8.2-google)
Total RAM Size = 0x0000000026600000 = 614 MiB
CPUs found: 1     Max CPUs supported: 1
found virtio-scsi at 0:3
virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
virtio-scsi blksize=512 sectors=2097152 = 1024 MiB
virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
virtio-scsi blksize=512 sectors=8388608 = 4096 MiB
drive 0x000f2800: PCHS=0/0/0 translation=lba LCHS=1024/32/63 s=2097152
drive 0x000f27c0: PCHS=0/0/0 translation=lba LCHS=522/255/63 s=8388608
Sending Seabios boot VM event.
Booting from Hard Disk 0...
en1: assigned FE80::4001:AFF:FE80:3D0
# expected errors/complaints from klibs (gcp, ntp, ...)

Out of ~1,000 instances deployed and active within less than 36 hours across 4 different zones of us-central1, at least 100 are suspected to have hit this issue. Most of them needed just one restart to come back online, while a couple of them took 2+ restarts. There was no visible pattern tied to a specific location or a specific time.
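As a rough sanity check on those numbers, here is a minimal sketch that assumes each boot fails to get an IPv4 address independently with the same probability (about 100 out of ~1,000, i.e. 10%). The independence assumption and the function name `expected_restart_counts` are illustrative only; nothing here has been measured.

```python
def expected_restart_counts(total, failure_rate, max_restarts=3):
    """Expected number of instances needing exactly k restarts.

    Assumption (not verified): every boot fails independently with the
    same probability `failure_rate`.
    """
    counts = {}
    for k in range(1, max_restarts + 1):
        # k failed boots in a row, followed by one successful boot
        counts[k] = total * failure_rate ** k * (1 - failure_rate)
    return counts


# ~1,000 instances, ~10% suspected failure rate (figures from the report above)
counts = expected_restart_counts(1000, 0.10)
for restarts, expected in counts.items():
    print(f"{restarts} restart(s): ~{expected:.1f} instances")
```

Under that assumption, most affected instances would recover after a single restart and only a handful would need two or more, which is at least not inconsistent with what I observed. If failures were clustered (same host, same time window), the distribution would look different.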

Before that PR, I had https://github.com/nanovms/nanos/commit/57203bc1a6df1a757a4e4c33a4e5db8b9a2e0f8a deployed in a similar scenario, but with fewer instances (~400) and a much slower deployment pace/frequency, and no such issue was reported or observed (which doesn't mean it did not happen).

This is the base config used:

{
  "Program": "myapp",
  "Version": "myapp-af8b26d-sv70",
  "NanosVersion": "nanos-5779988",
  "Mounts": {
    "myapp-storage@${myappid}-v": "/storage"
  },
  "NameServers": [
    "169.254.169.254",
    "8.8.8.8",
    "1.1.1.1"
  ],
  "Klibs": [
    "gcp",
    "tls",
    "ntp",
    "cloud_init"
  ],
  "ManifestPassthrough": {
    "readonly_rootfs": "true",
    "exec_wait_for_ip4_secs": "5",
    "reboot_on_exit": "*",
    "ntp_servers": [
      "169.254.169.254"
    ],
    "gcp": {
      "metrics": {
        "interval": "300",
        "disk": {}
      }
    },
    "cloud_init": {
      "download_env": [
        {
          "auth": "",
          "src": "http://10.128.0.5:7367/config/{host}/{host}_env.json"
        }
      ]
    }
  },
  "CloudConfig": {
    "Spot": false,
    "Platform": "gcp",
    "ProjectID": "xxxxx",
    "BucketName": "xxxxx",
    "Flavor": "f1-micro",
    "InstanceProfile": "nanos-vm@xxxxx.iam.gserviceaccount.com",
    "VPC": "default",
    "Subnet": "default",
    "Zone": "us-central1-c",
    "Tags": [
      {
        "key": "service",
        "value": "myapp",
        "attribute": {
          "instance_label": true,
          "instance_network": true
        }
      }
    ]
  },
  "RunConfig": {
    "AttachVolumeOnInstanceCreate": true
  }
}

At the moment I don't have more information or other details. Nevertheless, I plan to get back to this and test in a controlled environment.
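For the controlled test, one way to flag affected instances is to scan each serial console dump for address assignment lines. The sketch below assumes that an IPv4 lease is logged in the same `<iface>: assigned <address>` form as the IPv6 link-local line in the console output above; that format for IPv4 is an assumption, and the helper names `assigned_addresses` and `has_ipv4` are hypothetical.

```python
import ipaddress
import re

# Matches lines such as "en1: assigned FE80::4001:AFF:FE80:3D0".
# Assumption: IPv4 leases are logged in the same "<iface>: assigned <addr>" form.
ASSIGNED_RE = re.compile(r"^(\w+): assigned (\S+)$")


def assigned_addresses(console_text):
    """Return (interface, address) pairs found in a serial console dump."""
    found = []
    for line in console_text.splitlines():
        match = ASSIGNED_RE.match(line.strip())
        if not match:
            continue
        try:
            addr = ipaddress.ip_address(match.group(2))
        except ValueError:
            continue  # not a parseable IP address, skip the line
        found.append((match.group(1), addr))
    return found


def has_ipv4(console_text):
    """True if at least one IPv4 address was assigned according to the log."""
    return any(addr.version == 4 for _, addr in assigned_addresses(console_text))


failed_boot = "Booting from Hard Disk 0...\nen1: assigned FE80::4001:AFF:FE80:3D0\n"
print(has_ipv4(failed_boot))  # only the IPv6 link-local address is present
```

Running this over the console output of every instance would give a per-zone, per-time count of boots that only got a link-local IPv6 address, which should help confirm or rule out a pattern.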