hashicorp / packer

Packer is a tool for creating identical machine images for multiple platforms from a single source configuration.
http://www.packer.io
Other
14.98k stars 3.33k forks source link

yamux: keepalive failed: i/o deadline reached #11509

Open pgrmega opened 2 years ago

pgrmega commented 2 years ago

Hello,

Overview of the Issue

We use packer to automatically build HyperV VMs. The scripts are directly executed on the HyperV server heberged in our own local infrastructure. We build several VMs in parallel (up to 7).

Sometime we encounter an error "yamux: keepalive failed: i/o deadline reached". We cannot determine a link between this error and any element (CPU, memory, network, disk, etc...).

We looked for any known issue looking like what we experiment and find #3378, #5360, #6410, #6475 But it didn't helped us => we are not running in Azure but on a Windows Server 2016.

Reproduction Steps

No specific element identified to reproduce the issue. We lowered the number of VMs in parallel. We tried to check if reboot of the server improve the behavior. But no way to find a determining element.

Packer version

From packer version 1.7.8

Operating system and Environment details

Windows Server 2016 Intel Xeon CPU X5650 @ 2.67GHz (2procs) RAM : 192 GB

Log Fragments and crash.log files

2022/01/19 23:01:22 [ERR] yamux: keepalive failed: i/o deadline reached 2022/01/19 23:02:07 [INFO] (telemetry) ending powershell 2022/01/19 23:02:22 [ERR] yamux: keepalive failed: i/o deadline reached 2022/01/19 23:02:22 [INFO] (telemetry) ending hyperv-vmcx 2022/01/19 23:02:22 [INFO] 537 bytes written for 'stderr' 2022/01/19 23:02:22 [INFO] 20095 bytes written for 'stdout' 2022/01/19 23:02:22 [ERR] Error decoding response stream 6: EOF 2022/01/19 23:02:22 [INFO] RPC endpoint: Communicator ended with: 123 2022/01/19 23:03:12 packer.exe plugin: [INFO] 537 bytes written for 'stderr' 2022/01/19 23:03:12 packer.exe plugin: [INFO] 20095 bytes written for 'stdout' 2022/01/19 23:03:12 packer.exe plugin: [ERR] Error decoding response stream 5: EOF 2022/01/19 23:03:12 packer.exe plugin: c:/Windows/Temp/script-61e85116-8b32-7360-3555-f83ea4f243bc.ps1 returned with exit code 123 2022/01/19 23:03:12 packer.exe plugin: waiting for all plugin processes to complete...

Thank you for any help

nywilken commented 2 years ago

Thanks for reaching out. We’ll take a look to see what might be causing this. Thanks for linking the other issues in the description.

azr commented 2 years ago

Hey @pgrmega, do you have more info for us to be able to repro this ? Like a simplified packer build config ? And, maybe the system + shell this is running on ?

We will also need the full log.

Thanks !

pgrmega commented 2 years ago

Hello,

We migrated the server on wich it was running from Windows Server 2016 to Windows Server 2019. And the problem does not occurs since this migration.

This was running on Powershell 5.2 on Windows Server 2016.

JSON Config Files :

  "variables": {
      ...
    },
  "builders": [
    {
    "type": "hyperv-vmcx",
    "clone_from_vmcx_path": "{{user `PackerVMCXPath`}}",
    "vm_name": "{{user `VMName`}}",
    "temp_path": "{{user `PackerTempPath`}}",
    "output_directory": "{{user `PackerExportPath`}}",
    "boot_wait": "5m",
    "headless": true,

    "generation": "{{user `VMGeneration`}}",
    "memory": 12288,
    "enable_dynamic_memory": true,
    "cpus": 6,
    "switch_name": "Microsoft Network Adapter Multiplexor Driver - Virtual Switch",

    "communicator": "winrm",
    "winrm_use_ssl": false, 
    "winrm_insecure": true,
    "winrm_port": 5985,
    "winrm_timeout": "1h",
    "winrm_username": "{{user `VMAdminUser`}}",
    "winrm_password": "{{user `VMAdminPwd`}}",

    "shutdown_command": "IF EXIST C:\\exploitation\\run-sysprep.cmd (CALL C:\\exploitation\\run-sysprep.cmd)",
    "shutdown_timeout": "30m"

    }
  ],
  "provisioners": [
    ...
  ]
}
Hi-Angel commented 7 months ago

I'm not even sure where this error is coming from. I cloned both the packer repo and the plugin I use, and neither has deadline reached text in their code 🤷‍♂️

rsergeevbgc commented 2 weeks ago

We seem to experience the same issue on the following Nomad version run on M3 macbook pro.

nomad -v
Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

This happens when we run nomad in a dev mode to construct a local cluster for testing. No other jobs run on a cluster. The This is a portion of an error:

2024-06-17T11:51:57.163-0400 [DEBUG] worker: ack evaluation: worker_id=6b2571cd-ad0e-4bbe-eae8-8a1cf854b5d1 eval_id=550ddcd7-83f9-fc31-b812-cd91929c61b0 type=_core namespace=- job_id=deployment-gc node_id="" triggered_by=scheduled
2024-06-17T11:51:57.163-0400 [DEBUG] core.sched: node GC scanning before cutoff index: index=0 node_gc_threshold=24h0m0s
2024-06-17T11:51:57.163-0400 [DEBUG] worker: ack evaluation: worker_id=a024c56d-604e-12af-60dc-b3f050845abd eval_id=48563a36-3321-9500-dbe1-6a23c0906474 type=_core namespace=- job_id=job-gc node_id="" triggered_by=scheduled
2024-06-17T11:51:57.164-0400 [DEBUG] worker: ack evaluation: worker_id=1c06a282-c128-3f0e-3d3b-1dd79d22c8a9 eval_id=2df3a1de-a8f2-e7a2-a570-2deadf6815fd type=_core namespace=- job_id=node-gc node_id="" triggered_by=scheduled
2024-06-17T11:51:57.290-0400 [WARN]  nomad.heartbeat: node TTL expired: node_id=68db21fe-a70b-ae80-16d4-1a4465059846
2024-06-17T11:52:09.858-0400 [ERROR] client: yamux: keepalive failed: i/o deadline reached
2024-06-17T11:52:09.858-0400 [ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
2024-06-17T11:52:09.858-0400 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:52:09.859-0400 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:52:09.859-0400 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:52:09.859-0400 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1s
2024-06-17T11:52:09.858-0400 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"
2024-06-17T11:52:09.865-0400 [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=["local"]
2024-06-17T11:52:09.869-0400 [INFO]  client.consul: discovered following servers: servers=[XX.XXX.XX.XXX:PPPP]
2024-06-17T11:52:10.862-0400 [DEBUG] client: state updated: node_status=ready
2024-06-17T11:52:10.862-0400 [WARN]  client: missed heartbeat: req_latency=1.723458ms heartbeat_ttl=17.502723309s since_last_heartbeat=41.074529s
2024-06-17T11:52:20.394-0400 [DEBUG] http: request complete: method=GET path=/v1/namespaces?index=1 duration=5m7.06801275s
2024-06-17T11:52:36.185-0400 [DEBUG] http: request complete: method=GET path="/v1/jobs?meta=true&namespace=%2A&index=1" duration=5m16.53492425s
2024-06-17T11:56:36.761-0400 [WARN]  nomad.heartbeat: node TTL expired: node_id=68db21fe-a70b-ae80-16d4-1a4465059846
2024-06-17T11:56:49.877-0400 [ERROR] client: yamux: keepalive failed: i/o deadline reached
2024-06-17T11:56:49.877-0400 [ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
2024-06-17T11:56:49.877-0400 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"
2024-06-17T11:56:49.878-0400 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:56:49.878-0400 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:56:49.878-0400 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:56:49.878-0400 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1s
2024-06-17T11:56:49.882-0400 [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=["local"]
2024-06-17T11:56:49.887-0400 [INFO]  client.consul: discovered following servers: servers=[XX.XXX.XX.XXX:PPPP]
2024-06-17T11:56:50.880-0400 [DEBUG] client: state updated: node_status=ready

This works normal when run on our CentOS servers.