istio / ztunnel

The `ztunnel` component of ambient mesh
Apache License 2.0

Upgrading Ztunnel to 1.24 with `cni.ambient.dnsCapture=true` causes DNS failures for workloads programmed by CNI 1.23 #1360


joke commented 1 week ago

After upgrading to Istio 1.24, the ztunnel on some (or all) nodes had problems after startup and never became ready. The logs reported tens of thousands of errors like these.

It's hard to say whether all nodes were affected, because all Karpenter node pools had been relabeled, causing nodes to be re-created in order to restart all pods and force sidecar injection.

{"level":"warn","time":"2024-11-08T08:32:16.370762Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 16243 got: 15069, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371049Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 55666 got: 43184, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371061Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 47010 got: 27434, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371069Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 59852 got: 17994, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371086Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 38329 got: 27832, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371402Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 63660 got: 7533, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.372715Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 41538 got: 33375, dropped"}
{"level":"error","time":"2024-11-08T08:32:16.372955Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372966Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372968Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372970Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372971Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372973Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}

Restarting the ztunnel pod didn't solve the problem either; only terminating the node did. Too-many-open-files problems had not been an issue before.

EKS: 1.31
Nodes: Bottlerocket 1.26.1
istio: 1.23.3 -> 1.24.0
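For anyone debugging similar symptoms, a quick way to see how close a process is to its file-descriptor limit. This is a sketch assuming a Linux node with `/proc`; the pid here is illustrative and would be replaced with the ztunnel pid in practice:

```shell
# Sketch: compare a process's open file descriptors against its soft limit.
# Assumes Linux /proc; substitute the ztunnel pid for $$ in practice,
# e.g. pid=$(pgrep -f ztunnel).
pid=$$
echo "soft fd limit: $(ulimit -n)"
echo "open fds:      $(ls /proc/$pid/fd | wc -l)"
```

If the open-fd count is near the soft limit, that process will start failing with `Too many open files (os error 24)` regardless of system-wide headroom.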

howardjohn commented 1 week ago

Thanks for the report. Do you happen to know whether the first logs were like `expected message id: 16243 got: 15069, dropped` or `Too many open files`? Clearly they are related, but I'm curious if there is any indication which one is the original cause.

It could be too many open files -> DNS issues, or DNS issues -> too many retries -> too many open files.

Either way, it is a bit odd that a Ztunnel restart did not resolve the issue.

Am I correct in understanding this only occurred when upgrading 1.23 to 1.24, and that after restarting the nodes cleanly on 1.24 there are no issues?

bleggett commented 1 week ago

Might be worth running `sysctl fs.file-nr` on the affected nodes.
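For reference, `fs.file-nr` reports three fields: allocated file handles, allocated-but-unused handles, and the system-wide maximum. A small sketch of how to pull the fields apart (sample values for illustration; on a node you would read `/proc/sys/fs/file-nr` directly):

```shell
# Parse the three fields of fs.file-nr: allocated handles,
# allocated-but-unused handles, and the system-wide maximum.
# (Sample line for illustration only.)
line="17568 0 9223372036854775807"
set -- $line
echo "allocated=$1 unused=$2 max=$3"
```

Note that `fs.file-nr` is system-wide; a per-process `ulimit -n` can still be exhausted well below that maximum.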

howardjohn commented 1 week ago

The issue is https://github.com/istio/ztunnel/pull/1282 / https://github.com/istio/istio/pull/52867.

In that, I said:

The supported upgrade path is CNI first, then Ztunnel.

CNI w/ this patch, Ztunnel 1.23: TCP will start redirecting, which is already supported by Ztunnel. UDP change does nothing

CNI + Ztunnel patched (with https://github.com/istio/ztunnel/pull/1282): DNS requests come from the application pod. UDP packets are marked, so they do not loop due to the CNI change here.

This is not quite right, since the CNI will not reconcile the iptables. So it's really "upgrade CNI, restart all workloads, then upgrade Ztunnel", which is not great.
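To illustrate the marking mechanism referenced above: packets that ztunnel itself originates carry a mark, and the redirect rules skip marked traffic so DNS requests do not loop back into the proxy. The chain name and rules below are hypothetical, not the actual rules istio-cni installs:

```shell
# Hypothetical sketch of loop avoidance via packet marks; the chain name
# and exact mark value are illustrative, not istio-cni's real rules.
# Traffic already marked by ztunnel bypasses the DNS redirect...
iptables -t nat -A EXAMPLE_OUTPUT -p udp --dport 53 \
  -m mark --mark 0x539 -j RETURN
# ...while everything else is redirected to ztunnel's DNS proxy port.
iptables -t nat -A EXAMPLE_OUTPUT -p udp --dport 53 \
  -j REDIRECT --to-ports 15053
```

If the redirect rule exists but the Ztunnel on the node predates the marking change, captured DNS traffic has nowhere coherent to go, which matches the mismatched-message-id warnings in the report.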

howardjohn commented 1 week ago

Just to be very explicit - in the short term, the fix here is to restart your workloads and the issue will resolve. You can prevent the issue from happening in the first place by restarting your workloads between upgrading CNI and Ztunnel.
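Concretely, the safe sequence might look like the following. This is a sketch assuming a Helm-managed install; the release names, chart references, namespace, and workload selection are assumptions, not exact commands for every environment:

```shell
# Illustrative upgrade sequence; release/chart names and namespaces
# are assumptions for a Helm-based install.
# 1. Upgrade the CNI first.
helm upgrade istio-cni istio/cni -n istio-system --version 1.24.0
# 2. Restart ambient workloads so the new CNI re-programs their redirection.
kubectl rollout restart deployment -n <your-ambient-namespace>
# 3. Only then upgrade ztunnel.
helm upgrade ztunnel istio/ztunnel -n istio-system --version 1.24.0
```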

We are exploring some fixes that could drop this requirement in an upcoming patch release

joke commented 5 days ago

Sounds reasonable. I did the CNI and Ztunnel deployments in parallel using an umbrella chart.

I just did another run, doing one update at a time as described above, and everything went smoothly.

BTW, the ambient upgrade guide does list the steps in the right order, but the description of the Ztunnel step only states that the control plane must be updated first.

Do you still need any further information, like the order of the log messages? It seems to me the problem has been identified.

Just for the record:

bash-5.1# sysctl fs.file-nr
fs.file-nr = 17568      0       92233720368547758
howardjohn commented 5 days ago

Nope, this one we understand fully, and we are working on some fixes so it won't require careful upgrade sequencing. FWIW it was also a one-time transition, so an upgrade from 1.24 to 1.25, for example, wouldn't have issues even if no changes were made.