Hello @anjan-keysight @biplamal, seeing some flakes when bringing up KNE topos with IxiaTG:
creating topology: failed to create topology: Node "otg": Status FAILED Reason got failure in ixia CRD status: Container ixia-c failed - rpc error: code = Unknown desc = failed to pull and unpack image "us-west1-docker.pkg.dev/.../ixia-c-controller:0.0.1-4013": failed to copy: read tcp 172.18.0.2:47650->74.125.132.82:443: read: connection reset by peer
This happens rarely (less than 1% of the time) but it still affects our KNE test runs. It appears that the Ixia operator reports any image pull error as a FAILED state (https://github.com/open-traffic-generator/ixia-c-operator/blob/a6bc34d9bc987a7d01869cfbfae670c7294862b7/README.md#ixiatg-crd). However, when the pull merely flakes (interrupted connection, etc.), retries should be allowed before FAILED is returned. If the image is not found and that's what led to the failure, then FAILED makes sense. For transient pull errors, though, FAILED is too harsh: the operator should instead keep returning INITIATED for a certain number of pull failures before declaring FAILED.
K8s retries ErrImagePull failures automatically; however, for IxiaTG we poll the status from the operator, and that's what surfaces the error.
This occurs specifically when Kubernetes hits a transient error while pulling the image from a remote repo (in this case, read: connection reset by peer). Normally Kubernetes silently retries these errors and will stay in a backoff loop indefinitely. The ixia-c operator, by contrast, treats these transient errors as unrecoverable failures.
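To make the request concrete, here is a rough sketch of the status logic I have in mind. It is not taken from the operator source: the function names, the retry budget of 5, and the list of "permanent" message substrings are all assumptions for illustration, and matching on message text is only a heuristic. The only real identifiers are the Kubernetes container waiting reasons ErrImagePull and ImagePullBackOff and the INITIATED/FAILED states from the CRD.

```python
# Hypothetical sketch of the proposed behavior; names, thresholds, and
# marker strings are assumptions, not the ixia-c operator's actual code.

# Container waiting reasons Kubernetes reports while an image pull is failing.
PULL_ERROR_REASONS = {"ErrImagePull", "ImagePullBackOff"}

# Substrings assumed to indicate a pull that will not succeed on retry
# (missing image, bad credentials). This list is a heuristic.
PERMANENT_MARKERS = ("not found", "manifest unknown", "unauthorized", "denied")

# Assumed retry budget: how many pull failures to tolerate before FAILED.
MAX_TRANSIENT_PULL_FAILURES = 5


def is_permanent_pull_error(message: str) -> bool:
    """Return True if the message looks like an unrecoverable pull error."""
    lowered = message.lower()
    return any(marker in lowered for marker in PERMANENT_MARKERS)


def status_for_pull_error(reason: str, message: str, failures_seen: int,
                          budget: int = MAX_TRANSIENT_PULL_FAILURES) -> str:
    """Map a container pull error to the CRD state reported to pollers."""
    if reason not in PULL_ERROR_REASONS:
        raise ValueError(f"not an image pull error: {reason}")
    # A missing image or an auth problem will not fix itself: fail fast.
    if is_permanent_pull_error(message):
        return "FAILED"
    # Transient errors stay INITIATED until the retry budget is used up.
    if failures_seen >= budget:
        return "FAILED"
    return "INITIATED"
```

With something like this, the "connection reset by peer" error above would keep the node in INITIATED while Kubernetes retries the pull in the background, and only repeated failures or a genuinely missing image would surface as FAILED.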