open-traffic-generator / keng-operator

Transient container pull from remote repos causes flakes with topology creation in KNE #27

Closed alexmasi closed 7 months ago

alexmasi commented 1 year ago

Hello @anjan-keysight @biplamal, seeing some flakes when bringing up KNE topos with IxiaTG:

creating topology: failed to create topology: Node "otg": Status FAILED Reason got failure in ixia CRD status: Container ixia-c failed - rpc error: code = Unknown desc = failed to pull and unpack image "us-west1-docker.pkg.dev/.../ixia-c-controller:0.0.1-4013": failed to copy: read tcp 172.18.0.2:47650->74.125.132.82:443: read: connection reset by peer

This happens rarely (less than 1%) but is still affecting our KNE test runs. It appears to me that the Ixia operator treats an image pull failure as a FAILED state (https://github.com/open-traffic-generator/ixia-c-operator/blob/a6bc34d9bc987a7d01869cfbfae670c7294862b7/README.md#ixiatg-crd). However, when the pull merely flakes (interrupted, etc.), a retry should be allowed before returning FAILED. If the image is not found and that's what led to the failure, then FAILED makes sense, but for transient pull errors FAILED is too harsh; INITIATED should be returned instead for a certain number of pull failures before declaring FAILED.

K8s retries ErrImagePull failures automatically; however, for ixiatg we poll the status from the operator, and that's what's causing the error.

This is specifically when there is a transient error with Kubernetes pulling the image from a remote repo (in this case read: connection reset by peer). Normally k8s silently retries these errors and will hang in a backoff loop indefinitely. The ixia-c operator treats these transient errors as unrecoverable failures.
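
To make the request concrete, here is a rough sketch of the status mapping I have in mind. It is written in Python only for brevity and is not the operator's code: the function names, the `max_transient_failures` threshold, and the substring list used to spot a "permanent" error are all hypothetical. In particular, matching on message text is just a heuristic assumption; only the INITIATED and FAILED states come from the CRD README.

```python
from typing import Sequence

# Hypothetical heuristic: substrings suggesting that retrying the pull cannot help.
# This list is an assumption for illustration, not something the operator defines.
PERMANENT_MARKERS = ("not found", "manifest unknown", "invalid reference format")


def is_transient_pull_error(message: str) -> bool:
    """Treat any pull error without a 'permanent' marker as retryable."""
    lowered = message.lower()
    return not any(marker in lowered for marker in PERMANENT_MARKERS)


def state_while_pulling(pull_errors: Sequence[str],
                        max_transient_failures: int = 5) -> str:
    """Map the pull errors seen so far to the state reported in the CRD status.

    pull_errors holds the error messages observed for one container, oldest first.
    max_transient_failures is a made-up threshold; the right value is up for debate.
    """
    if not pull_errors:
        return "INITIATED"  # still pulling, nothing has gone wrong yet
    if not is_transient_pull_error(pull_errors[-1]):
        return "FAILED"  # e.g. image not found: retrying will not help
    if len(pull_errors) > max_transient_failures:
        return "FAILED"  # transient errors persisted for too long
    return "INITIATED"  # let Kubernetes keep retrying in its backoff loop


if __name__ == "__main__":
    reset = "failed to copy: read tcp ...: read: connection reset by peer"
    print(state_while_pulling([reset]))
    print(state_while_pulling(["failed to resolve reference: not found"]))
```

With something like this, a single connection reset keeps the node in INITIATED, while a missing image or a long run of pull errors still surfaces as FAILED.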

arkajyoti-cloud commented 1 year ago

will look into it.

arkajyoti-cloud commented 7 months ago

Fix available in recent drops.