The tutorial e2e test has been failing: it expects the experiment to be in the failed state, but the experiment is still in the started state.
I think it is a combination of two things:
Incomplete cluster refresh after completed tests.
This seems to leave the Prometheus server in an unstable state. It crashes a few times, which can prevent glooshot from registering the failure conditions.
Insufficient timeouts. We don't allow enough time for the first image pull, so retries are needed: the first attempt primes the container repo, and the second attempt passes the test.
The following workaround allowed v0.0.5 to pass:
clear the resources manually
kubectl delete ns bookinfo
kubectl delete ns glooshot
# namespace (do in background to ignore not-exist error)
kubectl delete ns istio-system &
# cluster-scoped resources
for i in `kubectl get customresourcedefinitions -o=jsonpath="{.items[*].metadata.name}"`; do echo $i |grep istio|xargs kubectl delete customresourcedefinition ; done
for i in `kubectl get clusterrole -o=jsonpath="{.items[*].metadata.name}"`; do echo $i |grep istio|xargs kubectl delete clusterrole ; done
for i in `kubectl get clusterrolebinding -o=jsonpath="{.items[*].metadata.name}"`; do echo $i |grep istio|xargs kubectl delete clusterrolebinding ; done
# do in background to ignore not-exist error
kubectl delete mutatingwebhookconfiguration istio-sidecar-injector &
# namespace-scoped resources in namespaces other than istio-system
for n in `kubectl get ns -o=jsonpath="{.items[*].metadata.name}"`; do
  echo $n
  # delete each secret made by istio
  for i in `kubectl get secrets -n=$n -o=jsonpath="{.items[*].metadata.name}"`; do echo $i |grep istio|xargs kubectl delete secret -n=$n; done
done
run the release once (expect failure, prime cache)
run the release again (expect success)
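The "run once to prime, run again to pass" step could be scripted with a small retry wrapper. This is only a sketch: `retry` is a hypothetical helper, and the command passed to it stands in for whatever actually runs the release, which is not named here.

```shell
# Hypothetical helper: run a command up to N times, stopping at the first success.
# Usage: retry <max_attempts> <command> [args...]
retry() {
  attempts="$1"
  shift
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $n of $attempts failed" >&2
    n=$((n + 1))
  done
  return 1
}
```

For example, `retry 2 ./run-release.sh` (placeholder script name) would tolerate the first, cache-priming failure and only report an error if the second attempt also fails. This hides the symptom rather than fixing the timeout, so it is a stopgap at best.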
TODO
[ ] reset cluster after test
delete glooshot, bookinfo, istio, and supergloo resources
[ ] change test to wait for all bookinfo pods to be ready
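One possible way to implement the "wait for all bookinfo pods to be ready" item is `kubectl wait`. The sketch below assumes the pods live in the bookinfo namespace and that a 300s default timeout is enough to cover the first image pull. Both the helper name and the timeout are assumptions, not values taken from the test suite.

```shell
# Hypothetical helper: block until every pod in a namespace reports Ready.
# Usage: wait_for_pods_ready <namespace> [timeout]
wait_for_pods_ready() {
  ns="$1"
  timeout="${2:-300s}"  # assumed default; tune to the slowest image pull
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
}
```

Calling `wait_for_pods_ready bookinfo` before starting the experiment would fail fast with a non-zero exit code if the pods never become ready, instead of leaving the experiment stuck in the started state. Note that `kubectl wait` also errors if the namespace contains no pods yet, so it should run after the bookinfo manifests have been applied.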
e2e tests during release have been flakey