quintilesims / layer0

Build, Manage, and Deploy Your Applications
Apache License 2.0
44 stars 20 forks source link

Service retry logic incorrectly swallows errors and retries when it shouldn't #604

Open tlake opened 6 years ago

tlake commented 6 years ago

@The logs below illustrate a failing smoketest due to being run in the wrong region, but they also highlight a more important problem: the retry logic (for services, at least) incorrectly swallows an AWS UnsupportedFeatureException error and proceeds with retry logic when it should instead be failing immediately and returning that error. In other words, we only avoid retry logic for a whitelist of errors, when we should only be entering retry logic for a whitelist of errors. Failing immediately should be the default, not retrying.

Logs from service smoketests:

bats service.bats
 ✗ create
   (in test file service.bats, line 10)
     `l0 service create env_name svc_name_stateless1 dpl_name_stateless:latest' failed
   ENVIRONMENT ID  ENVIRONMENT NAME  OS     LINKS
   envname360ae    env_name          linux  
   DEPLOY ID     DEPLOY NAME         VERSION  COMPATIBILITIES
   dplname54de5  dpl_name_stateful1  2        stateful
   DEPLOY ID     DEPLOY NAME         VERSION  COMPATIBILITIES
   dplname10bf4  dpl_name_stateful2  2        stateful
   DEPLOY ID     DEPLOY NAME         VERSION  COMPATIBILITIES
   dplnamef885d  dpl_name_stateless  12       stateful
                                              stateless
   LOADBALANCER ID  LOADBALANCER NAME  TYPE         ENVIRONMENT  SERVICE  PORTS       PUBLIC  URL
   lbnamea28010     lb_name_alb        application  env_name              80:80/HTTP  true    l0-jlprefactortest-lbnamea28010-1930624962.us-west-2.elb.amazonaws.com
   LOADBALANCER ID  LOADBALANCER NAME  TYPE     ENVIRONMENT  SERVICE  PORTS       PUBLIC  URL
   lbnamecdaff3     lb_name_clb        classic  env_name              80:80/HTTP  true    l0-jlprefactortest-lbnamecdaff3-1410965074.us-west-2.elb.amazonaws.com
   2018/04/13 09:58:02 Maximum retry attempts reached

Logs from layer0 API:

2018/04/13 09:57:59 POST /service 
2018/04/13 09:58:00 [DEBUG] an api error occured: ServerError (code='EventualConsistencyError') UnsupportedFeatureException: FARGATE is not supported in this region.
    status code: 400, request id: cdfcd78e-3f3b-11e8-99b2-7f217f1e677c
2018/04/13 09:58:00 POST /service 
2018/04/13 09:58:01 [DEBUG] an api error occured: ServerError (code='EventualConsistencyError') UnsupportedFeatureException: FARGATE is not supported in this region.
    status code: 400, request id: cea9e083-3f3b-11e8-99b2-7f217f1e677c
2018/04/13 09:58:01 POST /service 
2018/04/13 09:58:02 [DEBUG] an api error occured: ServerError (code='EventualConsistencyError') UnsupportedFeatureException: FARGATE is not supported in this region.
    status code: 400, request id: cf58700a-3f3b-11e8-99b2-7f217f1e677c
zpatrick commented 6 years ago

From Slack

retry(fn) return err



that could work because it should retry again and `err = foo()` should make `err = nil` if `foo()` succeeds. If it retries or times out, `err` keeps the last error