haskell-works / cabal-cache

CI assistant
BSD 3-Clause "New" or "Revised" License
44 stars 11 forks source link

Should cabal-cache retry failed downloads/uploads? #191

Closed hasufell closed 1 year ago

hasufell commented 1 year ago

It seems there's a lot of spurious failures in CIs and cabal-cache will just fail hard if one request didn't make it. Shouldn't there be some retry logic wrt S3 requests?

newhoggy commented 1 year ago

There is some retry logic here: https://github.com/haskell-works/cabal-cache/blob/newhoggy/handle-all-exceptions-from-download/src/HaskellWorks/CabalCache/IO/Lazy.hs#L187-L201

hasufell commented 1 year ago

https://github.com/haskell/haskell-language-server/actions/runs/3751326564/jobs/6372217119#step:3:15727

Weird. Looks like it didn't retry on 503:

cabal-cache sync-to-archive --host-name-override=*** --host-port-override=443 --host-ssl-override=True --region us-west-2 --store-path="/Users/worker/hls/_work/haskell-language-server/haskell-language-server/store" --archive-uri "s3://haskell-language-server/aarch64-apple-darwin"
cabal-cache: ServiceError (ServiceError' {_serviceAbbrev = Abbrev "S3", _serviceStatus = Status {statusCode = 503, statusMessage = "Service Temporarily Unavailable"}, _serviceHeaders = [("Date","Wed, 21 Dec 2022 18:01:26 GMT"),("Content-Type","text/plain"),("Content-Length","148"),("Connection","keep-alive"),("x-amz-request-id","tx000000000000000000000-0000000000-0000-default")], _serviceCode = ErrorCode "Service Temporarily Unavailable", _serviceMessage = Nothing, _serviceRequestId = Just (RequestId "tx000000000000000000000-0000000000-0000-default")})

But exited with non-zero code.

hasufell commented 1 year ago

It appears this is on upload and it happens frequently enough for me that it's a problem: https://github.com/haskell/haskell-language-server/actions/runs/3751326564/jobs/6377428541#step:3:9197

hasufell commented 1 year ago

I believe this is the code that fails:

https://github.com/haskell-works/cabal-cache/blob/0b76461eeea281d623906745a81a7ed0cb42d30c/app/App/Commands/SyncToArchive.hs#L93-L96

It uses unsafeInterleaveIO, which makes the logs look a little off I think.

newhoggy commented 1 year ago

That's interesting. How did you figure that out?

The unsafe interleave is so that if you never access AWS you don't fail for AWS reasons - for example if syncing to/from filesystem.

hasufell commented 1 year ago

I also noticed that our retry logic is inconsistent due to:

handleAwsError :: MonadCatch m => m a -> m (Either AppError a)
handleAwsError f = catch (Right <$> f) $ \(e :: AWS.Error) ->
  case e of
    (AWS.ServiceError (AWS.ServiceError' _ s@(HTTP.Status 404 _) _ _ _ _)) -> return (Left (AwsAppError s))
    (AWS.ServiceError (AWS.ServiceError' _ s@(HTTP.Status 301 _) _ _ _ _)) -> return (Left (AwsAppError s))
    _                                                                      -> throwM e

Most of the status errors are not encapsulated into ExceptT and so the retry function will not trigger for 5xx etc. either.

newhoggy commented 1 year ago

Ah right. Yeah, perhaps it's better when catching errors to convert them all to ExceptT errors.

BTW, I noticed your conversation in the oops repository. Did you end up choosing between plucky, oops or some other error handling library?

hasufell commented 1 year ago

https://github.com/i-am-tom/oops/issues/2#issuecomment-1362310143

newhoggy commented 1 year ago

Awesome. Thanks!