m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

Should retry some storage errors. #986

Open gfr10598 opened 3 years ago

gfr10598 commented 3 years ago

We are currently seeing a low rate of GCS storage errors:

2021/04/13 04:54:19 rowwriter.go:119: googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-staging ndt/ndt7/2020/08/27/20200827T170704.505210Z-ndt7-mlab3-lhr05-ndt.tgz.json

These would likely succeed on retry.
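
As a rough illustration of what "retry" could mean here, the following is a minimal sketch, not the code in rowwriter.go. It treats a small set of HTTP status codes (including 503) as transient and retries with a doubling delay. The names statusError, isTransient and retry are hypothetical; with the real client the status check would presumably use errors.As against *googleapi.Error and its Code field instead of the stand-in type.

```go
package main

import (
	"errors"
	"time"
)

// statusError is a hypothetical stand-in for an error carrying an HTTP status
// code. It is an assumption for this sketch, not a type from the GCS client.
type statusError struct {
	Code int
	Msg  string
}

func (e *statusError) Error() string { return e.Msg }

// isTransient reports whether err wraps a status code that is usually worth
// retrying. The list of codes is an assumption, not taken from the issue.
func isTransient(err error) bool {
	var se *statusError
	if !errors.As(err, &se) {
		return false
	}
	switch se.Code {
	case 429, 500, 502, 503, 504:
		return true
	}
	return false
}

// retry calls op up to maxAttempts times, doubling the delay after each
// transient failure. Non-transient errors are returned immediately.
// It returns the number of attempts made and the last error.
func retry(maxAttempts int, delay time.Duration, op func() error) (attempts int, err error) {
	for attempts = 1; ; attempts++ {
		err = op()
		if err == nil || !isTransient(err) || attempts >= maxAttempts {
			return attempts, err
		}
		time.Sleep(delay)
		delay *= 2
	}
}
```

A caller would wrap the failing operation, for example retry(5, 2*time.Second, func() error { ... }). Whether a retry at this level can succeed depends on whether the failed operation can be repeated safely.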

gfr10598 commented 3 years ago

Write failure errors

After adding a retry with a 2-second delay, we are still seeing the same write errors.

gfr10598 commented 3 years ago

Looks like there is very little retrying happening in the library. If I add a 20-second delay before each retry, the initial attempt takes between 0 and 5 seconds, so the library is not retrying much internally. The Write retries then fail every 20 seconds and never succeed.

There is then a later failed retry, with fewer rows, likely driven by the Flush prior to Close at the end of the archive. Not clear what happened in between. Will investigate further.

2021/04/15 04:06:48 rowwriter.go:122: Retrying after 347234 of 385862 bytes 10m3s googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-sandbox/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz.json
2021/04/15 04:07:08 rowwriter.go:122: Retrying after 0 of 965 bytes 20s googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-sandbox/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz.json
2021/04/15 04:16:48 task.go:179: Processed 4401 files, 0 nil data, 4359 rows committed, 42 failed, from gs://archive-measurement-lab/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz into annotation

gfr10598 commented 3 years ago

Tried running many retries, 20 seconds apart. Across half a dozen failures, none of the retries ever succeeded. The Close also fails.

Checking GCS shows that the corresponding file still exists from a previous parsing, and has not been replaced.

Likely we should abandon the partially written file, probably by cancelling the context that was used to create the object handle?
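
A minimal sketch of that idea, under stated assumptions: each attempt gets its own cancellable context and a fresh writer, and on a Write failure the context is cancelled rather than calling Close. In the Go storage client the context is the one passed to NewWriter, and its documentation says that cancelling it stops the write without saving the data. The names newWriterFunc, uploadOnce, uploadWithRestart, flakyWriter and simulate are all hypothetical; flakyWriter is only a test double standing in for the real writer.

```go
package main

import (
	"context"
	"errors"
	"io"
)

// newWriterFunc is a hypothetical factory. With the real client it would
// presumably wrap obj.NewWriter(ctx).
type newWriterFunc func(ctx context.Context) io.WriteCloser

// uploadOnce writes data through a fresh writer. If Write fails, it cancels
// the writer's context and skips Close, so the partial object is abandoned.
func uploadOnce(parent context.Context, newWriter newWriterFunc, data []byte) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // runs after Close on the success path

	w := newWriter(ctx)
	if _, err := w.Write(data); err != nil {
		cancel() // abandon the partial write instead of finalizing it
		return err
	}
	return w.Close()
}

// uploadWithRestart retries the whole upload, using a new writer each time.
// It returns the number of attempts made and the last error.
func uploadWithRestart(parent context.Context, newWriter newWriterFunc, data []byte, maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = uploadOnce(parent, newWriter, data); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}

// flakyWriter is a test double: Write fails when fail is set, and Close
// commits only if the writer's context has not been cancelled.
type flakyWriter struct {
	ctx       context.Context
	fail      bool
	committed *bool
}

func (w *flakyWriter) Write(p []byte) (int, error) {
	if w.fail {
		return 0, errors.New("503 Service Unavailable")
	}
	return len(p), nil
}

func (w *flakyWriter) Close() error {
	if err := w.ctx.Err(); err != nil {
		return err
	}
	*w.committed = true
	return nil
}

// simulate runs uploadWithRestart against writers whose first `failures`
// instances fail, and reports whether anything was committed.
func simulate(failures, maxAttempts int) (attempts int, committed bool, err error) {
	created := 0
	factory := func(ctx context.Context) io.WriteCloser {
		created++
		return &flakyWriter{ctx: ctx, fail: created <= failures, committed: &committed}
	}
	attempts, err = uploadWithRestart(context.Background(), factory, []byte("row\n"), maxAttempts)
	return attempts, committed, err
}
```

The design choice here is to restart the upload from the beginning with a new writer rather than retrying Write on the same one, since the retries on the existing writer described above never succeeded. This assumes the rows can be regenerated or were buffered.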