gfr10598 opened this issue 3 years ago
After adding a retry with a 2 second delay, we are still seeing the same write errors.
Looks like there is very little retry happening in the library. If I add a 20 second delay and retry, the initial attempt takes between 0 and 5 seconds, so there is not much internal retry. The Write retries then fail every 20 seconds and never succeed.
There is then a later failed retry, with fewer rows, likely driven by the Flush prior to Close at the end of the archive. Not clear what happened in between. Will investigate further.
2021/04/15 04:06:48 rowwriter.go:122: Retrying after 347234 of 385862 bytes 10m3s googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-sandbox/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz.json
"2021/04/15 04:07:08 rowwriter.go:122: Retrying after 0 of 965 bytes 20s googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-sandbox/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz.json
2021/04/15 04:16:48 task.go:179: Processed 4401 files, 0 nil data, 4359 rows committed, 42 failed, from gs://archive-measurement-lab/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz into annotation
Tried running many retries, with 20 seconds between attempts, roughly the shape sketched below. In half a dozen failures, none ever succeeded later. The Close also fails.
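For context, the retry wrapper is essentially a fixed-delay loop around Write. This is a minimal sketch, not the actual rowwriter code; writeWithRetry and its parameters are hypothetical stand-ins:

```go
package sketch

import (
	"context"
	"io"
	"log"
	"time"
)

// writeWithRetry is a hypothetical stand-in for the retry added around
// rowwriter's Write: it retries the same Write call with a fixed delay.
func writeWithRetry(ctx context.Context, w io.Writer, data []byte, delay time.Duration, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if _, err = w.Write(data); err == nil {
			return nil
		}
		log.Printf("Retrying after %v: %v", delay, err)
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```

Note that each retry reuses the same writer, which may explain why a Write that has already failed never succeeds on a later attempt.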
Checking GCS shows that the corresponding file still exists from a previous parsing, and has not been replaced.
Likely we should abandon the partially written file, probably by cancelling the context that was used to create the object handle?
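A minimal sketch of that idea, assuming rowwriter can give the Writer its own cancellable context (writeObject is hypothetical; per the cloud.google.com/go/storage docs, cancelling the Writer's context aborts the upload, so nothing is committed):

```go
package sketch

import (
	"context"

	"cloud.google.com/go/storage"
)

// writeObject gives the Writer a cancellable context; a failed Write
// cancels it, so GCS discards the partial data and the previous object
// version stays in place.
func writeObject(ctx context.Context, obj *storage.ObjectHandle, data []byte) error {
	wctx, cancel := context.WithCancel(ctx)
	defer cancel()

	w := obj.NewWriter(wctx)
	if _, err := w.Write(data); err != nil {
		cancel() // abandon the partially written object
		return err
	}
	return w.Close() // only a successful Close commits the object
}
```

The caller would still need to recreate the object handle and re-send the buffered rows on a fresh Writer, since the cancelled one is unusable.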
We are currently seeing a low rate of GCS storage errors; these would likely succeed on retry.
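If we add that retry, something like this sketch could gate it on transient status codes (isRetryable is hypothetical; the codes listed are an assumption about which errors, like the 503s above, are worth retrying):

```go
package sketch

import (
	"errors"
	"net/http"

	"google.golang.org/api/googleapi"
)

// isRetryable reports whether err is a transient GCS error that a fresh
// attempt might clear, as opposed to an error on an already-broken Writer.
func isRetryable(err error) bool {
	var apiErr *googleapi.Error
	if errors.As(err, &apiErr) {
		switch apiErr.Code {
		case http.StatusServiceUnavailable, // 503, as in the logs above
			http.StatusTooManyRequests,     // 429
			http.StatusInternalServerError: // 500
			return true
		}
	}
	return false
}
```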