populationgenomics / hail

Scalable genomic data analysis.
https://hail.is
MIT License

[services] indicate how many errors we have seen #288

Closed lgruen closed 1 year ago

lgruen commented 1 year ago

Cherry-picking https://github.com/hail-is/hail/pull/12984/commits (see https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/SocketException.20when.20writing.20Table/near/355889336).

We are increasingly seeing "Connection reset" errors, which we reclassified from "transient" (retried indefinitely) to "retry once". The current code does not report how many errors a given call has already seen, so it is impossible to tell whether we are in fact retrying this error exactly once.

If it turns out that many "Connection reset" errors occur twice in a row for the same call, we should consider changing "retry once" to "retry five times" or using a more generous delay.
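For illustration only, the idea can be sketched as follows. This is not the code from the cherry-picked PR: `retry_limited`, `is_limited_retry_error`, `max_retries` and `base_delay_s` are hypothetical names, and treating Python's `ConnectionResetError` as the "retry once" class is an assumption. The point is that every log line carries the running error count, so the logs show whether a call failed once or twice.

```python
import logging
import time

log = logging.getLogger("retry_sketch")


def is_limited_retry_error(exc: Exception) -> bool:
    # Hypothetical classifier: assume "Connection reset" surfaces as
    # ConnectionResetError and belongs to the limited-retry class.
    return isinstance(exc, ConnectionResetError)


def retry_limited(fn, max_retries=1, base_delay_s=0.1, sleep=time.sleep):
    """Call fn(), retrying limited-retry errors at most max_retries times.

    The number of errors seen so far is included in every log message.
    """
    errors_seen = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            errors_seen += 1
            if not is_limited_retry_error(exc) or errors_seen > max_retries:
                log.error("giving up after %d error(s): %r", errors_seen, exc)
                raise
            # Exponential backoff: base, 2 * base, 4 * base, ...
            delay = base_delay_s * 2 ** (errors_seen - 1)
            log.warning(
                "error %d (retrying, at most %d retries): %r; sleeping %.2fs",
                errors_seen, max_retries, exc, delay,
            )
            sleep(delay)
```

With this shape, moving from "retry once" to "retry five times" is `max_retries=5`, and a more generous delay is a larger `base_delay_s`.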