Open MattWellie opened 2 months ago
So I think the reason for this could be that the backoff
decorator only works when you have a session instantiated like:
session = await client.connect_async(reconnecting=True)
result = await session.execute(query)
Should also be noted that decorator effects don't get passed to methods that are called inside the the wrapped method so it is possible that the code in
does not pass the decorator effects to the session
object that is initiated inside execute_sync
as needed.
An approach that could be worth trying is to modify the RequestsHTTPTransport instance here:
to
transport = RequestsHTTPTransport(
url=url or get_sm_url(),
headers={'Authorization': f'Bearer {token}'},
retries=6,
retry_backoff_factor=1.0,
)
This would make 6 retries, spaced at 1s, 2s, 4s, 8s, 16s, 32s before giving up. I believe this can be tried without the backoff
decorator.
Alternatively, we could just instantiate the session
variable in the code and try to use the decorator like we are right now and see if that helps.
Happy to try out the retries/backoff in that object/client. I only went with the backoff decorator as it was working A-ok in my local experimentation and looked to be simple. After a few revisions I'm kinda sick of it now 😓
The backoff/retry was working, just not necessarily according to the timing options I set. It did seem to recognise the error types to catch etc. so I'm unsure about how some but not all of the arguments are being respected...
Mmmmm, yeah this does look tricky. I'd investigate it further but I'm not too clear on how to reproduce this, and I also do not have the perms to see that batch so most of this is a shot in the dark for me. If you have the bandwidth to try the retries/backoff in the object/client, we can rule out/assess next steps from there and I'll look into it a bit more then :(
Can we just roll our own? This should be fairly straightforward to implement.
From the logging messages emitted during graphQL queries it looks like the backoff isn't running for nearly as long as the code suggests it should.
Ref: https://github.com/populationgenomics/metamist/blob/f1b9811c0b4e9facd10e20456e78edbf1168eace/metamist/graphql/__init__.py#L145
Batch example: https://batch.hail.populationgenomics.org.au/batches/445352/jobs/1
Logging: