Backoff still not working well

MattWellie commented 2 months ago

From the logging messages emitted during graphQL queries it looks like the backoff isn't running for nearly as long as the code suggests it should.

Ref: https://github.com/populationgenomics/metamist/blob/f1b9811c0b4e9facd10e20456e78edbf1168eace/metamist/graphql/__init__.py#L145

Batch example: https://batch.hail.populationgenomics.org.au/batches/445352/jobs/1

Logging:

2024-05-09 00:39:35 INFO (backoff 105): Backing off query(...) for 0.4s (gql.transport.exceptions.TransportServerError: 504 Server Error: Gateway Timeout for url: https://sample-metadata-api-mnrpw3mdza-ts.a.run.app/graphql)
2024-05-09 00:44:36 ERROR (backoff 120): Giving up query(...) after 2 tries (gql.transport.exceptions.TransportServerError: 504 Server Error: Gateway Timeout for url: https://sample-metadata-api-mnrpw3mdza-ts.a.run.app/graphql)

> docker run australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_workflows:latest pip show metamist
Name: metamist
Version: 6.10.1
...

nevoodoo commented 2 months ago

Debugging

So I think the reason for this could be that the backoff decorator only works when you have a session instantiated like:

session = await client.connect_async(reconnecting=True)

result = await session.execute(query)

Should also be noted that decorator effects don't get passed to methods that are called inside the the wrapped method so it is possible that the code in

https://github.com/populationgenomics/metamist/blob/f1b9811c0b4e9facd10e20456e78edbf1168eace/metamist/graphql/__init__.py#L166-L169

does not pass the decorator effects to the session object that is initiated inside execute_sync as needed.

Potential Solution

An approach that could be worth trying is to modify the RequestsHTTPTransport instance here:

https://github.com/populationgenomics/metamist/blob/f1b9811c0b4e9facd10e20456e78edbf1168eace/metamist/graphql/__init__.py#L71-L74

to

transport = RequestsHTTPTransport(
            url=url or get_sm_url(),
            headers={'Authorization': f'Bearer {token}'},
            retries=6,
            retry_backoff_factor=1.0,
        )

This would make 6 retries, spaced at 1s, 2s, 4s, 8s, 16s, 32s before giving up. I believe this can be tried without the backoff decorator.

Alternatively, we could just instantiate the session variable in the code and try to use the decorator like we are right now and see if that helps.

MattWellie commented 2 months ago

Happy to try out the retries/backoff in that object/client. I only went with the backoff decorator as it was working A-ok in my local experimentation and looked to be simple. After a few revisions I'm kinda sick of it now 😓

The backoff/retry was working, just not necessarily according to the timing options I set. It did seem to recognise the error types to catch etc. so I'm unsure about how some but not all of the arguments are being respected...

nevoodoo commented 2 months ago

Mmmmm, yeah this does look tricky. I'd investigate it further but I'm not too clear on how to reproduce this, and I also do not have the perms to see that batch so most of this is a shot in the dark for me. If you have the bandwidth to try the retries/backoff in the object/client, we can rule out/assess next steps from there and I'll look into it a bit more then :(

illusional commented 1 month ago

Can we just roll our own? This should be fairly straightforward to implement.

populationgenomics / metamist

Backoff still not working well #769

Debugging

Potential Solution