soundcloud / api

A public repo for our Developer Community to engage about bugs and feature requests on our Public API

Baseline rate of 503 errors still above the pre-August 20th baseline #314

Open mgoodfellow opened 1 week ago

mgoodfellow commented 1 week ago

Hi,

Related to: https://github.com/soundcloud/api/issues/311

We are still seeing an increased error rate on the API:

image

These started around 20th August and have been ongoing since.

The errors take one of two forms:

upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 113

OR

upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111
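For anyone triaging similar responses: the trailing numbers look like Linux errno values, in which case 111 is ECONNREFUSED (connection refused) and 113 is EHOSTUNREACH (no route to host). The sketch below assumes that reading, and assumes the response body keeps the `delayed connect error: N` wording shown above. The function name `classify_503_body` is hypothetical and not part of any SoundCloud client.

```python
import re

# Assumption: the code after "delayed connect error:" is a Linux errno.
# On Linux, 111 is ECONNREFUSED and 113 is EHOSTUNREACH.
LINUX_ERRNO = {
    111: "ECONNREFUSED (connection refused)",
    113: "EHOSTUNREACH (no route to host)",
}

_CONNECT_ERROR = re.compile(r"delayed connect error: (\d+)")


def classify_503_body(body: str) -> str:
    """Label the connect error code embedded in a 503 response body."""
    match = _CONNECT_ERROR.search(body)
    if match is None:
        return "unknown"
    code = int(match.group(1))
    # Fall back to the raw number for codes not listed above.
    return LINUX_ERRNO.get(code, f"errno {code}")
```

Counting responses per label would show whether a given spike is dominated by refused connections or by unreachable hosts.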

Retries often work. The errors are sporadic and spiky, so we generally see them clustered:

image

We are seeing these errors primarily on Tracks, Reposts and Profiles. They affect both read-only (GET) and mutation (PUT / POST etc.) endpoints.

It would be good to understand these, as we only retry idempotent reads, not mutations.
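For context, the retry policy described here (retry 503s on idempotent reads, never on mutations) can be sketched roughly as follows. This is a minimal illustration, not our production client: `request_with_retry`, the `send` callable, the attempt count and the backoff values are all assumptions chosen for the example.

```python
import random
import time
from typing import Callable, Tuple

# Methods treated as safe to retry in this sketch.
IDEMPOTENT_METHODS = {"GET", "HEAD"}


def request_with_retry(
    send: Callable[[], Tuple[int, str]],
    method: str,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send(); retry 503 responses only for idempotent methods.

    `send` is a hypothetical callable that performs one HTTP request
    and returns (status_code, body).
    """
    # Mutations (PUT / POST etc.) get exactly one attempt.
    attempts = max(1, max_attempts) if method.upper() in IDEMPOTENT_METHODS else 1
    attempt = 0
    while True:
        status, body = send()
        attempt += 1
        if status != 503 or attempt >= attempts:
            return status, body
        # Exponential backoff with full jitter, so clustered failures
        # do not produce synchronized retries.
        sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

With this shape a failed PUT or POST surfaces the 503 to the caller immediately, which is why the clustered errors hurt mutations more than reads.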

mgoodfellow commented 1 week ago

Hi @youssefhassan

Seeing another spike in the 503 error rate, with delayed connect error 113:

image

Also seeing spikes in response times across most endpoints.

mgoodfellow commented 1 week ago

Seems to be recovering again; the spike lasted from 16:48 to 17:26 UTC. Response times are now recovering as well.

youssefhassan commented 6 days ago

I'm keeping an eye on this and I really appreciate the reporting; it helps a lot. I will keep this thread open, so please share whenever you see spikes of 500s, and hopefully we will find a fix soon.