microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

STAC Search API Timeouts #34

Closed DFEvans closed 2 years ago

DFEvans commented 2 years ago

Over the last week, I've been seeing intermittent outages of the Planetary Computer STAC API search endpoint. Requests return an HTTP 500 error with the following body:

{
  "detail": "canceling statement due to statement timeout"
}

This doesn't seem to depend on access method, or on the content of the request - e.g. even attempting to visit the search API with no parameters via my browser times out: https://planetarycomputer.microsoft.com/api/stac/v1/search?limit=250

The Status page claims that "STAC API: Search" is operational, and no incidents are noted: https://planetarycomputer-status.microsoft.com/
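For anyone hitting the same intermittent 500s, a client-side stopgap might be to retry the search with exponential backoff. The following is only a minimal sketch using the Python standard library: the function names, the number of attempts, and the delays are illustrative assumptions, not anything provided by the Planetary Computer.

```python
import json
import time
import urllib.error
import urllib.request

# Search URL taken from the report above.
SEARCH_URL = "https://planetarycomputer.microsoft.com/api/stac/v1/search?limit=250"


def fetch_json(url, timeout=30):
    """GET a URL and return (status_code, parsed_body). Hypothetical helper."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Error responses still carry a body, e.g. {"detail": "..."}.
        try:
            body = json.loads(err.read().decode("utf-8"))
        except ValueError:
            body = {}
        return err.code, body


def search_with_retry(url=SEARCH_URL, fetch=fetch_json, attempts=4,
                      base_delay=2.0, sleep=time.sleep):
    """Retry on 5xx responses, doubling the wait each time (assumed policy)."""
    for attempt in range(attempts):
        status, body = fetch(url)
        if status < 500:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Every attempt failed: hand back the last error so the caller can log it.
    return status, body


if __name__ == "__main__":
    status, body = search_with_retry()
    print(status, body.get("detail", "ok"))
```

Retrying will not help during a sustained outage, and it adds load to a service that is already struggling, so the attempt count is best kept small.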

TomAugspurger commented 2 years ago

Thanks for opening this issue. Do you have notes on when the errors occurred? Earlier this morning (roughly 8 hours ago) there was some downtime from a data ingest causing a deadlock in the database. That should be fixed now.

> The Status page claims that "STAC API: Search" is operational, and no incidents are noted: https://planetarycomputer-status.microsoft.com/

Yes, we'll need some finer-grained, more comprehensive health checks there. The API application was healthy. It was the database that was having issues, so actually performing a search failed.
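To illustrate what a finer-grained check could look like, the sketch below issues a real, minimal search and only reports healthy if a well-formed result comes back, so a database problem shows up even when the application itself responds. This is a hypothetical probe, not the Planetary Computer's actual monitoring; the `limit=1` URL, the function names, and the timeout are assumptions.

```python
import json
import urllib.error
import urllib.request

# Assumed probe URL: limit=1 keeps the request cheap while still
# exercising the database behind the search endpoint.
PROBE_URL = "https://planetarycomputer.microsoft.com/api/stac/v1/search?limit=1"


def http_get(url, timeout=10):
    """Return (status_code, body_text); network failures become status 0."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers connection errors and socket timeouts.
        return 0, ""


def search_is_healthy(get=http_get, url=PROBE_URL):
    """True only if a real search returns a response with a features list."""
    status, text = get(url)
    if status != 200:
        return False
    try:
        body = json.loads(text)
    except ValueError:
        return False
    return isinstance(body, dict) and isinstance(body.get("features"), list)


if __name__ == "__main__":
    print("search healthy:", search_is_healthy())
```

A check like this would have reported the search endpoint as down during the deadlock, since the 500 response with a "statement timeout" body fails the status test.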

DFEvans commented 2 years ago

It was pretty consistently unavailable from around 1100 to 1400 GMT, and then somewhat intermittently from then until 1700 GMT. That first part looks like it more or less lines up with that data ingest issue.

Is it expected that this will occur whenever a data ingestion occurs, or was this something that hadn't gone quite right? (I can't quite work out whether "fixed" meant "ingestion done" or "database issue fixed")

If it's useful to you, I can come back with times if it occurs again - although my use is also intermittent, so I won't promise to be too accurate!

TomAugspurger commented 2 years ago

> Is it expected that this will occur whenever a data ingestion occurs, or was this something that hadn't gone quite right?

A bit of both, unfortunately. We're working through a backlog of items to ingest, so volumes are a bit larger than normal. But we're working to develop a fix to avoid the conditions causing the deadlock in the first place.

TomAugspurger commented 2 years ago

We've made a few changes that should have fixed the worst of the timeouts.

In addition, https://github.com/microsoft/planetary-computer-apis/pull/52 implemented some more caching and rate-limiting, which should help with a second source of the timeouts. That will be deployed in our release next month.