stacks-archive / blockstack-explorer

A block explorer for Blockstack

Aggregate stats are broken #227

Closed: diwakergupta closed this issue 4 years ago

diwakergupta commented 4 years ago

This has happened at least twice in the past 2-3 weeks -- we don't know the exact frequency as we don't have any alerting around this type of inconsistency yet.

[Screenshot: 2020-08-16 at 10:10 PM]

In the last occurrence, the hypothesis was that this was a caching issue and wiping out the redis-cache seemed to have fixed it. More context at https://github.com/blockstackpbc/devops/issues/366

However, per @wileyj , today wiping the cache did not help.

Until Stacks 2.0 mainnet is live, we need to treat the legacy Explorer as a critical asset -- many people, investors and casual observers alike, depend on the legacy Explorer to show correct data and will continue to do so for a while longer.

A couple of asks for the @blockstack/dx-team:

Assigning to Alex for further discussion and prioritization.

wileyj commented 4 years ago

It does seem to resolve itself over time - but it's not clear how or why. Last night, I also looked at the DB where the data comes from, and everything there seemed accurate.

Just to add on to @diwakergupta's comment above: along with clearing the redis cache, I also cleared the local cache by restarting the containers.

agraebe commented 4 years ago

I'll bring this up tomorrow in the devx sprint check-in meeting

zone117x commented 4 years ago

Redis self-expires, and I think all the other caches do as well. My guess is that the aggregation process sometimes runs while the postgres dump from the Stacks v1 sqlite db is in progress. A while back @wileyj and I did some work to lock the postgres db during this process. Was there a regression? Or maybe the locking isn't thorough enough?

Edit: just to clarify, caching is a red herring here. The fix shouldn't be an attempt to detect this bug and clear caches.
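
For reference, the kind of locking being described can be done with a Postgres advisory lock taken by both the bulk load and the aggregation queries. The sketch below is only an illustration of that general technique, not the explorer's actual code; the lock id and `runBulkLoad` are hypothetical.

```ts
// Sketch only: serialize the bulk load and the aggregation reads with a
// Postgres advisory lock so a query never observes a half-loaded table.
// The lock id and runBulkLoad() are hypothetical placeholders.
import { Client } from 'pg';

const LOAD_LOCK_ID = 727001; // hypothetical application-wide lock id

async function withLoadLock<T>(client: Client, fn: () => Promise<T>): Promise<T> {
  await client.query('SELECT pg_advisory_lock($1)', [LOAD_LOCK_ID]);
  try {
    return await fn();
  } finally {
    await client.query('SELECT pg_advisory_unlock($1)', [LOAD_LOCK_ID]);
  }
}

// Loader side:  await withLoadLock(db, () => runBulkLoad(db));
// Query side:   await withLoadLock(db, () => db.query('SELECT ...'));
```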

diwakergupta commented 4 years ago

> Redis self-expires, and I think all the other caches do as well. My guess is that the aggregation process sometimes runs while the postgres dump from the Stacks v1 sqlite db is in progress. A while back @wileyj and I did some work to lock the postgres db during this process. Was there a regression? Or maybe the locking isn't thorough enough?
>
> Edit: just to clarify, caching is a red herring here. The fix shouldn't be an attempt to detect this bug and clear caches.

Even when the cache expires (or when it is manually cleared), the fact that this doesn't reliably fix the problem suggests the cache is not populated on demand (e.g. on a cache miss). Is there some process that periodically populates the cache from the DB?

wileyj commented 4 years ago

What might be helpful from the devops side is a script to force-populate the cache when this issue is detected, combined with monitoring changes that detect the issue in the first place.
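
A rough sketch of what such a script might look like, assuming the explorer's cache keys live in the same Redis the API reads from and that a request to the stats endpoint repopulates them; the key names and URL below are made up for illustration.

```ts
// Hypothetical force-refresh script: clear the (assumed) aggregate cache
// keys, then hit the stats endpoint so the normal code path recomputes and
// re-caches the data instead of waiting for the next keep-alive cycle.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://127.0.0.1:6379');
const STATS_URL = process.env.STATS_URL ?? 'http://localhost:3000/api/home/stats';

async function forceRefresh(): Promise<void> {
  await redis.del('home:aggregate-stats', 'home:name-counts'); // assumed key names
  const res = await fetch(STATS_URL); // Node 18+ global fetch
  if (!res.ok) {
    throw new Error(`warm-up request failed: HTTP ${res.status}`);
  }
  console.log('cache cleared and repopulated');
}

forceRefresh()
  .catch((err) => {
    console.error(err);
    process.exitCode = 1;
  })
  .finally(() => redis.disconnect());
```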

zone117x commented 4 years ago

Looks like the default redis caching behavior was not set to self-expire. Just opened a PR to set the default to 10 minutes: https://github.com/blockstack/blockstack-explorer-api/pull/45

Aside from a potential CDN cache, I think this error should now self-resolve a lot quicker.
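
For context, the change boils down to writing cache entries with an explicit expiry rather than persisting them indefinitely. A minimal illustration with ioredis (not the actual PR diff) might look like this:

```ts
// Illustration only: cache writes carry a 10-minute TTL so stale aggregate
// data ages out on its own even if nothing clears it explicitly.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://127.0.0.1:6379');
const DEFAULT_TTL_SECONDS = 10 * 60;

export async function setCached(key: string, value: unknown): Promise<void> {
  await redis.set(key, JSON.stringify(value), 'EX', DEFAULT_TTL_SECONDS);
}

export async function getCached<T>(key: string): Promise<T | null> {
  const raw = await redis.get(key);
  return raw === null ? null : (JSON.parse(raw) as T);
}
```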

zone117x commented 4 years ago

> Is there some process that periodically populates the cache from the DB?

There's a keep-alive that keeps the caches refreshed for a couple of the main pages. It uses redis though; if redis is cleared (by itself or manually), then there wouldn't be any data for the API to return. It would re-query and return fresh data, or error trying to.

It may be either:

  1. The front-end is still constructing a broken page (like the one screenshotted above) when an error response is returned.
  2. Due to postgres race conditions during the pgload process, the API could end up returning a 200 response with some empty data (see the sketch below).

The Kibana logs from around the time this occurred may tell us which.
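
On scenario (2), one possible guard (a sketch only, with assumed table and field names) is for the data layer to refuse to return or cache an aggregate result that is obviously empty, so a pgload race turns into an alertable error rather than a 200 with zeroed stats:

```ts
// Sketch: fail loudly on empty aggregates instead of caching them.
// fetchAggregatesFromDb() stands in for the real aggregation query;
// the table and field names are assumptions.
import { Client } from 'pg';

interface AggregateStats {
  totalNames: number;
}

async function fetchAggregatesFromDb(db: Client): Promise<AggregateStats> {
  const { rows } = await db.query(
    'SELECT count(*)::int AS total_names FROM subdomain_records'
  );
  return { totalNames: rows[0].total_names };
}

export async function loadAggregateStats(db: Client): Promise<AggregateStats> {
  const stats = await fetchAggregatesFromDb(db);
  if (stats.totalNames === 0) {
    // Most likely a race with the bulk load: surface a 5xx that shows up in
    // Kibana rather than caching empty data for the TTL window.
    throw new Error('aggregate stats came back empty; refusing to cache');
  }
  return stats;
}
```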

zone117x commented 4 years ago

@wileyj and I made some deployment modifications, and the issue should either be resolved or, if it happens again, should be easier to debug using the logs.

@diwakergupta can you comment on this issue if you see this again?

hstove commented 4 years ago

Thanks for jumping in on this, Matt.

diwakergupta commented 4 years ago

> @wileyj and I made some deployment modifications, and the issue should either be resolved or, if it happens again, should be easier to debug using the logs.
>
> @diwakergupta can you comment on this issue if you see this again?

Will do, thanks @zone117x @wileyj !

I'd still like to have some alerting on this before we close out this issue though. @hstove?

zone117x commented 4 years ago

FYI, the last 4 days of the Kibana logs for the API show no error responses related to this home page query. So I don't think we could use the API endpoint(s) for alerting on this specific error.

It seems to be either an API response with empty data (because of a postgres race condition), or potentially some front-end bug.
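
Given that, alerting would likely have to inspect the response body rather than status codes. A sketch of the kind of scheduled probe that could do this (the URL and field names are assumptions, not the real API):

```ts
// Hypothetical monitoring probe: fetch the stats payload and exit non-zero
// if the aggregates look empty, so an external scheduler/alerter can page.
const STATS_URL =
  process.env.STATS_URL ?? 'https://explorer.example.com/api/home/stats';

async function checkAggregates(): Promise<void> {
  const res = await fetch(STATS_URL);
  if (!res.ok) {
    throw new Error(`stats endpoint returned HTTP ${res.status}`);
  }
  const body = (await res.json()) as Record<string, unknown>;
  // Field names are assumed for illustration.
  const totals = ['totalNames', 'totalBlocks'].map((k) => Number(body[k]));
  if (totals.some((n) => !Number.isFinite(n) || n <= 0)) {
    throw new Error(`aggregate stats look empty: ${JSON.stringify(body)}`);
  }
  console.log('aggregate stats look sane');
}

checkAggregates().catch((err) => {
  console.error(err);
  process.exit(1);
});
```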

muneeb-ali commented 4 years ago

+1 to alerting. Right now community members seem to be the human alerts 😆

zone117x commented 4 years ago

Anyone available to take on ownership for this alerting?

wileyj commented 4 years ago

Shouldn't be too difficult to implement: if we can create an API endpoint that only exposes the JSON used to build the homepage data (example in https://github.com/blockstackpbc/devops/issues/366), we can wire something up pretty easily. We can't currently use the homepage itself, since we have no way to parse the HTML to extract the embedded JSON.
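
As a sketch of that idea (assuming an Express-style app; the path and `getHomepageData` helper are hypothetical), the endpoint would simply re-serve the same payload the homepage is built from, uncached, so a monitor can poll it directly:

```ts
// Sketch of a JSON-only monitoring endpoint. getHomepageData() is a
// placeholder for whatever loader the homepage already uses, so the
// monitor sees exactly the data users see, without scraping HTML.
import express from 'express';

async function getHomepageData(): Promise<Record<string, unknown>> {
  // Placeholder; the real implementation would reuse the homepage's loader.
  return { totalNames: 0, totalBlocks: 0 };
}

const app = express();

app.get('/api/monitoring/home-data', async (_req, res) => {
  res.set('Cache-Control', 'no-store'); // keep CDN/browser caches out of the picture
  res.json(await getHomepageData());
});

app.listen(3999, () => console.log('monitoring endpoint up'));
```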

diwakergupta commented 4 years ago

@zone117x @hstove the issue re-occurred today.

> @wileyj and I made some deployment modifications, and the issue should either be resolved or, if it happens again, should be easier to debug using the logs.

Can we check the logs?

> Anyone available to take on ownership for this alerting?

DevX team owns the Explorer, so ideally the team would take ownership of the alerts. DevOps team can certainly help and can be the first line of defense in this case given the legacy nature of this Explorer.

wileyj commented 4 years ago

https://github.com/blockstackpbc/devops/issues/366

We do have an alert now that will fire on some specific metrics.

diwakergupta commented 4 years ago

Summarizing various discussions here:

zone117x commented 4 years ago

> What is still missing, though, is a reliable mitigation for scenario (2) above, where the data in the DB is correct.

Redis is the cache. Clearing redis would result in the db being re-queried. It will auto-clear itself in ~10 minutes by default, without any intervention. If there are other layers of caching, e.g. Cloudflare or headers that cause the local browser to cache, those should be removed.
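
If other layers do turn out to be involved, one way to take them out of the picture (a sketch, assuming an Express-style app) is to send explicit no-store headers on the affected routes so neither Cloudflare nor the browser keeps its own copy and Redis remains the only cache:

```ts
// Sketch: disable downstream caching so Redis (with its ~10 minute TTL)
// is the single caching layer for the stats pages.
import express from 'express';

const app = express();

// Applied globally here for simplicity; it could be scoped to just the
// homepage/stats routes.
app.use((_req, res, next) => {
  res.set('Cache-Control', 'no-store');
  next();
});

// ...existing explorer routes would be mounted below...
app.listen(3000);
```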

diwakergupta commented 4 years ago

> What is still missing, though, is a reliable mitigation for scenario (2) above, where the data in the DB is correct.
>
> Redis is the cache. Clearing redis would result in the db being re-queried. It will auto-clear itself in ~10 minutes by default, without any intervention. If there are other layers of caching, e.g. Cloudflare or headers that cause the local browser to cache, those should be removed.

@wileyj note above on Redis. Let's make sure we get rid of all other caching for the legacy explorer, including Cloudflare.

wileyj commented 4 years ago

I have disabled the cache in Cloudflare, and I actually did find a bug in the code that loads the DB. Details here: https://github.com/blockstackpbc/devops/issues/419. I've also updated the code so it will clear the cache if the subdomain_records table is updated.

TL;DR - when pgloader hit an error, it was possible for the subdomain_records table to be empty for a short time until the next successful run (typically less than 10 minutes). The code now cleans up any orphaned tables in the case of a pgloader error.

The health checks added above really helped in tracking down this bug - previously it always appeared that the DB was totally fine, but this time I was able to see the problem by checking the DB/cache etc. as soon as the alerts came in.
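
The actual change lives in the devops issue linked above; purely as an illustration of the staging-table pattern being described (with made-up table and cache key names), the load could be structured something like this:

```ts
// Sketch only: load into a scratch table, swap it in atomically, clear the
// cached aggregates, and drop the scratch table if anything fails so no
// orphaned/empty table lingers until the next run.
import { Client } from 'pg';
import Redis from 'ioredis';

export async function reloadSubdomainRecords(db: Client, redis: Redis): Promise<void> {
  await db.query(
    'CREATE TABLE subdomain_records_new (LIKE subdomain_records INCLUDING ALL)'
  );
  try {
    // ...pgloader / COPY into subdomain_records_new would run here...
    await db.query('BEGIN');
    await db.query('ALTER TABLE subdomain_records RENAME TO subdomain_records_old');
    await db.query('ALTER TABLE subdomain_records_new RENAME TO subdomain_records');
    await db.query('DROP TABLE subdomain_records_old');
    await db.query('COMMIT');
    // The table just changed, so the cached aggregates are stale.
    await redis.del('home:aggregate-stats'); // assumed key name
  } catch (err) {
    await db.query('ROLLBACK').catch(() => undefined);
    // Clean up the orphaned scratch table so readers never see it.
    await db.query('DROP TABLE IF EXISTS subdomain_records_new');
    throw err;
  }
}
```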

agraebe commented 4 years ago

Can this be closed?

wileyj commented 4 years ago

I would say so - this was addressed in a separate issue, and we haven't seen it recur since. At the very least, there's nothing devx-related here, so I'll close it.