It does seem to resolve itself over time - but it's not clear how or why. Last night, I also looked at the DB where the data comes from, and everything there seemed accurate.
Just to add on to @diwakergupta's comment above: along with clearing the Redis cache, I also cleared the local cache by restarting the containers.
I'll bring this up tomorrow in the devx sprint check-in meeting
Redis self-expires, and I think all other caches do as well. My guess is that this aggregation process sometimes occurs while the postgres dump from the Stacks v1 sqlite db is in progress. A while back @wileyj and I did some work to lock the postgres db during this process. Was there a regression? Or maybe the locking isn't thorough enough?
Edit: just to clarify, caching is a red herring here. The fix shouldn't be an attempt to detect this bug and clear caches.
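Not the actual implementation, just a sketch of what locking along these lines could look like: a Node aggregation job using the pg client and a Postgres advisory lock, so it can't overlap the sqlite -> postgres import if the importer takes the same lock. The lock key and function name are made up for the example.

```typescript
import { Client } from "pg";

// Hypothetical illustration: hold a Postgres advisory lock while the
// aggregation runs, so it cannot overlap the sqlite -> postgres import
// (assuming the importer takes the same lock). Key 727 is arbitrary.
async function aggregateWithLock(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    // Blocks until the importer releases the lock.
    await client.query("SELECT pg_advisory_lock($1)", [727]);
    // ... run the aggregation queries that feed the homepage cache ...
  } finally {
    await client.query("SELECT pg_advisory_unlock($1)", [727]);
    await client.end();
  }
}
```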
Even when the cache expires (or when it is manually cleared), the fact that this doesn't reliably fix the problem suggests the cache is not populated on demand (e.g. on a cache miss) -- is there some process that periodically populates the cache from the DB?
What might be helpful from the devops side is a script to force-populate the cache when this issue is detected, combined with the monitoring changes to detect the issue.
Looks like the default redis caching behavior was not set to self-expire. Just opened a PR to set the default to 10 minutes: https://github.com/blockstack/blockstack-explorer-api/pull/45
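For reference, a 10-minute default could look roughly like the sketch below with ioredis; the key name is illustrative, not the one the explorer API actually uses.

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379");

// Write cached homepage data with a 10-minute TTL so stale entries
// self-expire instead of lingering until a manual flush.
async function cacheHomepageData(json: string): Promise<void> {
  await redis.set("home-page-data", json, "EX", 60 * 10); // hypothetical key
}
```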
Aside from potentially the CDN cache, I think this error should self-resolve a lot quicker now.
> is there some process that periodically populates the cache from the DB?
There's a keep-alive that keeps caches refreshed for a couple of main pages. It uses redis though; if redis is cleared (by itself or manually), there wouldn't be any data for the API to return. It would re-query and return fresh data or error trying to.
It may be either; the Kibana logs from around the time this occurred may indicate which.
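To make that read path concrete, here is a minimal sketch of the cache-or-requery behavior described above, assuming ioredis for the cache; `fetchHomepageFromDb` and the key name are stand-ins, not the explorer's real identifiers.

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379");
const KEY = "home-page-data"; // hypothetical cache key
const TTL_SECONDS = 600;      // matches the 10-minute default above

// On a miss (expired or manually cleared), fall back to the DB and
// repopulate the cache; any DB error surfaces to the caller.
async function getHomepageData(
  fetchHomepageFromDb: () => Promise<string>
): Promise<string> {
  const cached = await redis.get(KEY);
  if (cached) return cached;
  const fresh = await fetchHomepageFromDb();
  await redis.set(KEY, fresh, "EX", TTL_SECONDS);
  return fresh;
}
```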
@wileyj and I made some deployment modifications, and the issue should either be resolved, or if it happens again it should be easier to debug using the logs.
@diwakergupta can you comment on this issue if you see this again?
Thanks for jumping in on this, Matt.
> @wileyj and I made some deployment modifications, and the issue should either be resolved, or if it happens again it should be easier to debug using the logs.
> @diwakergupta can you comment on this issue if you see this again?
Will do, thanks @zone117x @wileyj !
I'd still like to have some alerting on this before we close out this issue though. @hstove?
FYI, the last 4 days of the Kibana logs for the API show no error responses related to this home page query. So I don't think we could use the API endpoint(s) for alerting on this specific error.
It seems to be either an API response with empty data (because of a postgres race condition), or potentially some front-end bug.
+1 to alerting. Right now community members seem to be the human alerts 😆
Anyone available to take on ownership for this alerting?
Shouldn't be too difficult to implement: if we can create an API endpoint that only exposes the JSON used to create the homepage data (example in https://github.com/blockstackpbc/devops/issues/366), we can wire something up pretty easily. We can't currently use the homepage though, since we have no way to parse the HTML to extract the embedded JSON.
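If such a JSON endpoint existed, the probe itself could be as simple as the sketch below (assumes Node 18+ for global fetch; the /api/home-data path and the nameOperations field are placeholders, not real endpoints or fields).

```typescript
// Hypothetical alert probe: fetch the homepage JSON and flag empty payloads.
// Endpoint path and expected fields are placeholders for illustration.
async function checkHomepageData(baseUrl: string): Promise<void> {
  const res = await fetch(`${baseUrl}/api/home-data`);
  if (!res.ok) {
    throw new Error(`homepage data endpoint returned ${res.status}`);
  }
  const data = await res.json();
  // Alert if the payload is missing or obviously empty.
  if (!data || !Array.isArray(data.nameOperations) || data.nameOperations.length === 0) {
    throw new Error("homepage data is empty; page would render with no data");
  }
}
```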
@zone117x @hstove the issue re-occurred today.
> @wileyj and I made some deployment modifications, and the issue should either be resolved, or if it happens again it should be easier to debug using the logs.
Can we check the logs?
> Anyone available to take on ownership for this alerting?
DevX team owns the Explorer, so ideally the team would take ownership of the alerts. DevOps team can certainly help and can be the first line of defense in this case given the legacy nature of this Explorer.
https://github.com/blockstackpbc/devops/issues/366
We do have an alert now that fires on some specific metrics.
Summarizing various discussions here:
What is still missing, though, is a reliable mitigation for scenario (2) above, where the data in the DB is correct.
Redis is the cache. Clearing redis would result in the db being re-queried. It will auto-clear itself in ~10 minutes without any intervention by default. If there are other layers of caching, e.g. Cloudflare or headers that cause the local browser to cache, those should be removed.
> What is still missing, though, is a reliable mitigation for scenario (2) above, where the data in the DB is correct.
> Redis is the cache. Clearing redis would result in the db being re-queried. It will auto-clear itself in ~10 minutes without any intervention by default. If there are other layers of caching, e.g. Cloudflare or headers that cause the local browser to cache, those should be removed.
@wileyj note above on Redis. Let's make sure we get rid of all other caching for the legacy Explorer, including Cloudflare.
I have disabled the cache in Cloudflare, and I actually did find a bug in the code that loads the DB. Details here: https://github.com/blockstackpbc/devops/issues/419
I've also updated the code so it will clear the cache if the subdomain_records table is updated.
TL;DR - when pgloader hit an error, it was possible for the subdomain_records table to also be empty for a short amount of time, until the next successful run (typically less than 10 minutes). The code now cleans up any orphaned tables in the case of a pgloader error.
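A rough sketch of the cleanup-plus-cache-clear idea described above, assuming a Node wrapper around the pgloader run; the staging table name, load file path, and cache key are simplified placeholders, not the real loader's names.

```typescript
import { execFile } from "child_process";
import { promisify } from "util";
import { Client } from "pg";
import Redis from "ioredis";

const run = promisify(execFile);

// Hypothetical wrapper: run pgloader, clear the cache when subdomain_records
// changes, and on failure drop any orphaned table left behind by the run.
async function importSubdomains(): Promise<void> {
  const pg = new Client({ connectionString: process.env.DATABASE_URL });
  const redis = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379");
  await pg.connect();
  try {
    await run("pgloader", ["/etc/pgloader/subdomains.load"]); // example load file
    // Import succeeded and subdomain_records changed: clear cached pages.
    await redis.del("home-page-data"); // hypothetical cache key
  } catch (err) {
    // Import failed: clean up the orphaned table from the aborted run.
    await pg.query("DROP TABLE IF EXISTS subdomain_records_staging");
    throw err;
  } finally {
    await pg.end();
    redis.disconnect();
  }
}
```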
The health checks added above really helped in tracking down this bug - previously, it always appeared that the DB was totally fine. I was able to see this by checking the DB/cache etc. as soon as the alerts came in.
can this be closed?
I would say so - this was addressed in a separate issue, and we haven't seen it since. At the very least, there's nothing devx related here, so I'll close it.
This has happened at least twice in the past 2-3 weeks -- we don't know the exact frequency as we don't have any alerting around this type of inconsistency yet.
In the last occurrence, the hypothesis was that this was a caching issue and wiping out the redis-cache seemed to have fixed it. More context at https://github.com/blockstackpbc/devops/issues/366
However, per @wileyj, today wiping the cache did not help.
Until Stacks 2.0 mainnet is live, we need to treat the legacy Explorer as a critical asset -- many people, investors and casual observers alike, depend on the legacy Explorer to show correct data and will continue to do so for a while longer.
A couple of asks for the @blockstack/dx-team:
Assigning to Alex for further discussion and prioritization.