I have zero knowledge about how it works in practice so please don't wait for my approval. Some general comments/questions:
- when using goroutine we may start doing many parallel requests... current synchronous approach would wait for the next "round" of "local" invalidation until mist invalidation is done.

I think most of Pawel's questions will probably be answered by a bit more debugging (sorry @leszko). I think we should add a log line to print the number of playbackIDs we're invalidating in the goroutine; then we'll have an idea of how long we'll be waiting if they are timing out. We will also then see which ones are timing out, as there is already a log line for that, and then we can check (maybe with Jaron's help) why they're timing out. We'll probably then know whether we need to tweak the timeout; 60s seems ok for now. So maybe one more test run, or just merge this but run another catalyst deployment after this catalyst-api change goes out.
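For reference, here is a minimal sketch of the kind of log line being suggested, assuming a hypothetical invalidateAll helper that receives the collected playbackIDs (the variable, function, and logger names are made up, not the actual catalyst-api code):

```go
package invalidate

import "log"

// invalidateAll logs how many playbackIDs are about to be invalidated before
// issuing the Mist calls, so we can estimate how long the cycle may block if
// some of those calls time out.
func invalidateAll(playbackIDs []string, invalidate func(string) error) {
	log.Printf("invalidating sessions for %d playbackIDs", len(playbackIDs))
	for _, id := range playbackIDs {
		if err := invalidate(id); err != nil {
			log.Printf("failed to invalidate sessions for playbackID=%s: %v", id, err)
		}
	}
}
```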
Good points Paweł!
- there's a lack of debug-level logs for any further investigation. I guess that INFO-level might be too spammy.
We do have logs, it's just that if the request hangs, then nothing is logged. But now with the timeout, here's the log we'll need to monitor during the next catalyst deployment. I could add some more debug logs like I did while digging into this issue, but TBH it's hard to decide what should be logged. While debugging, I added a line to basically every function call 🙃
- would we see an error each time a retry fails?
Yes, exactly what I mentioned in the previous point. Here's the log we'll see.
- 60 seconds sounds like a super-long time... would we accumulate tons of those 60s pending requests if Mist stops responding at all? A shorter timeout with a backoff might be a better solution.
Yes, the thing is that I'm afraid of setting it lower. This timeout applies to all Mist client calls and I'm not sure how fast Mist replies. I'm afraid that if I set it to 5s, we'd suddenly realize that some other random part of catalyst is failing: a part that has always worked only because Mist happened to reply to those requests within, say, 30s.
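For illustration, a minimal sketch of such a client-level timeout, assuming the Mist client wraps a standard net/http client; the constant and constructor names below are invented, not the actual catalyst-api code:

```go
package mist

import (
	"net/http"
	"time"
)

// MistCallTimeout bounds every HTTP call made by the Mist client. A generous
// 60s is used because it is unclear how slowly Mist may legitimately respond;
// a lower value risks breaking other call sites that tolerate slow replies.
const MistCallTimeout = 60 * time.Second

// NewMistHTTPClient returns an http.Client whose Timeout covers the whole
// request: connecting, redirects, and reading the response body. A hanging
// Mist endpoint now surfaces as an error instead of blocking forever.
func NewMistHTTPClient() *http.Client {
	return &http.Client{Timeout: MistCallTimeout}
}
```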
- Do we eventually break out of the retry logic? What happens if we don't invalidate the session?
This is in line with my conversation with @mjh1 today. So basically, there may be a problem with session invalidation, or a bug in Mist such that when catalyst-api makes the call for some particular stream, or for a stream that does not exist, it hangs forever. I think what we should do is check the logs during the next deployment. But, in short, my understanding is that if we don't invalidate a session, then a viewer who is not allowed to watch a stream will still be able to watch it. Not good, but better than not allowing anyone 🙃
- when using goroutine we may start doing many parallel requests... current synchronous approach would wait for the next "round" of "local" invalidation until mist invalidation is done.
Yes, that's true. If one "invalidate session cycle" does not finish in 5s, then we can have parallel invalidate-session goroutines running. This should not be a problem unless we start to have some congestion. Hmm, the reason I added the goroutine is that if mapic.InvalidateAllSessions(key) got stuck again for some reason, the rest of the loop would still keep running. But you're right that this could be prone to congestion. Maybe the logic should be like: "run it synchronously, but if it takes more than 5min, then just proceed with the invalidation cycle". @mjh1 do you have any thoughts on that?
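A rough sketch of that "wait up to a bound, then move on" idea, with hypothetical names (fn stands in for the mapic.InvalidateAllSessions call); this is one possible shape, not the merged code:

```go
package invalidate

import (
	"log"
	"time"
)

// runWithDeadline runs fn in a goroutine but only waits up to maxWait for it
// to finish. If fn is still running after maxWait, the caller proceeds with
// the next invalidation cycle; the goroutine is left running in the background
// and its eventual result is discarded (the buffered channel prevents a leak).
func runWithDeadline(fn func() error, maxWait time.Duration) {
	done := make(chan error, 1)
	go func() {
		done <- fn()
	}()

	select {
	case err := <-done:
		if err != nil {
			log.Printf("session invalidation failed: %v", err)
		}
	case <-time.After(maxWait):
		log.Printf("session invalidation still running after %s, continuing with the next cycle", maxWait)
	}
}
```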
- what's the consequence of invalidating the session on the catalyst-api side while not invalidating it on the Mist side? Should we monitor that with a metric?
I think the consequence is that a viewer will not be stopped from watching when they should be. So it's a bad thing, but not critical. We could add a metric, or maybe it's enough to monitor the metrics we already have (i.e. the failures of any Mist HTTP calls)? wdyt?
We don't have any of these failure counts anywhere on the existing Grafana dashboards, but it's great that we collect them!
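If a dedicated metric were added, it could be as simple as a counter of failed Mist HTTP calls, e.g. with the Prometheus Go client; the metric name and label below are invented for illustration, not the existing catalyst-api metrics:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// mistRequestFailures counts failed Mist HTTP calls, labeled by the API
// command, so a hanging or erroring endpoint shows up on a dashboard.
var mistRequestFailures = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "mist_http_request_failures_total",
		Help: "Number of failed HTTP calls to Mist, by command.",
	},
	[]string{"command"},
)

// RecordMistFailure is called whenever a Mist HTTP call returns an error,
// including timeouts.
func RecordMistFailure(command string) {
	mistRequestFailures.WithLabelValues(command).Inc()
}
```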
Maybe the logic should be like: "run it synchronously, but if it takes more than 5min, then just proceed with the invalidation cycle". @mjh1 do you have any thoughts on that?
@leszko It might be better to keep it simpler for now and remove the separate goroutine; I'm not sure it will actually help us. For example, what if the next round of invalidate calls also hangs? Then we have another goroutine sitting there. Probably not a huge danger when the loop is only running every 5 seconds, but it might not be worth it, unless you saw in testing that it would help?
@leszko btw did you test the timeout change in prod? I'm wondering if we just saw the first batch of invalidate calls time out but from then on they were fine?
Maybe the logic should be like: "run it synchronously, but if it takes more than 5min, then just proceed with the invalidation cycle". @mjh1 do you have any thoughts on that?
@leszko It might be better to keep it simpler for now and remove the separate goroutine; I'm not sure it will actually help us. For example, what if the next round of invalidate calls also hangs? Then we have another goroutine sitting there. Probably not a huge danger when the loop is only running every 5 seconds, but it might not be worth it, unless you saw in testing that it would help?
Good point. Ok, let me remove this goroutine. Since I already merged this PR, I'll push directly to main.
I'll also add one more log so we can reason about whether this code got stuck.
Added here: 382b285e2f72f5ff1d4e19512e3443318fc581bf
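For reference, a sketch of what the simplified synchronous loop with the extra timing log could look like; the names are hypothetical and the actual change is in the commit above:

```go
package invalidate

import (
	"log"
	"time"
)

// runInvalidationLoop calls invalidateAll synchronously every interval. The
// timing log makes it obvious if a cycle got stuck on a slow or hanging Mist
// call, because the next "cycle finished" line simply never appears.
func runInvalidationLoop(interval time.Duration, invalidateAll func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		start := time.Now()
		if err := invalidateAll(); err != nil {
			log.Printf("session invalidation cycle failed: %v", err)
		}
		log.Printf("session invalidation cycle finished in %s", time.Since(start))
	}
}
```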
@leszko btw did you test the timeout change in prod? I'm wondering if we just saw the first batch of invalidate calls time out but from then on they were fine?
I thought about checking this during the deployment. But it's probably a good idea to check it first in one region, before actually making the deployment.
It addresses the problem we discovered during the catalyst deployment.
Solution
Execute session invalidate HTTP requests to Mist:
- with a timeout (60s) on the Mist client calls
- in a separate goroutine, so the local invalidation cycle is not blocked
In theory any of these points would solve the issue, but it's better and safer to implement all of them.
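As a sketch of the timeout point, a per-request deadline on the invalidate call might look like this; the endpoint path, client, and function names are assumptions for illustration only, not the actual catalyst-api or Mist API:

```go
package mist

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// invalidateSessions issues the session-invalidate request to Mist with a
// per-request deadline, so a hanging Mist endpoint returns a timeout error
// instead of blocking the caller indefinitely.
func invalidateSessions(client *http.Client, mistURL, playbackID string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	// Hypothetical endpoint path, used only to make the sketch self-contained.
	url := fmt.Sprintf("%s/invalidate_sessions/%s", mistURL, playbackID)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("invalidate sessions for %s: %w", playbackID, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("invalidate sessions for %s: unexpected status %s", playbackID, resp.Status)
	}
	return nil
}
```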
Problem Description
When we deployed catalyst, then in most cases the USER_NEW trigger was blocked and we saw the following error:

Analysis of what happened

In the periodicRefreshIntervalCache() func we acquired a lock and made an HTTP call to Mist. That call hung while the lock was held, and the same lock is needed to handle the USER_NEW trigger, so in effect handling the USER_NEW trigger timed out.
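To illustrate the failure mode, here is a simplified sketch (with hypothetical names) of a lock held across a Mist HTTP call, so that the USER_NEW trigger handler blocks on the same lock for as long as that call hangs:

```go
package cache

import "sync"

// streamCache guards cached stream info with a single mutex.
type streamCache struct {
	mu   sync.Mutex
	data map[string]string
}

// periodicRefresh holds the lock while calling Mist. If fetchFromMist hangs
// (no timeout on the HTTP call), the lock is never released.
func (c *streamCache) periodicRefresh(fetchFromMist func() (map[string]string, error)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if data, err := fetchFromMist(); err == nil {
		c.data = data
	}
}

// handleUserNew needs the same lock, so while periodicRefresh is stuck on the
// hanging Mist call, handling the USER_NEW trigger blocks here and times out.
func (c *streamCache) handleUserNew(playbackID string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.data[playbackID]
	return v, ok
}
```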
I still don't understand why it hadn't happened when catalyst-api was embedded inside catalyst, but here are some possible explanations: