livepeer / catalyst-api


Fix locking during session invalidation #1377

Closed · leszko closed this 2 months ago

leszko commented 2 months ago

This PR addresses a problem we discovered during the catalyst deployment.

Solution

Execute session invalidate HTTP requests to Mist:

  1. Outside locks
  2. In a separate goroutine
  3. With timeout

In theory, any one of these changes alone would solve the issue, but it's better and safer to implement all of them. A rough sketch of the combined pattern is below.
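For illustration, here's a minimal sketch of that pattern; the type, function names and URL are made up, not the actual catalyst-api code:

```go
// Minimal sketch of the pattern above; names and URL are hypothetical.
package session

import (
	"log"
	"net/http"
	"sync"
	"time"
)

// 3. A client with a timeout, so a silent Mist can never block us forever.
var mistClient = &http.Client{Timeout: 60 * time.Second}

type sessionCache struct {
	mu   sync.Mutex
	keys []string // session keys to invalidate (hypothetical)
}

func (c *sessionCache) invalidateAll() {
	// 1. Copy what we need while holding the lock, then release it before
	// any network I/O, so USER_NEW handling is never blocked by Mist.
	c.mu.Lock()
	keys := append([]string(nil), c.keys...)
	c.mu.Unlock()

	// 2. Do the slow HTTP calls in a separate goroutine.
	go func() {
		for _, key := range keys {
			resp, err := mistClient.Post("http://127.0.0.1:4242/invalidate?key="+key, "text/plain", nil)
			if err != nil {
				log.Printf("failed to invalidate session %s: %v", key, err)
				continue
			}
			resp.Body.Close()
		}
	}()
}
```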

Problem Description

When we deployed catalyst, the USER_NEW trigger was blocked in most cases and we saw the following error:

MistOutFLV:video+9cfc2xmk4pvz6jo7 (958765) FAIL: STREAM_SOURCE trigger: http://127.0.0.1:7979/api/mist/trigger (blocking) failed to execute (Timed out), using default response: true

Analysis of what happened

  1. In the periodicRefreshIntervalCache() func we acquired a lock and made an HTTP call to Mist
  2. The HTTP request does not have any timeout, so if Mist never replies or closes the connection, the call hangs forever; that caused the lock to be held forever
  3. The same lock is used when handling the USER_NEW trigger, so in effect handling the USER_NEW trigger timed out (see the sketch below)
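A sketch of the failure mode; apart from periodicRefreshIntervalCache and USER_NEW, the names and URL are made up, not the real code:

```go
// Deadlock-prone pattern described above, with hypothetical details.
package session

import (
	"net/http"
	"sync"
)

type cacheRefresher struct {
	mu     sync.Mutex
	client *http.Client // zero Timeout: a request can hang forever
}

func (r *cacheRefresher) periodicRefreshIntervalCache() {
	r.mu.Lock()
	defer r.mu.Unlock()
	// If Mist never replies and never closes the connection, this call
	// never returns and the lock is held forever.
	resp, err := r.client.Get("http://127.0.0.1:4242/sessions")
	if err == nil {
		resp.Body.Close()
	}
}

// Handling the USER_NEW trigger takes the same lock, so it blocks behind the
// hung refresh, and Mist's blocking trigger call eventually times out and
// falls back to its default response ("true").
func (r *cacheRefresher) handleUserNew() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return true
}
```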

I still don't understand why this didn't happen when catalyst-api was embedded inside catalyst, but there are some possible explanations:

pwilczynskiclearcode commented 2 months ago

I have zero knowledge about how it works in practice so please don't wait for my approval. Some general comments/questions:

mjh1 commented 2 months ago

I think most of Pawel's questions will probably be answered by a bit more debugging (sorry @leszko). I think we should add a log line to print the number of playbackIDs we're invalidating in the goroutine; then we'll have an idea of how long we'll be waiting if they are timing out. We will also see which ones are timing out, since there is already a log line for that, and then we can check (maybe with Jaron's help) why they're timing out. We'll probably then know whether we need to tweak the timeout; 60s seems ok for now. So maybe one more test run, or just merge this but run another catalyst deployment after this catalyst-api change goes out.
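For reference, the kind of log line being suggested might look roughly like this (assuming the standard library logger; the actual code may use a different logger and names):

```go
// Illustrative only: record the batch size so we can estimate the worst-case
// wait if every invalidate call hits the 60s timeout.
package session

import "log"

func logInvalidationBatch(playbackIDs []string) {
	log.Printf("invalidating sessions for %d playbackIDs", len(playbackIDs))
}
```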

leszko commented 2 months ago

Good points Paweł!

  • there's a lack of debug-level logs for any further investigation. I guess that INFO-level might be too spammy.

We do have logs; it's just that if the request hangs, nothing is logged. But now with the timeout, there's a log line we'll need to monitor during the next catalyst deployment. I could add some more debug logs like I did while digging into this issue, but TBH it's hard to decide what should be logged. While debugging, I added a line to basically every function call 🙃

  • would we see an error each time a retry fails?

Yes, exactly what I mentioned in the previous point. Here's the log we'll see.

  • 60 seconds sounds like a super-long time... would we accumulate tons of those 60s pending requests if Mist stops responding at all? A shorter timeout with a backoff might be a better solution.

Yes, the thing is that I'm afraid of setting it lower. This timeout applies to all Mist client calls and I'm not sure how fast Mist replies. I'm afraid I'll set it to 5s and suddenly we'll realize that some other random part of catalyst is failing, one that always worked only because Mist happened to take 30s to reply to a request.
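For completeness, the shorter-timeout-plus-backoff alternative Paweł mentions could look roughly like this (hypothetical helper; not what this PR implements, for the reasons above):

```go
// Hypothetical sketch of "shorter timeout with a backoff"; not what the PR
// does, since we don't yet know how slowly Mist can legitimately respond.
package session

import (
	"fmt"
	"net/http"
	"time"
)

func invalidateWithBackoff(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	backoff := time.Second
	for attempt := 0; attempt < 4; attempt++ {
		resp, err := client.Post(url, "text/plain", nil)
		if err == nil {
			resp.Body.Close()
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2 // 1s, 2s, 4s between retries
	}
	return fmt.Errorf("invalidate %s: all retries failed or timed out", url)
}
```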

  • Do we eventually break out of the retry logic? What happens if we don't invalidate the session?

This is in line with my conversation with @mjh1 today. So basically, maybe there is a problem with session invalidation, or a bug in Mist such that when catalyst-api calls it for some particular stream, or for a stream that does not exist, the call hangs forever. I think what we should do is check the logs during the next deployment. But, in short, my understanding is that if we don't invalidate a session, then a viewer who is not allowed to watch a stream will still be able to watch it. Not good, but better than not allowing anyone 🙃

  • when using a goroutine we may start doing many parallel requests... the current synchronous approach waits until the Mist invalidation is done before starting the next "round" of "local" invalidation.

Yes, that's true. If one "invalidate session" cycle does not finish in 5s, then we can have parallel invalidate-session goroutines running. This should not be a problem unless we start to have some congestion. Hmm, the reason I added the goroutine is that if for some reason mapic.InvalidateAllSessions(key) gets stuck again, the rest of the cycle still keeps running. But you're right that this could be prone to congestion. Maybe the logic should be like: "run it synchronously, but if it takes more than 5min, then just proceed with the invalidation cycle" (see the sketch below). @mjh1 do you have any thoughts on that?
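That "run it synchronously, but give up after a deadline" idea could look roughly like this (hypothetical helper; 5 minutes is just the figure from the comment above):

```go
// Hypothetical sketch: run the invalidation synchronously, but don't block
// the invalidation cycle for longer than the given deadline.
package session

import (
	"log"
	"time"
)

func invalidateWithDeadline(invalidate func(), deadline time.Duration) {
	done := make(chan struct{})
	go func() {
		invalidate() // e.g. mapic.InvalidateAllSessions(key)
		close(done)
	}()
	select {
	case <-done:
		// Finished in time; the next cycle starts as usual.
	case <-time.After(deadline):
		// Proceed with the cycle; the goroutine keeps running and will
		// finish (or time out at the HTTP layer) on its own.
		log.Printf("session invalidation still running after %s, continuing", deadline)
	}
}
```

It could be called as, e.g., invalidateWithDeadline(func() { mapic.InvalidateAllSessions(key) }, 5*time.Minute).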

  • what's the consequence of invalidating the session on the catalyst-api side while not invalidating it on the Mist side? Should we monitor that with a metric?

I think it means a viewer will not be stopped from watching when they should be. So a bad thing, but not critical. We could add a metric, or maybe it's enough to monitor the metrics we already have (i.e. the failures of any Mist HTTP calls)? wdyt?
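If a dedicated metric turned out to be useful, a minimal sketch with the Prometheus Go client could look like this (the metric name is made up, and the existing Mist HTTP failure metrics may already cover it):

```go
// Hypothetical metric for invalidation calls to Mist that failed or timed
// out; the name is made up and may duplicate existing failure metrics.
package session

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var mistInvalidationFailures = promauto.NewCounter(prometheus.CounterOpts{
	Name: "catalyst_mist_session_invalidation_failures_total",
	Help: "Session invalidation calls to Mist that failed or timed out.",
})

func recordInvalidationFailure() {
	mistInvalidationFailures.Inc()
}
```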

pwilczynskiclearcode commented 2 months ago

We don't have any of these failure counts anywhere on the existing Grafana dashboards, but it's great that we collect them!

mjh1 commented 2 months ago

Maybe the logic should be like: "run it synchronously, but if it takes more than 5min, then just proceed with the invalidation cycle". @mjh1 do you have any thoughts on that?

@leszko It might be better to keep it simpler for now and remove the separate goroutine; I'm not sure we know for certain it will help us. For example, what if the next round of invalidate calls also hangs? Then we have another goroutine sitting there. Probably not a huge danger when the loop only runs every 5 seconds, but it might not be worth it, unless you saw in testing that it would help?

mjh1 commented 2 months ago

@leszko btw did you test the timeout change in prod? I'm wondering if we just saw the first batch of invalidate calls time out but from then on they were fine?

leszko commented 2 months ago

Maybe the logic should be like: "run it synchronously, but if it takes more than 5min, then just proceed with the invalidation cycle". @mjh1 do you have any thoughts on that?

@leszko It might be better to keep it simpler for now and remove the separate goroutine; I'm not sure we know for certain it will help us. For example, what if the next round of invalidate calls also hangs? Then we have another goroutine sitting there. Probably not a huge danger when the loop only runs every 5 seconds, but it might not be worth it, unless you saw in testing that it would help?

Good point. Ok, let me remove this goroutine. Since I already merged this PR, I'll push directly to main.

leszko commented 2 months ago

I'll also add one more log line so we can tell whether this code got stuck.

leszko commented 2 months ago

Added here: 382b285e2f72f5ff1d4e19512e3443318fc581bf

leszko commented 2 months ago

@leszko btw did you test the timeout change in prod? I'm wondering if we just saw the first batch of invalidate calls time out but from then on they were fine?

I thought about checking this during the deployment. But it's probably a good idea to check it first in one region before actually making the full deployment.