pnpm / pacquet

experimental package manager for node.js
Apache License 2.0
767 stars 21 forks

Cache should not be racy #49

Open steveklabnik opened 1 year ago

steveklabnik commented 1 year ago

So in #42, one issue that comes up multiple times is cache misses.

Basically, RegistryManager::get_package is racy:

  1. look up package in cache, if it's there, return it
  2. if not, fetch the package
  3. insert package into cache

This works, but it has a race condition (though not a data race!). If we spin up two threads that are both looking for the same package, thread 1 runs step 1, doesn't find it, and moves on to step 2 to fetch; thread 2 then starts and does the same, and we've fetched the package twice. This manifests as a cache miss.
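To make the race concrete, here's a minimal std-only sketch. Everything in it is a stand-in (a String for the tarball, a counter instead of real network I/O, and a Barrier that forces both threads past the lookup so the duplicated fetch happens on every run rather than only sometimes):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

// Steps 1-3 from above: the lock is released between the lookup and the
// insert, so two threads can both miss and both fetch.
fn get_package(
    cache: &Mutex<HashMap<String, String>>,
    fetches: &AtomicUsize,
    gate: &Barrier,
    name: &str,
) -> String {
    if let Some(pkg) = cache.lock().unwrap().get(name) {
        return pkg.clone(); // step 1: cache hit
    }
    gate.wait(); // force both threads past the check, making the race deterministic
    fetches.fetch_add(1, Ordering::SeqCst); // step 2: pretend to fetch
    let pkg = format!("tarball-for-{name}");
    cache.lock().unwrap().insert(name.to_string(), pkg.clone()); // step 3
    pkg
}

// Returns how many times the "fetch" ran for a single package name.
fn racy_demo() -> usize {
    let cache = Arc::new(Mutex::new(HashMap::new()));
    let fetches = Arc::new(AtomicUsize::new(0));
    let gate = Arc::new(Barrier::new(2));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let (c, f, g) = (Arc::clone(&cache), Arc::clone(&fetches), Arc::clone(&gate));
            thread::spawn(move || get_package(&c, &f, &g, "left-pad"))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    fetches.load(Ordering::SeqCst)
}

fn main() {
    let n = racy_demo();
    // Both threads miss and both fetch: the duplicated work described above.
    assert_eq!(n, 2);
    println!("fetched {n} times");
}
```

In real code the interleaving depends on timing, which is why this shows up as intermittent cache misses rather than a reliably reproducible bug.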

The alternative is to make step 1 block in some fashion. We want it to hold some sort of lock, so that if someone else wants something we're working on, we ask them to wait until it's ready. But there's not just one way to do this!

The first way, and the most obvious, is to put a mutex around the whole thing. That way, if someone is downloading a new package, you wait until they're done. The issue there is that you'd end up blocking everyone whenever any package is being downloaded, which kind of defeats the purpose of downloading them in parallel, haha.
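A sketch of that whole-thing-mutex approach, again with stand-in types and a fake fetch: holding the Mutex across the download means the second thread deterministically sees a cache hit, but only because it sat blocked behind the first thread's entire download:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// One big lock held across lookup *and* fetch: correct, but every other
// thread now waits behind each download.
fn get_package_serial(
    cache: &Mutex<HashMap<String, String>>,
    fetches: &AtomicUsize,
    name: &str,
) -> String {
    let mut cache = cache.lock().unwrap(); // held for the whole operation
    if let Some(pkg) = cache.get(name) {
        return pkg.clone();
    }
    fetches.fetch_add(1, Ordering::SeqCst); // pretend to fetch
    let pkg = format!("tarball-for-{name}");
    cache.insert(name.to_string(), pkg.clone());
    pkg
}

// With the lock held across the fetch, exactly one thread ever fetches.
fn serial_demo() -> usize {
    let cache = Arc::new(Mutex::new(HashMap::new()));
    let fetches = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let (c, f) = (Arc::clone(&cache), Arc::clone(&fetches));
            thread::spawn(move || get_package_serial(&c, &f, "left-pad"))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    fetches.load(Ordering::SeqCst)
}

fn main() {
    assert_eq!(serial_demo(), 1);
    println!("fetched {} time(s)", serial_demo());
}
```

Correctness is restored, but all downloads now serialize behind a single lock, which is the drawback described above.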

So my next thought is that we need to change the data modelling from "package exists or not" to "package doesn't exist, package is being downloaded, package is complete in the cache." This more properly represents what's going on in the real world. But it's not actually the best way forward. Why? Well, what happens when we have a cache hit? We want to simply return the package. If we know that some other task is already downloading the package, what we actually want is to return: we have no more work to do! Sure, it's not ready yet, but the other task will make sure it is.
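For reference, that three-state modelling could look something like this minimal sketch (the enum and all names are illustrative, not pacquet's actual types); a task that sees Downloading just returns, as described above:

```rust
use std::collections::HashMap;

// Illustrative three-state cache entry.
#[derive(Debug, PartialEq)]
enum CacheEntry {
    Downloading,    // some task already owns the fetch
    Cached(String), // the tarball (a String stand-in here) is ready
}

// What the caller should do after a lookup.
#[derive(Debug, PartialEq)]
enum Lookup {
    StartDownload, // absent: this caller should fetch it
    InFlight,      // another task is on it: nothing more for us to do
    Hit(String),   // ready: hand it back
}

fn look_up(cache: &mut HashMap<String, CacheEntry>, name: &str) -> Lookup {
    match cache.get(name) {
        None => {
            // Claim the slot *before* fetching, so later callers see InFlight.
            cache.insert(name.to_string(), CacheEntry::Downloading);
            Lookup::StartDownload
        }
        Some(CacheEntry::Downloading) => Lookup::InFlight,
        Some(CacheEntry::Cached(pkg)) => Lookup::Hit(pkg.clone()),
    }
}

fn main() {
    let mut cache = HashMap::new();
    assert_eq!(look_up(&mut cache, "left-pad"), Lookup::StartDownload);
    assert_eq!(look_up(&mut cache, "left-pad"), Lookup::InFlight);
    cache.insert("left-pad".to_string(), CacheEntry::Cached("tarball".to_string()));
    assert!(matches!(look_up(&mut cache, "left-pad"), Lookup::Hit(_)));
    println!("state transitions behave as expected");
}
```

The key move is that the lookup and the transition to Downloading happen atomically (here trivially, via the single &mut borrow; in real code, under whatever lock guards the map), so no second task can slip in between them.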

So, I think that's the task to fix this issue. I don't think it should be a super complex diff, but I'm not going to try and write a patch right this second.

steveklabnik commented 1 year ago

@boshen had asked this over there:

For cache misses, I want to learn the proper approach as well. Do we put the data structure in a single thread and communicate with it via message passing, or do we use some mutexes and condvars as described in this blog post https://medium.com/@polyglot_factotum/rust-concurrency-patterns-condvars-and-locks-e278f18db74f?

I think both can work; it just depends. Well, rather: I would never use a condvar unless you're absolutely sure you need a condvar ;)

The basic question is "Mutex or message passing?", and in my opinion, mutexes tend to be easier more often than not. It really just depends: if you want to do an actor-style "this task owns this data and responds to messages about it" design, that works fine too.
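A minimal std::sync::mpsc sketch of that actor style, where one thread owns the cache outright and everyone else talks to it over a channel (all names here are hypothetical, and a real version would carry richer messages):

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Messages the cache actor understands.
enum Msg {
    Insert(String, String),
    Get(String, mpsc::Sender<Option<String>>),
}

// Spawn a thread that exclusively owns the HashMap; no Mutex needed.
fn spawn_cache_actor() -> mpsc::Sender<Msg> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut cache: HashMap<String, String> = HashMap::new();
        // The loop ends once every sender has been dropped.
        for msg in rx {
            match msg {
                Msg::Insert(name, pkg) => {
                    cache.insert(name, pkg);
                }
                Msg::Get(name, reply) => {
                    let _ = reply.send(cache.get(&name).cloned());
                }
            }
        }
    });
    tx
}

// Convenience wrapper: request-reply over a one-shot channel.
fn get(actor: &mpsc::Sender<Msg>, name: &str) -> Option<String> {
    let (reply_tx, reply_rx) = mpsc::channel();
    actor.send(Msg::Get(name.to_string(), reply_tx)).unwrap();
    reply_rx.recv().unwrap()
}

fn main() {
    let actor = spawn_cache_actor();
    assert_eq!(get(&actor, "left-pad"), None);
    actor
        .send(Msg::Insert("left-pad".into(), "tarball".into()))
        .unwrap();
    assert_eq!(get(&actor, "left-pad"), Some("tarball".to_string()));
    println!("actor cache works");
}
```

Because only the actor thread ever touches the map, the check-then-insert sequence can't interleave with anyone else's, which sidesteps the original race by construction.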

steveklabnik commented 1 year ago

We want to simply return the package. If we know that some other task is already downloading the package, what we actually want is to return. We have no more work to do! Sure it's not ready yet, but the other task will make sure it is.

Speaking with @anonrig in private messages, this won't actually work: the caller can assume that once the call returns, the package is downloaded, so problems arise if that follow-up work begins before the task that's currently downloading the package has finished. So this won't work.

So really what we're looking for is a sort of synchronization point, where for example things block on "everything is downloaded" rather than a bunch of individual "this is downloaded" responses. I feel like that's slightly poorly worded but I hope it makes sense.
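One way to express that single synchronization point with plain threads (the fetch is a stand-in, and in pacquet this would presumably be async tasks rather than OS threads): kick off every download up front, then block once on "everything is downloaded" by joining all the handles, instead of having callers wait on individual packages:

```rust
use std::thread;

// Spawn one fake download per package, then wait for all of them at once.
fn download_all(names: &[&str]) -> Vec<String> {
    let handles: Vec<_> = names
        .iter()
        .map(|name| {
            let name = name.to_string();
            thread::spawn(move || format!("tarball-for-{name}")) // stand-in fetch
        })
        .collect();
    // The single barrier the whole install waits behind.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let tarballs = download_all(&["left-pad", "is-odd"]);
    assert_eq!(tarballs.len(), 2);
    assert!(tarballs.contains(&"tarball-for-left-pad".to_string()));
    println!("all {} downloads complete", tarballs.len());
}
```

Downloads still run concurrently; the only blocking happens once, at the end, which is exactly the "everything is downloaded" synchronization point rather than many per-package ones.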

metcoder95 commented 1 year ago

What about semaphores over individual packages? Maybe I'm being too simplistic, but each package might be in three states: cached, downloading, pending; and each thread coordinates on what step follows.

Or has that already been discarded? 😅

anonrig commented 1 year ago

Maybe I'm being too simplistic, but each package might be in three states: cached, downloading, pending; and each thread coordinates on what step follows.

I think this might be the best approach.