iterative / PyDrive2

Google Drive API Python wrapper library. Maintained fork of PyDrive.
https://docs.iterative.ai/PyDrive2

fs: synchronize updates to cache #289

Closed skshetry closed 1 year ago

skshetry commented 1 year ago

https://github.com/iterative/PyDrive2/pull/286/files/b9fc857613ee18cd08e32c691933a82f5f371e79#r1239272908

Looks like we need to synchronize cache updates, if we expect dirs and ids to be in sync.

shcheklein commented 1 year ago

Thanks @skshetry. I don't think this solves the race condition issue, though (unless RLock guarantees atomicity, which would surprise me, and I would be concerned about the performance in that case). Yes, the update becomes thread safe (and, as we agreed before, that might not even be an issue: we don't need to try to isolate and serialize "writers", or at least that's not enough). In this case you still have an unprotected reader in the find, which can still get different sets from two subsequent calls.
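For illustration, here is a minimal standalone sketch of this kind of interleaving (simplified, hypothetical names; not the actual spec.py code): a writer updates two maps under a lock, but a reader that does not take the lock can still observe the first map updated while the second one is not.

    import threading
    import time

    # Simplified model of the cache: two dicts that must stay in sync.
    cache = {"dirs": {}, "ids": {}}
    lock = threading.RLock()

    def writer(base, item_id):
        # Both writes are serialized by the lock, but they are two steps.
        with lock:
            cache["dirs"][base] = [item_id]
            time.sleep(0.05)  # widen the window between the two writes
            cache["ids"][item_id] = base

    def unprotected_reader(base, item_id):
        # Mirrors the unprotected read path: no lock is taken here.
        cached = base in cache["dirs"]
        has_id = item_id in cache["ids"]
        print(f"cached={cached}, has_id={has_id}")

    w = threading.Thread(target=writer, args=("file/md5", "id-1"))
    r = threading.Thread(target=unprotected_reader, args=("file/md5", "id-1"))
    w.start()
    time.sleep(0.01)  # let the writer get between the two updates
    r.start()
    w.join()
    r.join()
    # Typical output: cached=True, has_id=False -- a partial update is visible.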

We can expand the cache init to include necessary roots there (file/md5/00 .... etc) the same way it initializes 00 etc now.

skshetry commented 1 year ago

In this case you still have an unprotected reader in the find, which can still get different sets from two subsequent calls.

I think that is okay. I don't think we want to optimize this too much (the worst case is parallel calls when the cache is empty). From dvc's point of view, this should be fine, I think.

We can expand the cache init to include necessary roots there (file/md5/00 .... etc) the same way it initializes 00 etc now.

I'd say this should be fixed on dvc's side by using a separate filesystem for the files/md5 remote. Then they will automatically get cached.
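As an illustration of that suggestion (a hedged sketch only, assuming the documented GDriveFileSystem constructor; the folder ID, path, and credentials below are placeholders):

    from pydrive2.fs import GDriveFileSystem

    # Hypothetical: a filesystem instance rooted at the remote's files/md5
    # prefix, so that prefix is handled (and cached) by its own _ids_cache.
    fs = GDriveFileSystem(
        "placeholder_folder_id/files/md5",
        client_id="placeholder-client-id",
        client_secret="placeholder-client-secret",
    )

    print(fs.find("placeholder_folder_id/files/md5"))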


I would be concerned about the performance then

If you look into _ids_cache, it is behind an RLock, so the intention seems to be to protect writes/reads anyway. It's just that it's not enough here, as there are two writes involved. So I think that was always the intention, and we are already locking every access to the cache.

shcheklein commented 1 year ago

Could you clarify then what this RLock does for you? We might be thinking about different scenarios.

In my case I mean this:

        cached = base in self._ids_cache["dirs"]
...
        dir_ids = [self._ids_cache["ids"].copy()]

it can return cached = True, while dir_ids is still empty AFAIU.

It means that the result of the find might be empty in a case where it should not be.

Are we both talking about the same case?

skshetry commented 1 year ago

@shcheklein, yes, that is what RLock ensures.

https://github.com/iterative/PyDrive2/blob/afc2e3353011a77100efa8201e995aa5294db54d/pydrive2/fs/spec.py#L272-L273

If there is base in self._ids_cache['dirs'], there must be self._ids_cache['ids'] for that base too.

shcheklein commented 1 year ago

If there is base in self._ids_cache['dirs'], there must be self._ids_cache['ids'] for that base too.

Could you elaborate on this? As far as I can tell that happens only if Python doesn't yield control to another thread while it's inside the RLock, but AFAIK that's not the case.

skshetry commented 1 year ago

Could you elaborate on this? As far as I can tell that happens only if Python doesn't yield control to another thread while it's inside the RLock, but AFAIK that's not the case.

The other thread will be blocked and won't enter this function, until the owning thread releases the lock.

From docs on RLock/Lock:

... if this thread already owns the lock, increment the recursion level by one, and return immediately. Otherwise, if another thread owns the lock, block until the lock is unlocked. Once the lock is unlocked (not owned by any thread), then grab ownership, set the recursion level to one, and return. If more than one thread is blocked waiting until the lock is unlocked, only one at a time will be able to grab ownership of the lock. There is no return value in this case.
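As a small standalone illustration of the behavior described in the quote (not PyDrive2 code): the owning thread can re-acquire the RLock, while a second thread blocks until the owner releases it.

    import threading
    import time

    lock = threading.RLock()

    def owner():
        with lock:
            with lock:  # re-entrant: recursion level goes to 2, no deadlock
                print("owner: holding the lock (twice)")
                time.sleep(0.1)
        print("owner: released")

    def waiter():
        print("waiter: trying to acquire")
        with lock:  # blocks until owner() fully releases the lock
            print("waiter: acquired")

    t1 = threading.Thread(target=owner)
    t2 = threading.Thread(target=waiter)
    t1.start()
    time.sleep(0.01)
    t2.start()
    t1.join()
    t2.join()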

shcheklein commented 1 year ago

The other thread will be blocked and won't enter this function, until the owning thread releases the lock.

In the snippet that I've shared, the other thread is not using that function:

        cached = base in self._ids_cache["dirs"]
...
        dir_ids = [self._ids_cache["ids"].copy()]

So, even if there is a thread A that is doing an update under the RLock, and a thread B that is waiting to do that update, there could be a thread C that is executing the very top of the find. When the thread doing the update is in the middle of it (having just updated dirs), it can be asked (?) to yield control to thread C. That can read self._ids_cache["dirs"] while self._ids_cache["ids"] is still empty.

skshetry commented 1 year ago

That can read self._ids_cache["dirs"] while self._ids_cache["ids"] is still empty.

But if we synchronized updates, there is not going to be a possibility of having dirs but not ids. They are updated at once. Or maybe I am missing something here?

shcheklein commented 1 year ago

But if we synchronized updates, there is not going to be a possibility of having dirs but not ids. They are updated at once

That's my point / concern. I don't think there is a guarantee for that. There are a lot of operations involved in updating all these maps, reading them back, etc., and none of them is atomic. You have to put readers and writers under the lock, or go partially lockless (there can be good reasons for doing that), but extra care is required in that case.

shcheklein commented 1 year ago

This is guaranteed by the lock in this function,

Sorry, we are going a bit in circles, but could you elaborate / confirm this? (I guess that's the main thing to discuss here.) The way I see it (granted, I'm not deeply familiar with CPython internals and haven't looked into this yet), there is no guarantee for that.

There is a guarantee that no two modifications to the cache happen simultaneously. There is no guarantee that, while someone updates it, a reader cannot see a partial update.
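For reference, a sketch of the direction discussed later in the thread, i.e. taking the same lock on the read path as well (simplified, hypothetical names; not the actual spec.py code):

    import threading

    class IdsCache:
        # Simplified model: both writers and readers take the same RLock,
        # so a reader can never observe "dirs" updated while "ids" is not.
        def __init__(self):
            self._lock = threading.RLock()
            self._dirs = {}
            self._ids = {}

        def add(self, base, item_id):
            with self._lock:
                # Both writes happen under one lock acquisition, so they
                # appear atomic to any other thread that takes the lock.
                self._dirs.setdefault(base, []).append(item_id)
                self._ids[item_id] = base

        def lookup(self, base):
            with self._lock:
                # The read path also takes the lock, so it sees either
                # none or all of a given update, never a partial one.
                if base not in self._dirs:
                    return None
                return list(self._dirs[base])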

skshetry commented 1 year ago

Okay, I now see what you are saying. I’ll have to think about it more.

skshetry commented 1 year ago

Closing for now.

shcheklein commented 1 year ago

hey @skshetry - do we need at least a ticket for this or do you plan to address it? I think I can do a PR to hard code certain roots to make DVC stable and we can get back to a proper general implementation later, wdyt?

shcheklein commented 1 year ago

@skshetry friendly ping.

skshetry commented 1 year ago

@shcheklein, I have to wrap all reads and writes with RLock. Will try to take a look this week.

I think I can do a PR to hard code certain roots to make DVC stable and we can get back to a proper general implementation later, wdyt

I don't think it's worth it to do this. It's 2 API calls that we are saving.

shcheklein commented 1 year ago

@skshetry sounds good.

I don't think it's worth it to do this. It's 2 API calls that we are saving.

I'm not worried about 2 API calls per session. I'm more worried about lock performance and the general complexity of the code.

As a first step, please create a ticket for this, for visibility and so that we don't forget about it.