Closed: skshetry closed this 1 year ago.
Thanks @skshetry. I don't think this solves the race condition issue, though (unless RLock guarantees atomicity, in which case I would be concerned about the performance, and would be surprised). Yes, update becomes thread safe (and, as we agreed before, it might not even be an issue: we don't need to try to isolate and serialize "writers", or at least it's not enough). In this case you still have an unprotected reader in `find`, which can still get different sets from two subsequent calls.
We can expand the cache init to include the necessary roots there (`file/md5/00` .... etc) the same way it initializes `00` etc now.
> In this case you still have an unprotected reader in `find`, which can still get different sets from two subsequent calls.
I think that is okay. I don't think we want to optimize this too much (the worst case is parallel calls when the cache is empty). From dvc's point of view, this should be fine, I think.
> We can expand the cache init to include the necessary roots there (`file/md5/00` .... etc) the same way it initializes `00` etc now.
I'd say this should be fixed on dvc's side by using a separate filesystem for the `files/md5` remote. Then they will automatically get cached.
> I would be concerned about the performance then
If you look into `_ids_cache`, it is behind an RLock, so the intention seems to be protecting writes/reads anyway. It's just that it's not enough here, as there are two writes involved.
So, I think that was always the intention, and we are already locking every access to the cache.
Could you clarify then what this RLock does for you? We might be thinking about different scenarios.
In my case I mean this:

```python
cached = base in self._ids_cache["dirs"]
...
dir_ids = [self._ids_cache["ids"].copy()]
```

It can return `cached = True` while `dir_ids` is still empty, AFAIU. It means that the result of `find` might be empty in cases where it should not be.
Are we both talking about the same case?
@shcheklein, yes, that is what RLock ensures. If there is `base` in `self._ids_cache['dirs']`, there must be `self._ids_cache['ids']` for that `base` too.
> If there is `base` in `self._ids_cache['dirs']`, there must be `self._ids_cache['ids']` for that `base` too.
Could you elaborate on this? As far as I can tell, that happens only if Python doesn't yield control to another thread while it's inside the RLock, but AFAIK that's not the case.
> Could you elaborate on this? As far as I can tell, that happens only if Python doesn't yield control to another thread while it's inside the RLock, but AFAIK that's not the case.
The other thread will be blocked and won't enter this function, until the owning thread releases the lock.
From the docs on `RLock`/`Lock`:

> ... if this thread already owns the lock, increment the recursion level by one, and return immediately. Otherwise, if another thread owns the lock, block until the lock is unlocked. Once the lock is unlocked (not owned by any thread), then grab ownership, set the recursion level to one, and return. If more than one thread is blocked waiting until the lock is unlocked, only one at a time will be able to grab ownership of the lock. There is no return value in this case.
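As a side note, both behaviors from the quoted docs (reentrancy for the owning thread, blocking for everyone else) can be seen in a small standalone sketch; this is a hypothetical illustration, not code from the PR:

```python
import threading

lock = threading.RLock()

# Reentrant: the owning thread may acquire the same RLock again without
# deadlocking; each nested acquire just bumps the recursion level.
with lock:
    with lock:
        reentered = True

# Blocking: another thread cannot take the lock while it is held.
# We use blocking=False so the probe reports failure instead of waiting.
acquired_while_held = []

def try_acquire():
    got_it = lock.acquire(blocking=False)
    acquired_while_held.append(got_it)
    if got_it:
        lock.release()

with lock:
    t = threading.Thread(target=try_acquire)
    t.start()
    t.join()

print(reentered)               # True
print(acquired_while_held[0])  # False: the main thread still owned the lock
```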
> The other thread will be blocked and won't enter this function, until the owning thread releases the lock.
In the snippet that I've shared, the other thread is not using that function:

```python
cached = base in self._ids_cache["dirs"]
...
dir_ids = [self._ids_cache["ids"].copy()]
```

So, even if there is a thread A that is doing an update under the RLock, and a thread B that is waiting to do that update, there could be a thread C that is executing the very top of `find`. When thread A is in the middle of an update (it has just updated `dirs`), it can be asked (?) to yield control to thread C. That can read `self._ids_cache["dirs"]` while `self._ids_cache["ids"]` is still empty.
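The interleaving described above can be made deterministic in a small sketch. The names (`_ids_cache`, `update`, `find`, `base`) mirror the snippet but are hypothetical stand-ins, and an `Event` simulates the scheduler preempting the writer between its two writes:

```python
import threading

_ids_cache = {"dirs": {}, "ids": {}}
_lock = threading.RLock()
half_done = threading.Event()  # writer signals it is mid-update
proceed = threading.Event()    # main thread tells writer to finish

def update(base):
    # Thread A: the writer, holding the RLock for the whole update.
    with _lock:
        _ids_cache["dirs"][base] = True
        half_done.set()        # simulate being preempted here...
        proceed.wait()         # ...while "ids" has not been filled in yet
        _ids_cache["ids"][base] = ["id1", "id2"]

def find(base):
    # Thread C: a reader that does NOT take the lock, like the snippet.
    cached = base in _ids_cache["dirs"]
    dir_ids = _ids_cache["ids"].get(base, [])
    return cached, dir_ids

writer = threading.Thread(target=update, args=("root",))
writer.start()
half_done.wait()               # writer is now mid-update, lock held

cached, dir_ids = find("root")
print(cached, dir_ids)         # True [] -- "cached", yet no ids

proceed.set()
writer.join()
print(find("root"))            # (True, ['id1', 'id2']) after the update
```

The writer never releases the lock during the pause, yet the lockless reader still observes `dirs` without the matching `ids` entry, which is exactly the torn read being debated.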
> That can read `self._ids_cache["dirs"]` while `self._ids_cache["ids"]` is still empty.
But if we synchronized updates, there is not going to be a possibility of having `dirs` but not `ids`. They are updated at once. Or maybe I am missing something here?
> But if we synchronized updates, there is not going to be a possibility of having `dirs` but not `ids`. They are updated at once.
That's my point / concern. I don't think there is a guarantee for that. There are a lot of operations involved in updating all these maps, reading them back, etc., and none of those are atomic. You have to put readers and writers under the lock, or do it partially lockless (and there has to be a good reason for that, since extra care is required in that case).
> This is guaranteed by the lock in this function.
Sorry, we are going a bit in circles, but could you elaborate / confirm this? (I guess that's the main thing to discuss here.) The way I see it (granted, I'm not deeply familiar with CPython and haven't looked at it yet), there is no guarantee for that.
There is a guarantee that no two modifications to the cache happen simultaneously. There is no guarantee that, while someone updates it, a reader can't see a partial update.
Okay, I now see what you are saying. I’ll have to think about it more.
Closing for now.
Hey @skshetry - do we need at least a ticket for this, or do you plan to address it? I think I can do a PR to hard-code certain roots to make DVC stable, and we can get back to a proper general implementation later, wdyt?
@skshetry friendly ping.
@shcheklein, I have to wrap all reads and writes with the RLock. Will try to take a look this week.
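The fix described here can be sketched roughly as follows. This is a hypothetical shape, not the actual PyDrive2 code: once readers take the same RLock as writers, a reader can no longer observe `dirs` without the matching `ids`:

```python
import threading

class IdsCache:
    """Toy cache with the invariant: if base is in 'dirs', 'ids' has it too."""

    def __init__(self):
        self._lock = threading.RLock()
        self._dirs = set()
        self._ids = {}

    def update(self, base, ids):
        # Writer: both structures change under one lock acquisition,
        # so no locked reader can ever see one without the other.
        with self._lock:
            self._dirs.add(base)
            self._ids[base] = list(ids)

    def find(self, base):
        # Reader: also takes the lock, so it observes either the state
        # before the update or after it, never a partial one.
        with self._lock:
            if base not in self._dirs:
                return None
            return list(self._ids[base])

cache = IdsCache()
cache.update("root", ["id1", "id2"])
print(cache.find("root"))   # ['id1', 'id2']
print(cache.find("other"))  # None
```

The cost is that every `find` now contends on the lock, which ties back to the performance and complexity concerns raised earlier in the thread.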
> I think I can do a PR to hard-code certain roots to make DVC stable, and we can get back to a proper general implementation later, wdyt
I don't think it's worth it to do this. It's 2 API calls that we are saving.
@skshetry sounds good.
> I don't think it's worth it to do this. It's 2 API calls that we are saving.
I'm not worried about 2 API calls per session. I'm more worried about lock performance and general complexity of code.
As a first step, let's please create a ticket for this, for visibility and so that we don't forget about it.
https://github.com/iterative/PyDrive2/pull/286/files/b9fc857613ee18cd08e32c691933a82f5f371e79#r1239272908
Looks like we need to synchronize cache updates, if we expect `dirs` and `ids` to be in sync.