Closed rwightman closed 5 years ago
Oh, for label2server
, there is specific reading/writing pattern here.
The workflow is RegServer
-> waitForRegs
-> read-only access (via label2server
)
During RegServer
, all servers are registering themselves. These registrations are protected by register_mutex_
, so writing to the concurrent queue is protected. There is a separate waitForRegs
that waits until all registrations are done.
https://github.com/pytorch/ELF/blob/experimental/src_cpp/elf/base/game_context.h#L57 https://github.com/pytorch/ELF/blob/experimental/src_cpp/elf/base/context.h#L468
After that there is no writing to the concurrent queue. So RegServer
will never overlap with label2server
.
It looks like unique_ptr
is not necessary here so I remove it.
Alright, makes more sense with that workflow as it wasn't clear without a deep understanding of the code when the label2server calls would start happening.
One other comment with the waitfForRegs ... to make it more explicit and clear and make sure there are no possible races, I'd use the same registermutex for the registration counter and wait condition, and not the separate counter_ instance.
Definitely agree unique_ptr wasn't needed, vector by itself is perfect there.
Look forward to seeing how this project evolves!
I was just scanning recent changes, including experimental branch to see what the recent activity in this project was. A quick heads up, last two commits on experimental pointed out some thread safety issues.
Specifically looking at: https://github.com/pytorch/ELF/commit/0d755c9b7a86e9f44a148b5d2104efb394e89f8c
I don't believe there was an issue with using unique_ptr itself (if crashing in multi-thread usage was the reason for its removal). There is still a thread safety issue after those changes were made if label2server can be called at the same time as RegServer.
A comment in label2server says 'no lock need, only reading' is false. You're reading a vector, which is not thread safe when another thread can modify that vector. Previously you were dereferencing a unique_ptr without checking as well, which is also not thread safe if another thread is modifying it (which it was, by calling reset). With the current code, you can still have one thread modifying the id vector (and possibly causing a realloc of the underlying memory block) while the other thread is acccessing it.