pytorch / ELF

ELF: a platform for game research with AlphaGoZero/AlphaZero reimplementation
Other
3.37k stars 566 forks source link

thread safety #118

Closed rwightman closed 5 years ago

rwightman commented 5 years ago

I was just scanning recent changes, including experimental branch to see what the recent activity in this project was. A quick heads up, last two commits on experimental pointed out some thread safety issues.

Specifically looking at: https://github.com/pytorch/ELF/commit/0d755c9b7a86e9f44a148b5d2104efb394e89f8c

I don't believe there was an issue with using unique_ptr itself (if crashing in multi-thread usage was the reason for its removal). There is still a thread safety issue after those changes were made if label2server can be called at the same time as RegServer.

A comment in label2server says 'no lock need, only reading' is false. You're reading a vector, which is not thread safe when another thread can modify that vector. Previously you were dereferencing a unique_ptr without checking as well, which is also not thread safe if another thread is modifying it (which it was, by calling reset). With the current code, you can still have one thread modifying the id vector (and possibly causing a realloc of the underlying memory block) while the other thread is acccessing it.

yuandong-tian commented 5 years ago

Oh, for label2server, there is specific reading/writing pattern here.

The workflow is RegServer -> waitForRegs -> read-only access (via label2server)

During RegServer, all servers are registering themselves. These registrations are protected by register_mutex_, so writing to the concurrent queue is protected. There is a separate waitForRegs that waits until all registrations are done.

https://github.com/pytorch/ELF/blob/experimental/src_cpp/elf/base/game_context.h#L57 https://github.com/pytorch/ELF/blob/experimental/src_cpp/elf/base/context.h#L468

After that there is no writing to the concurrent queue. So RegServer will never overlap with label2server.

yuandong-tian commented 5 years ago

It looks like unique_ptr is not necessary here so I remove it.

rwightman commented 5 years ago

Alright, makes more sense with that workflow as it wasn't clear without a deep understanding of the code when the label2server calls would start happening.

One other comment with the waitfForRegs ... to make it more explicit and clear and make sure there are no possible races, I'd use the same registermutex for the registration counter and wait condition, and not the separate counter_ instance.

Definitely agree unique_ptr wasn't needed, vector by itself is perfect there.

rwightman commented 5 years ago

Look forward to seeing how this project evolves!