tbenthompson / cppimport

Import C++ files directly from Python!
MIT License
1.18k stars 67 forks source link

Race condition when multiple processes try to compile a module at once #67

Closed joshlk closed 2 years ago

joshlk commented 2 years ago

Hi,

Great package by the way!

I've encountered an issue when multiple processes are spawned that all race to compile the same module. This can also occur when multiple processes are spawned on different hosts and share the same network filesystem. Such a situation is common when distributing work between multiple processes or hosts for AI or data analytics.

Here is a demonstration (in the shell):

echo '// cppimport
#include <pybind11/pybind11.h>

namespace py = pybind11;

int square(int x) {
    return x * x;
}

PYBIND11_MODULE(somecode, m) {
    m.def("square", &square);
}
/*
<%
setup_pybind11(cfg)
%>
*/' > somecode.cpp

echo 'import cppimport.import_hook
import somecode
somecode.square(9)' > test.py

rm somecode.cpython-*

for i in {1..100}; do python3 test.py & done

On my system around 4 out of 100 processes exit in an error. The shell output includes:

error: could not delete '/localdata/joshl/sandbox/somecode.cpython-36m-x86_64-linux-gnu.so': No such file or directory
...
Exit 1                  python3 test.py
...
Bus error               (core dumped) python3 test.py

These errors don't appear when the binary already exists.


To mitigate this issue in our applications we have used a file lock so that only one process attempts to compile the module at one time. A process first checks if the binary file exists, otherwise attempts to obtain the file lock. If it can't obtain the lock it waits until either the binary exists, can obtain the file lock or times out. Here is an example how it can be done (app code):

from cppimport.checksum import is_checksum_valid

binary_path = module_data['ext_path']
lock_path = binary_path + '.lock'

t = time()

while not (os.path.exists(binary_path) and is_checksum_valid(module_data)) and time() - t < timeout:
    try:
        with FileLock(lock_path, timeout=1):
            if os.path.exists(binary_path) and is_checksum_valid(module_data_new_path):
                break
            # BUILD BINARY
            template_and_build(filepath, module_data)
    except Timeout:
        logging.debug(f'{os.getpid()}: Could not obtain lock')
        sleep(1)

if not (os.path.exists(binary_path) and is_checksum_valid(module_data_new_path)):
    raise Exception(
        f'Could not compile binary as lock already taken and timed out. Lock file will be deleted: {lock_path}')

if os.path.exists(lock_path):
    with suppress(OSError):
        os.remove(lock_path)

It would be great if we could upstream the above to cppimport to prevent the race condition errors. If you are happy with this solution I could contribute the above to the appropriate place in cppimport.

tbenthompson commented 2 years ago

Thanks for the interesting issue!!

Overall, I really like this and would like to implement something like this since it seems clearly useful.

The only thing that makes me hesitate is that it seems like a problem that is inherent to build systems - if you run the same build with the same build system concurrently, you are bound to run into a problem. So, why should cppimport fix a problem that essentially comes from the underlying tools? My response to this question would be: cppimport is attempting to make c++ files behave a little bit more like Python modules, while accepting that the correspondence is very leaky. So, any step to make that correspondence a little less leaky is a good thing.

Would you be interested in putting together a pull request for this? I'm generally very light touch on code reviews since I want to encourage contributions. So, it won't be an arduous process! =)

tbenthompson commented 2 years ago

Hi @joshlk, just wanted to check in if this is something you'd be interested in making a PR for! Are you still using your file lock solution? Any issues with that approach crop up over the last couple months?

joshlk commented 2 years ago

Yes it is. Im just trying to get approval from my work but it should be sorted soon

tbenthompson commented 2 years ago

Awesome!

joshlk commented 2 years ago

Sorry it's taken so long - I had to jump through a bunch of internal hoops. Here is a PR: #71