Fix mkdir race condition in LooseObjectDB.store

EliahKagan commented 10 months ago

Fixes #85

This replaces the conditional call to os.mkdir that raises an unintended FileExistsError if the directory is created between the check and the os.mkdir call, using a single os.makedirs call instead, with exist_ok=True.

This way, we attempt creation in a way that produces no error if the directory is already present, while still raising FileExistsError if a non-directory filesystem entry (such as a regular file) is present where we want the directory to be. This is the advantage of this approach over the approach of swallowing FileExistError as suggested in #85.

Note, however, that os.makedirs behaves like mkdir -p: it attempts to create parent directories (and their parents, etc.) if they do not already exist. So it should only be used if that is acceptable in this case. I am not aware of a reason it wouldn't be, but I am not very familiar with gitdb.

So that aspect of the situation deserves special consideration in reviewing this PR. I'd be pleased to change the approach if os.makdirs is judged not suitable here. I think the approach suggested in #85 is reasonable, and it can be made more robust by checking that the directory exists after the creation attempt (or in other ways).

The code was under test: that line is exercised in TestExamples.test_base, TestGitDB.test_writing, TestLooseDB.test_basics, and TestObjDBPerformance.test_large_data_streaming. However, no test catches the race condition this fixes, and I have not added one.

Testing that the race condition does not occur in the specific way as before by accessing and calling the same functions as before in the same order would be easy, but it would be more of an illusion of a regression test than a useful test. Testing by trying to brute-force a race condition, without modifying the operation of the code for the test, would work but the tests would take a very long time to run. Testing it in a way that is fairly robust against new ways of reintroducing the race condition and that is not too slow should be possible, but I don't know of a good way to do it; everything I've thought of would be complicated, and possibly make running the test in a debugger like pdb infeasible. So I have not added a regression test for this bug. However, if it is considered important to have one, then I can consider the matter further.

EliahKagan commented 10 months ago

Thanks--if I contemplate significant changes to, or related to, gitdb, I'll being by looking into whether I can remove GitPython's dependence on it or vendor the specific parts that GitPython needs.

In this case, I just noticed that I was aware of a way to fix the race condition that might be acceptable. so I figured I'd open a PR. Although there are some other changes I may want to propose to the CI workflow in this repository, I'm not sure I really know enough to fix the most important gitdb-related issues, in GitPython or in gitdb itself. As my familiarity with the GitPython codebase increases, that might change.

Byron commented 10 months ago

Although there are some other changes I may want to propose to the CI workflow in this repository, I'm not sure I really know enough to fix the most important gitdb-related issues, in GitPython or in gitdb itself. As my familiarity with the GitPython codebase increases, that might change.

I would absolutely love if I could direct your incredible skill and energy away from the GitDB CI and towards making it unnecessary entirely :). It's probably more of a refactoring task, albeit a complex one.

As my familiarity with the GitPython codebase increases, that might change.

It's a great gift to have you contribute to this project, and I love the idea to have more of that. The project, to my mind, has significant issues with quality and some limitations stem from these. The greatest problems come from incorrect handling of encodings, both for paths and for data, along with many naive implementations of some git data-structures (see Index). With over half a billion downloads per year GitPython is more influential than I feel comfortable with given its current state, and your work will be majorly impactful. I also hope your contribution experience is rewarding and pleasant so you can continue this awesome work.

EliahKagan commented 10 months ago

Thanks! Based on this, the only further gitdb CI pull request I will open now is one to re-add Python 3.7 support, as you indicated should be done in a few of the comments in https://github.com/gitpython-developers/GitPython/pull/1654. (That will include CI, but also changing the lower bound back to 3.7 in setup.py. It's very simple, because it's just a revert of one commit.)

gitpython-developers / gitdb

Fix mkdir race condition in LooseObjectDB.store #91