gotec / git2net

An Open Source Python package for the extraction of fine-grained and time-stamped co-editing networks from git repositories.
https://git2net.readthedocs.io
GNU Affero General Public License v3.0
53 stars 16 forks source link

KeyboardInterrupt when using Timeout #32

Closed gotec closed 1 year ago

gotec commented 2 years ago

Describe the bug KeyboardInterrupt error is output while the timout option is used.

Expected behavior A message Timeout processing commit: <hash> is output but no KeyboardInterrupt is printed.

wschuell commented 1 year ago

I had a quick look, it seems that it is happening only when the process is interrupted while being in a call to the Git lib, which spawns a subprocess to run the git command (as opposed to using pygit2 which relies directly on the libgit2 C lib, if I understood well). What is written to stderr is inherited from subprocess.Popen (Exception ignored in: <function Git.__del__ at 0x...>).

Track 1 for solution: It is possible through the args of git.cmd.Git.execute() method to pass args to subprocess.Popen (see https://gitpython.readthedocs.io/en/stable/reference.html#git.cmd.Git.execute ), and in particular how to redirect stderr (potentially to devnull), but you do not want to silence any error either. Maybe piping it into an 'intermediate stderr' (repiped into the main) silenced if the stream starts with 'Exception ignored'?

Track 2 for solution: There is a kill_after_timeout arg in git.cmd.Git.execute as well, but they indicate it might harm the state of the repo, and you would have to track the timeout value for each call, using e.g. time.time() compared to the start of the global process and the initial timeout value (as opposed to the alarm where it s just one sleep), and let a bit of room for the timeout processed to go through before the alarm triggers the global timeout behavior.

TBN: You may want to print the KeyboardInterrupt and not a time out message if it is an actual KeyboardInterrupt, which happens in parallel mode but in serial mode it would be treated as the end of the alarm (I did not check). This would be another issue but it is related.

NB: For easy reproduction of the issue (with no_of_processes so that one can test easily also the serial case):

import git2net
import os
import shutil

github_url = 'gotec/git2net'
git_repo_dir = 'git2net_test'
sqlite_db_file = 'git2net.db'

git2net.mine_github(github_url, git_repo_dir, sqlite_db_file,no_of_processes=2,timeout=0.001)
gotec commented 1 year ago

Hi William

Thanks a lot for reminding me of this!

Redirecting the stderr via an argument of git.cmd.Git.execute() to subprocess.Popen() is unfortunately not possible, as it is automatically set by git.cmd.Git.execute() and cannot be overwritten.

After some exploration, I found a way to achieve the same using contextlib.redirect_stderr(). While doing so, I also replaced the previous timeout implementation using the ThreadingTimeout context manager from stopit. As you also suggested, I then parsed the redirected stderr manually, raising it as an exception if it is not an ignored exception.

The current solution has many benefits over the old code, including:

  1. Getting rid of all the try-except statements.
  2. Eliminating the "ignored exception in" messages.
  3. Allowing a KeyboardInterrupt to be differentiated from a TimeoutException.

I have committed these changes with 9c88c8c211a98641a88248a6fb6abd7f92ac04ff, so feel free to try them out! From my understanding, they should work on all platforms (i.e., retaining Windows compatibility (which was the reason why I switched from signals to threads in 8a5d405bc8263b6f482f9e6c38534023de8956e5), however I still have to validate this.

Cheers Christoph

wschuell commented 1 year ago

Cool solution, thanks! I tested with the example above, I confirm that I get the expected behavior (for all timeouts and for the KeyboardInterrupt), under Linux.