gotec / git2net

An Open Source Python package for the extraction of fine-grained and time-stamped co-editing networks from git repositories.
https://git2net.readthedocs.io
GNU Affero General Public License v3.0

git2net stuck when mining a repository #23

Closed Breee closed 3 years ago

Breee commented 3 years ago

Greetings,

We've faced an issue when mining several repositories: git2net hangs while mining the commits. 433 of 435 commits were processed pretty quickly, but the last 2 remaining ones were not.

Parallel (4 processes): 100%|██████████████████████████████ | 433/435 [11:29<00:03,  1.74s/it]

I've waited for like an hour, still not moving further.

When I cancel the procedure, I get:

Timeout processing commit:  2ed913d95b8a7e1f68788243607e429e67bc602d
Timeout processing commit:  bd3f91e01ac332cbeac14885228589a1d20f4dd6
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 355, in get
    with self._rlock:
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 356, in get
    res = self._reader.recv_bytes()
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "generate_coedit.py", line 224, in <module>
  File "generate_coedit.py", line 177, in process_group
    git2net.mine_git_repo(group_git_repo_dir, sqlite_db_file)
  File "/home/bree/repos/analysis/venv/lib/python3.8/site-packages/git2net/extraction.py", line 1570, in mine_git_repo
    _process_repo_parallel(git_repo_dir, sqlite_db_file, u_commits, extraction_settings)
  File "/home/bree/repos/analysis/venv/lib/python3.8/site-packages/git2net/extraction.py", line 1176, in _process_repo_parallel
    pbar.update(1)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 736, in __exit__
    self.terminate()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 654, in terminate
    self._terminate()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 692, in _terminate_pool
    cls._help_stuff_finish(inqueue, task_handler, len(pool))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 672, in _help_stuff_finish
    inqueue._rlock.acquire()
KeyboardInterrupt

Do you have any idea why that can happen? Do we just have to wait much longer, or can it truly get stuck?

gotec commented 3 years ago

Hi Breee,

Yes, this is something that can happen for many repositories. So far, I have mined almost 500 repositories with git2net. I have not yet found a case where git2net truly got stuck in the sense that it ran into an endless loop. However, I have come across some commits that took unreasonably long.

The good news is that all other commits are already stored in the resulting database. However, the unfinished commits might take a while depending on their content. Most likely, these commits are very large, i.e. they contain many individual modifications or very large individual modifications.

The git2net tutorial (https://github.com/gotec/git2net/blob/master/TUTORIAL.ipynb) provides some pointers on how you can deal with them. First, you can look at the metadata of the remaining commits using git2net.mining_state_summary(git_repo_dir, sqlite_db_file). This also outputs the commit hash of the remaining commits so you can look them up on GitHub.
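As a complement to `mining_state_summary`, the hashes of already-mined commits can also be read straight from the SQLite database. The sketch below is illustrative and assumes git2net stores processed commits in a `commits` table with a `hash` column; check your database schema if the names differ.

```python
import sqlite3


def remaining_commits(all_hashes, sqlite_db_file):
    """Return the hashes from all_hashes not yet in the git2net database.

    Assumes processed commits live in a `commits` table with a `hash`
    column (an assumption about git2net's schema; verify with `.schema`
    in the sqlite3 shell if it differs).
    """
    con = sqlite3.connect(sqlite_db_file)
    try:
        mined = {row[0] for row in con.execute('SELECT hash FROM commits')}
    finally:
        con.close()
    # Preserve the input order of the unprocessed hashes.
    return [h for h in all_hashes if h not in mined]
```

Feeding in the full list of hashes (e.g. from `git rev-list HEAD`) then yields exactly the commits that still need mining.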

In most cases, I found that the commits contained either undetected binary files or were full imports of other projects. Usually, that justified excluding them from my analyses as they do not represent typical development behaviour. However, your use case for git2net might differ from mine :)

If you want to exclude them, you can set the maximum number of modifications that git2net allows per commit (max_modifications). Alternatively, you can skip commits that take longer than a specified time (timeout).
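A minimal sketch of passing these two options to `mine_git_repo`; the threshold values (100 modifications, 300 seconds) are illustrative, not recommendations, and the repository/database paths are placeholders:

```python
# Settings that make git2net skip pathological commits instead of
# mining them; tune the thresholds for your repositories.
SKIP_SETTINGS = {
    'max_modifications': 100,  # skip commits with more than 100 modifications
    'timeout': 300,            # skip commits taking longer than 300 seconds
}


def mine_with_limits(git_repo_dir, sqlite_db_file):
    """Mine a repository while skipping oversized or slow commits."""
    # Imported inside the function so the sketch loads without git2net.
    import git2net
    git2net.mine_git_repo(git_repo_dir, sqlite_db_file, **SKIP_SETTINGS)
```

Skipped commits are not lost: their hashes remain visible via `mining_state_summary`, so they can be inspected or mined later without the limits.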

If you want to mine them, I'm afraid you'll have to wait until they're done. I have already tried to optimise these cases but found that the performance is limited by the runtime of git blame, especially in these cases.

Best, Christoph

Breee commented 3 years ago

Hey Christoph,

Thanks for the detailed information. After waiting overnight, the mining completed successfully!