ishepard / pydriller

Python Framework to analyse Git repositories
http://pydriller.readthedocs.io/en/latest/
Apache License 2.0

Submodule commit causes an error #260

Open Creadeyh opened 1 year ago

Creadeyh commented 1 year ago

Describe the bug I'm analyzing the GitHub repo avatarify, and commits that contain submodule commits, such as this one, cause an exception to be raised: ValueError: SHA b'72a32a67dee3a67dff76f565551907a2fc7e88e6' could not be resolved, git returned: b'72a32a67dee3a67dff76f565551907a2fc7e88e6missing' The hash in the error is that of the submodule commit.

To Reproduce I've run into this issue in two situations while working with avatarify:

When I use commits = pydriller.Repository(...).traverse_commits() and retrieve any of dmm_unit_size/dmm_unit_complexity/dmm_unit_interfacing:

for commit in commits:
    dmm_unit_size = commit.dmm_unit_size
    dmm_unit_complexity = commit.dmm_unit_complexity
    dmm_unit_interfacing = commit.dmm_unit_interfacing

This is straightforward to patch on my side, as I can wrap these metrics in a try/except and replace them with None when they fail on a commit. However, the second case would require a change that is out of my reach.
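The try/except workaround mentioned here could look roughly like the following. It is only a minimal sketch: `safe_metric` and `dmm_metrics` are hypothetical helper names, not part of PyDriller, and the code assumes the failure surfaces as the `ValueError` quoted in the report.

```python
from typing import Any, Dict, Optional

DMM_METRICS = ("dmm_unit_size", "dmm_unit_complexity", "dmm_unit_interfacing")


def safe_metric(commit: Any, name: str) -> Optional[float]:
    """Return the named metric, or None if reading it raises ValueError."""
    try:
        return getattr(commit, name)
    except ValueError:
        # Assumption: an unresolved submodule SHA surfaces as ValueError,
        # as in the error message quoted in this issue.
        return None


def dmm_metrics(commit: Any) -> Dict[str, Optional[float]]:
    """Collect the three DMM metrics for one commit, tolerating failures."""
    return {name: safe_metric(commit, name) for name in DMM_METRICS}
```

Inside the loop over `traverse_commits()`, each commit would then be passed to `dmm_metrics(commit)` instead of reading the three properties directly.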

When I call the constructor of pydriller.metrics.process.code_churn.CodeChurn

Unless I steer around the problematic commits with CodeChurn's from_commit/to_commit, I simply cannot compute the repo's churn.
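One way to steer around the problematic commits is to split the history into runs that exclude them. The sketch below does this; `ranges_avoiding` is a hypothetical helper, and it assumes the hashes arrive in a linear order such as the one `traverse_commits()` yields.

```python
from typing import Iterable, List, Set, Tuple


def ranges_avoiding(hashes: Iterable[str], bad: Set[str]) -> List[Tuple[str, str]]:
    """Split ordered commit hashes into (first, last) runs that skip `bad`."""
    ranges: List[Tuple[str, str]] = []
    run: List[str] = []
    for commit_hash in hashes:
        if commit_hash in bad:
            # Close the current run just before the problematic commit.
            if run:
                ranges.append((run[0], run[-1]))
                run = []
        else:
            run.append(commit_hash)
    if run:
        ranges.append((run[0], run[-1]))
    return ranges
```

Each `(first, last)` pair could then be passed as `from_commit`/`to_commit` to `CodeChurn`. How to merge the per-range results into a single churn figure is left open here, and the changes made by the skipped commits themselves are lost.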

OS Version: Windows

Creadeyh commented 1 year ago

Same error when calling pydriller.Commit.modified_files

ishepard commented 1 year ago

Hi! The commit you are referring to is in a submodule. To analyze those you need to clone the submodules as well; otherwise Git complains that the commit doesn't exist.

As a test, try to run:

git show 72a32a67dee3a67dff76f565551907a2fc7e88e6

in your terminal. You'll see Git returns an error. After you init the submodules that should go away.

Creadeyh commented 1 year ago

I understand that. The issue is that they removed the submodules, so the .gitmodules is empty and init does nothing.

I tried to work around it by retrieving the history of .gitmodules with Git.get_commits_modified_file(), checking out the commits where .gitmodules was still populated, and running submodule init and update from there. However, I still can't access that commit with git show from the root folder; it only works if I navigate inside the submodule folder.

And when I call CodeChurn or a DMM metric, it still fails because Pydriller stays in the root folder.
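For context, Git records a submodule in the parent repository's tree as an entry with mode 160000 and type `commit`, whose SHA belongs to the submodule's own repository. The sketch below picks such entries out of `git ls-tree -r <commit>` output so the affected commits can be identified up front. `find_gitlinks` is a hypothetical helper, and it assumes the default output format without `-l` or `-z`.

```python
from typing import List, Tuple

GITLINK_MODE = "160000"  # mode Git uses for submodule entries in a tree


def find_gitlinks(ls_tree_output: str) -> List[Tuple[str, str]]:
    """Return (path, sha) for each submodule entry in `git ls-tree -r` output."""
    links: List[Tuple[str, str]] = []
    for line in ls_tree_output.splitlines():
        if not line.strip():
            continue
        # Each line is "<mode> <type> <sha>\t<path>".
        meta, _, path = line.partition("\t")
        mode, obj_type, sha = meta.split()
        if mode == GITLINK_MODE and obj_type == "commit":
            links.append((path, sha))
    return links
```

A SHA found this way is not an object in the parent repository, which is consistent with `git show` failing from the root folder while working inside the submodule folder.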

Creadeyh commented 1 year ago

@ishepard Here is the test script I put together if you want to try it out yourself. I'm running Python 3.8 and Pydriller 2.4.1

import subprocess
import tempfile
import os
from typing import List
from pydriller import Repository, Git

tmp_dir = tempfile.mkdtemp()
repo_dir = os.path.join(tmp_dir, "avatarify-python")
process = subprocess.run(["git", "clone", "https://github.com/alievk/avatarify-python"],
                             stdout=subprocess.PIPE,
                             cwd=tmp_dir)
process = subprocess.run(["git", "checkout", "master"],
                             stdout=subprocess.PIPE,
                             cwd=repo_dir)

git: Git = Git(repo_dir)
gitmodules_hist: List[str] = git.get_commits_modified_file(os.path.join(repo_dir, ".gitmodules"), include_deleted_files=True)
for commit_hash in gitmodules_hist:  # avoid shadowing the built-in hash()
    git.checkout(commit_hash)
    if os.path.exists(os.path.join(repo_dir, ".gitmodules")):
        print("SUBMODULE UPDATE")
        process = subprocess.run(["git", "submodule", "init"],
                                    stdout=subprocess.PIPE,
                                    cwd=repo_dir)
        process = subprocess.run(["git", "submodule", "update"],
                                    stdout=subprocess.PIPE,
                                    cwd=repo_dir)

git_commits = Repository(repo_dir, only_no_merge=True).traverse_commits()
commits = []
for git_commit in git_commits:

    if git_commit.hash == "80226c1717402f7372a9f82b098619b3836b8bc0":
        print("FOUND BEFORE SUBMODULE 1")
        # Fails here because 80226c references 72a32a
        print(git_commit.dmm_unit_size)
    elif git_commit.hash == "72a32a67dee3a67dff76f565551907a2fc7e88e6":
        print("FOUND SUBMODULE 1")
    elif git_commit.hash == "a5aabda05cc0d0da1e21f21a138e2e5dec01afa0":
        print("FOUND BEFORE SUBMODULE 2")
        # Fails here because a5aabd references 6c1fbf
        print(git_commit.dmm_unit_size)
    elif git_commit.hash == "6c1fbf39690130e2303bcecd3c6126c71cfacf85":
        print("FOUND SUBMODULE 2")