gitpython-developers / GitPython

GitPython is a python library used to interact with Git repositories.
http://gitpython.readthedocs.org
BSD 3-Clause "New" or "Revised" License
4.6k stars 905 forks source link

Unhandled IndexError when calling .read() on a malformed config file #1887

Closed DaveLak closed 5 months ago

DaveLak commented 6 months ago

Hi, I noticed the fuzzing tests that OSS-Fuzz runs on this project are broken and while I was working on fixing them I believe I came across a minor bug:

The Bug

An Uncaught Python exception: IndexError: string index out of range can be triggered if when trying to call .read() on GitConfigParser if it was initialized with a malformed config file.

Current Behavior

It's easiest to demonstrate, so please consider this example:

import io
from git.config import GitConfigParser

def reproduce_issue():

    malformed_config_content_bytestring = b'[-]\nk:"v\n"'

    problematic_config_file = io.BytesIO(malformed_config_content_bytestring)

    # problematic_config_file looks now like this:
    """
    [-]
    k:"v
    "
    """

    # We have to name the file otherwise we'll trigger
    # `AttributeError: '_io.BytesIO' object has no attribute 'name'` here:
    # https://github.com/gitpython-developers/GitPython/blob/c7675d2cedcd737f20359a4a786e213510452413/git/config.py#L623
    problematic_config_file.name = "fuzzedconfig.config"

    # This is fine
    git_config = GitConfigParser(problematic_config_file)

    # The next line raised an unhandled `IndexError: string index out of range`
    git_config.read()

if __name__ == "__main__":
    reproduce_issue()

Assuming that code is in /path/to/example/config_indexerror_reproduction.py then

python config_indexerror_reproduction.py

produces something akin to:

Traceback (most recent call last):
  File "/path/to/example/config_indexerror_reproduction.py", line 30, in <module>
    reproduce_issue()
  File "/path/to/example/config_indexerror_reproduction.py", line 26, in reproduce_issue
    git_config.read()
  File "/path/to/example/.venv/lib/python3.12/site-packages/git/config.py", line 607, in read
    self._read(file_path, file_path.name)
  File "/path/to/example/.venv/lib/python3.12/site-packages/git/config.py", line 514, in _read
    cursect.setlast(optname, optval + string_decode(line))
                                      ^^^^^^^^^^^^^^^^^^^
  File "/path/to/example/.venv/lib/python3.12/site-packages/git/config.py", line 441, in string_decode
    if v[-1] == "\\":
       ~^^^^
IndexError: string index out of range
My Reproduction Environment Details The reproduction code above was tested on: ```sh Python Version: 3.12.1 (main, Feb 5 2024, 16:23:00) [Clang 15.0.0 (clang-1500.1.0.2.5)] OS Information: macOS-14.4.1-x86_64-i386-64bit Installed Packages: Package Version --------- ------- gitdb 4.0.11 GitPython 3.1.42 pip 24.0 smmap 5.0.1 ``` And the fuzzer environment was: ``` Python Version: 3.8.3 (default, Mar 17 2024, 03:21:27) [Clang 15.0.0 (https://github.com/llvm/llvm-project.git bf7f8d6fa6f460bf0a16ffe OS Information: Linux-6.6.16-linuxkit-x86_64-with-glibc2.2.5 Installed Packages: Package Version ------------------------- ------- altgraph 0.17.4 atheris 2.3.0 coverage 6.3.2 importlib_metadata 7.0.2 gitdb 4.0.11 GitPython 3.1.42 pip 24.0 smmap 5.0.1 packaging 24.0 pyinstaller 5.0.1 setuptools 41.0.1 six 1.15.0 zipp 3.18.1 ```

So, if I'm reading the source correct, it seems like the combination of some header section ([-] above) followed by a key/value assignment that has a value consisting of a double quoted string with a new line inside it confuses the check here which strips the " on the new line: https://github.com/gitpython-developers/GitPython/blob/c7675d2cedcd737f20359a4a786e213510452413/git/config.py#L521-L528

and passes an empty string to string_decode which isn't expecting that when it indexes into it's arg: https://github.com/gitpython-developers/GitPython/blob/c7675d2cedcd737f20359a4a786e213510452413/git/config.py#L454-L455

Expected Behavior

I'd expect an explicitly raised ParsingError similar to how it's handled a little further up: https://github.com/gitpython-developers/GitPython/blob/c7675d2cedcd737f20359a4a786e213510452413/git/config.py#L514-L518

DaveLak commented 6 months ago

On a related note, do the maintainers of this project have any interest in bringing the fuzzing test code into this repo? I searched the repository for prior discussions about it but didn't come across anything.

I'm happy to:

Byron commented 6 months ago

Thanks a lot for this wonderfully complete bug report, it's much appreciated.

Personally, I think the config-parser implementation here is probably very broken and non-conforming, and the fuzzer will have a very good time with it. A problem I see is that reproductions are usually hidden behind a login, so it requires extra work to make these available here so the community has a chance to fix them. However, I am open to give it a shot and see how it goes.

I definitely welcome you to help with maintaining the fuzzer integration, and would appreciate if that would also entail the maintenance of the issues it finds such that all information for reproduction would be contained within them so it's not only maintainers who can fix them.

DaveLak commented 6 months ago

Thanks for the prompt reply 🙂

Silly me for not realizing earlier, but I see gitoxide has been integrated for a while with OSS-Fuzz issues mirrored in GH issues and some good results. Given that context, and this project's maintenance mode status I definitely hear where your coming from.

I think it probably is worth opening up a discussion thread to continue the fuzzing convo for the sake of keeping this issue on topic as I do have some follow-up questions for you related to what you think a fix for this might look like (if any). I went ahead and replied to you here: https://github.com/gitpython-developers/GitPython/discussions/1889#discussion-6444106

DaveLak commented 5 months ago

I started to take a closer look at this and now have a better idea of what's going on here as well as a proof-of-concept fix, but I wanted to get your thoughts on this @Byron, before I put it into PR. CC @EliahKagan in case you have thoughts here as well.

Initially, my instinct was to refer to the git-scm.com docs on Git config syntax and use that to inform the fix here, but (ignoring the fact that there isn't much guidance on value formats relevant to this issue) this portion of @Byron's comment above caught my attention and made re-think things a bit:

I think the config-parser implementation here is probably very broken and non-conforming

Specifically, the "non-conforming" part caught my eye. If GitConfigParser's behavior is already known (and to some degree expected) to be non-conforming to canonical Git's behavior, then I don't think I agree with my initial assessment of the "Expected Behavior" that I described in the PR description (i.e., this should throw an exception) regardless of how canonical Git would handle it. Introducing a "fix" that throws an exception seems unnecessarily disruptive to users that may have otherwise working code that was written around this "non-conforming implementation". Preserving the existing behavior and just making it safer around exception cases seems like a better approach.

So instead, I think a more considerate and generally simpler fix would be to maintain the existing parsing behavior, and just make its checks a little more robust to eliminate the risk of the unhandled IndexError described in this issue.

The diff below shows the change I am considering which would:

Here's the diff:

diff --git a/git/config.py b/git/config.py
index 3ce9b123..91b69531 100644
--- a/git/config.py
+++ b/git/config.py
@@ -452,7 +452,7 @@ class GitConfigParser(cp.RawConfigParser, metaclass=MetaParserBuilder):
         e = None  # None, or an exception.

         def string_decode(v: str) -> str:
-            if v[-1] == "\\":
+            if v and v[-1] == "\\":
                 v = v[:-1]
             # END cut trailing escapes to prevent decode error

@@ -508,6 +508,8 @@ class GitConfigParser(cp.RawConfigParser, metaclass=MetaParserBuilder):
                     if len(optval) > 1 and optval[0] == '"' and optval[-1] != '"':
                         is_multi_line = True
                         optval = string_decode(optval[1:])
+                    elif len(optval) > 1 and optval[0] == '"' and optval[-1] == '"':
+                        optval = string_decode(optval[1:-1])  # safely decode fully quoted value
                     # END handle multi-line
                     # Preserves multiple values for duplicate optnames.
                     cursect.add(optname, optval)
@@ -524,8 +526,10 @@ class GitConfigParser(cp.RawConfigParser, metaclass=MetaParserBuilder):
                     is_multi_line = False
                     line = line[:-1]
                 # END handle quotations
+                if line:  # Ensure line is not empty before decoding
+                    line = string_decode(line)
                 optval = cursect.getlast(optname)
-                cursect.setlast(optname, optval + string_decode(line))
+                cursect.setlast(optname, optval + line)
             # END parse section or option
         # END while reading

What do you think? Does this sound like a reasonable approach here, or is there some context I might be missing?

If this does sound like something worth introducing, let me know and:

  1. I'll add a Unit test.
  2. Remove the workaround for this error in the fuzz_config.py test so it's covered there as well.
  3. Open a PR.
Byron commented 5 months ago

So instead, I think a more considerate and generally simpler fix would be to maintain the existing parsing behavior, and just make its checks a little more robust to eliminate the risk of the unhandled IndexError described in this issue.

The above hits home, and perfectly describes how I imagine handling this. It's an improvement, without going overboard and try to fix something that is systematically broken - a conforming parser can't be made based on an INI parser.

If this does sound like something worth introducing, let me know and:

  1. I'll add a Unit test.
  2. Remove the workaround for this error in the fuzz_config.py test so it's covered there as well.
  3. Open a PR.

This sounds like a plan! Thanks again.

DaveLak commented 5 months ago

PR with the fix: #1908

Only the changes to string_decode were necessary, so that's all I changed.