paulgazz / kmax

A collection of analysis tools for Kconfig and Kbuild constraints.

SuperC-less tokenizer fails to separate strings with a large number of special characters correctly #281

Open lolrepeatlol opened 4 weeks ago

lolrepeatlol commented 4 weeks ago

Summary

When a string literal contains many special characters (including an escaped double quote), the SuperC-less syntax analysis splits it into several tokens, the in_quotes tracking in analyze_c_tokens() gets out of sync, and a later #endif is classified as C code, which causes get_conditional_blocks() to fail its assert len(stack) == 1.

Steps to reproduce

Steps followed

To get a repaired configuration file for a commit range, I followed these steps:

  1. Clone the Linux kernel with git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git and enter the directory.
  2. Create a diff from a range of patches using git diff {commit1}..{commit2} > patchset.diff. I initially encountered this issue with the range 52afb15e9d9a021ab6eec923a087ec9f518cb713 to 0253d718a070ba109046299847fe8f3cf7568c3c.
  3. Check out the source code at the latest of the two commits and create a kernel configuration file with a command like make defconfig.
  4. Run klocalizer with klocalizer --repair .config -a x86_64 --include-mutex patchset.diff --verbose.
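
For convenience, here are the commands from the steps above condensed into a single shell sketch (commit hashes as in step 2; this assumes klocalizer from this repository is already installed in the active environment):

    # Condensed reproduction; assumes klocalizer is on PATH in the active environment.
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    cd linux
    git diff 52afb15e9d9a021ab6eec923a087ec9f518cb713..0253d718a070ba109046299847fe8f3cf7568c3c > patchset.diff
    # check out the newer of the two commits (assumed here to be the second hash)
    git checkout 0253d718a070ba109046299847fe8f3cf7568c3c
    make defconfig
    klocalizer --repair .config -a x86_64 --include-mutex patchset.diff --verbose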

What I expected to happen

I expected klocalizer to repair the kernel configuration file.

What actually happened

klocalizer runs into an AssertionError:
    DEBUG: Doing syntax analysis on "drivers/gpu/drm/nouveau/nvif/object.c" to get constrained line ranges.
    DEBUG: Syntax analysis for "drivers/gpu/drm/nouveau/nvif/object.c" found 22 unconstrained lines, 0 lines are remaining for presence condition analysis.
    DEBUG: Doing syntax analysis on "drivers/gpu/drm/nouveau/nvkm/engine/disp/r535.c" to get constrained line ranges.
    DEBUG: Syntax analysis for "drivers/gpu/drm/nouveau/nvkm/engine/disp/r535.c" found 2 unconstrained lines, 0 lines are remaining for presence condition analysis.
    DEBUG: Doing syntax analysis on "drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c" to get constrained line ranges.
    DEBUG: Syntax analysis for "drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c" found 405 unconstrained lines, 0 lines are remaining for presence condition analysis.
    DEBUG: Doing syntax analysis on "drivers/gpu/drm/omapdrm/omap_dmm_tiler.c" to get constrained line ranges.
    Traceback (most recent call last):
      File "/home/alexei/IDEProjects/PyCharmProjects/kmax/venv/bin/klocalizer", line 7, in <module>
        exec(compile(f.read(), __file__, 'exec'))
      File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/klocalizer", line 1740, in <module>
        klocalizerCLI()
      File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/klocalizer", line 815, in klocalizerCLI
        root_cb = SyntaxAnalysis.get_conditional_blocks_of_file(srcfile_fullpath)
      File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/superc.py", line 807, in get_conditional_blocks_of_file
        cb = SyntaxAnalysis.get_conditional_blocks(content, line_count)
      File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/superc.py", line 782, in get_conditional_blocks
        assert len(stack) == 1
    AssertionError
    (venv) alexei@turing:~/LinuxKernels/kmax_stress_testing/linux_rand500$

Additional information

While debugging, I added debug print statements in analyze_c_tokens() and focused on when in_quotes changes. The instrumented code is shown below:

    def analyze_c_tokens(tokens_w_line_nums):
        """
        TODO: document
        determines what kind of code each token is (c, comment, or preprocessor).
        returns a map between line number and a list of (token mapped to type of code)
        """
        analyzed_tokens = {}

        in_quotes = False
        in_single_line_comment = False
        in_preprocessor = False
        prev_line_num = 0
        in_comment = False
        continued_preprocessor = False

        for token, line_num in tokens_w_line_nums:
            if len(token) < 1:
                continue

            print(f"\nDEBUG processing tkn '{token}', line {line_num}")
            print(f"  in_quotes current state: {in_quotes}")

            if line_num == prev_line_num:
                pass
            else:
                print(f"\nDEBUG state @ line {line_num}:")
                print(f"  prev preprocessor state: {in_preprocessor}")
                print(f"  contd preprocessor: {continued_preprocessor}")

                if not continued_preprocessor:
                    in_preprocessor = False
                    print(f"  resetting preprocessor state to {in_preprocessor}")
                if in_single_line_comment:
                    in_comment = False
                    in_single_line_comment = False
                analyzed_tokens[line_num] = []

            if token[0] == '#':
                print(f"DEBUG # check @ line {line_num}:")
                print(f"  in_comment: {in_comment}")
                print(f"  in_quotes: {in_quotes}")
                print(f"  token[0] == '#': {token[0] == '#'}")

            # preprocessor check
            if (not in_comment) and (not in_quotes) and token[0] == '#':
                print(f"\nDEBUG found # @ line {line_num}:")
                print(f"  prev state: {in_preprocessor}")
                in_preprocessor = True
                print(f"  new state: {in_preprocessor}")

            print(f"  before quote check, in_quotes: {in_quotes}")

            if (not in_preprocessor) and (not in_comment) and ("\"" in token):
                print(f"DEBUG quote found in token '{token}' @ line {line_num}")
                print(f"  current in_quotes: {in_quotes}")
                in_quotes = not in_quotes
                print(f"  new in_quotes: {in_quotes}")

            print(f"  after quote check, in_quotes: {in_quotes}")

            if (not in_quotes) and (not in_comment) and ("//" in token):
                in_single_line_comment = True
                in_comment = True

            if (not in_quotes) and (not in_comment) and ("/*" in token):
                in_comment = True

            # add the token with code type
            current_type = "preprocessor" if in_preprocessor else ("comment" if in_comment else "c")
            print(f"  token: {token}, type: {current_type}")

            if in_comment:
                analyzed_tokens[line_num].append({token: "comment"})
            elif in_preprocessor:
                # handle case where no space between directive and parenthesis
                found_directive = False  # track if directive found
                for directive in ['if', 'ifdef', 'ifndef', 'elif', 'else', 'endif']:
                    if token.startswith(directive + '('):
                        directive_token = directive
                        remaining_token = token[len(directive):]  # capture remaining part by slicing at length of directive
                        analyzed_tokens[line_num].append({directive_token: "preprocessor"})  # add directive
                        if remaining_token:
                            analyzed_tokens[line_num].append({remaining_token: "preprocessor"})  # add remaining token
                            found_directive = True
                        break

                # if no directive found: just add token as whole
                if not found_directive:
                    analyzed_tokens[line_num].append({token: "preprocessor"})
            else:
                analyzed_tokens[line_num].append({token: "c"})

            if in_comment and ("*/" in token):
                in_comment = False

            if token == '\\':
                continued_preprocessor = True
                print(f" continuation found! setting continued_preprocessor: {continued_preprocessor}")
            else:
                continued_preprocessor = False

            print(f"DEBUG loop end. in_quotes: {in_quotes}")
            prev_line_num = line_num
        return analyzed_tokens

For context, I had inserted other debug print statements elsewhere, especially in get_conditional_blocks(), as I had initially believed the preprocessor conditional wasn't being closed. However, this was not the case, so I investigated sister functions like analyze_c_tokens(). The actual problem stems from a string filled with many special characters within drivers/gpu/drm/omapdrm/omap_dmm_tiler.c:

...
/*
 * debugfs support
 */

#ifdef CONFIG_DEBUG_FS //line 991

static const char *alphabet = "abcdefghijklmnopqrstuvwxyz"
                "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
static const char *special = ".,:;'\"`~!^-+"; //this is the problem

static void fill_map(char **map, int xdiv, int ydiv, struct tcm_area *a,
                            char c, bool ovw)
...
    }

error:
    kfree(map);
    kfree(global_map);

    return 0;
}
#endif  //associated endif at line 1159

The debug output is available below. In particular, note how the contents of the *alphabet string literals come through as whole tokens, while the *special string is split into several pieces, which confuses analyze_c_tokens(). Finally, note how the #endif directive on line 1159 is still treated as being in_quotes and is classified as C code.

DEBUG processing tkn 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', line 994  # the tokenizer properly accounts for a FULL string
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789, type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn '"', line 994
  in_quotes current state: True
  before quote check, in_quotes: True
DEBUG quote found in token '"' @ line 994
  current in_quotes: True
  new in_quotes: False
  after quote check, in_quotes: False
  token: ", type: c
DEBUG loop end. in_quotes: False

... [continued] ...

DEBUG processing tkn '=', line 995
  in_quotes current state: False
  before quote check, in_quotes: False
  after quote check, in_quotes: False
  token: =, type: c
DEBUG loop end. in_quotes: False

DEBUG processing tkn '"', line 995
  in_quotes current state: False
  before quote check, in_quotes: False
DEBUG quote found in token '"' @ line 995
  current in_quotes: False
  new in_quotes: True
  after quote check, in_quotes: True
  token: ", type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn '.,:;', line 995  # one part of the string
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: .,:;, type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn ''', line 995
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: ', type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn '\', line 995
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: \, type: c
 continuation found! setting continued_preprocessor: True  # another issue, likely caused by this bug?
DEBUG loop end. in_quotes: True

DEBUG processing tkn '"', line 995
  in_quotes current state: True
  before quote check, in_quotes: True
DEBUG quote found in token '"' @ line 995
  current in_quotes: True
  new in_quotes: False
  after quote check, in_quotes: False
  token: ", type: c
DEBUG loop end. in_quotes: False

DEBUG processing tkn '`~!^-+', line 995  # another part of the string
  in_quotes current state: False
  before quote check, in_quotes: False
  after quote check, in_quotes: False
  token: `~!^-+, type: c
DEBUG loop end. in_quotes: False

DEBUG processing tkn '"', line 995
  in_quotes current state: False
  before quote check, in_quotes: False
DEBUG quote found in token '"' @ line 995
  current in_quotes: False
  new in_quotes: True
  after quote check, in_quotes: True
  token: ", type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn ';', line 995
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: ;, type: c
DEBUG loop end. in_quotes: True   # NOTICE: still believed to be "in_quotes"

... [continued] ...

DEBUG processing tkn '#', line 1159
  in_quotes current state: True

DEBUG state @ line 1159:
  prev preprocessor state: False
  contd preprocessor: False
  resetting preprocessor state to False
DEBUG # check @ line 1159:
  in_comment: False
  in_quotes: True
  token[0] == '#': True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: #, type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn 'endif', line 1159
  in_quotes current state: True
  before quote check, in_quotes: True  # should NOT be "in_quotes"
  after quote check, in_quotes: True
  token: endif, type: c  # should NOT be "C code"
DEBUG loop end. in_quotes: True
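
To isolate the behavior, here is a minimal standalone simulation of the per-token quote toggle (this is not the kmax tokenizer itself; the token sequence for line 995 is copied from the debug output above):

    # Simulates only the in_quotes toggle from analyze_c_tokens(), fed with the
    # tokens that the tokenizer actually produced for line 995 (see output above).
    tokens_line_995 = ['=', '"', '.,:;', "'", '\\', '"', '`~!^-+', '"', ';']

    in_quotes = False
    for token in tokens_line_995:
        if '"' in token:              # same check as in analyze_c_tokens()
            in_quotes = not in_quotes

    # The middle '"' token comes from the escaped \" inside the string literal,
    # so the toggle fires three times on this line and ends in the wrong state.
    print(in_quotes)  # True, even though the string literal is closed on this line

Because in_quotes flips once for every token containing a double quote, the escaped \" closes and then re-opens the "quoted" state, so in_quotes is still True when the #endif on line 1159 is reached. That is presumably why the directive is tagged as C code and why the conditional-block stack in get_conditional_blocks() never gets back to a single element, tripping the assertion above.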