Summary
kmax's built-in tokenizer (the one used when SuperC is not available) can misidentify where a string literal ends, particularly when the string contains many special characters.
This happens because the current tokenizer, get_tokens(), splits some inputs too finely and produces tokens that are too granular for analyze_c_tokens() to tag correctly.
A single string-detection error can then compound. In the instance shown below, preprocessor directives are tagged as "c" tokens rather than "preprocessor" tokens because they are believed to be inside a string.
Because the #endif directive is not recognized as a preprocessor token, the conditional at the top of the stack is never popped.
The leftover conditional then triggers an AssertionError when get_conditional_blocks() checks the stack with assert len(stack) == 1.
Steps to reproduce
Steps followed
To get a repaired configuration file for a commit range, I followed these steps:
Clone the Linux kernel with git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git and enter the directory.
Create a diff from a range of patches using git diff {commit1}..{commit2} > patchset.diff. I initially encountered this issue with the range 52afb15e9d9a021ab6eec923a087ec9f518cb713 to 0253d718a070ba109046299847fe8f3cf7568c3c.
Check out the source code at the later of the two commits and create a kernel configuration file with a command like make defconfig.
Run klocalizer with klocalizer --repair .config -a x86_64 --include-mutex patchset.diff --verbose.
What I expected to happen
I expected klocalizer to repair the kernel configuration file.
What actually happened
klocalizer runs into an AssertionError:
DEBUG: Doing syntax analysis on "drivers/gpu/drm/nouveau/nvif/object.c" to get constrained line ranges.
DEBUG: Syntax analysis for "drivers/gpu/drm/nouveau/nvif/object.c" found 22 unconstrained lines, 0 lines are remaining for presence condition analysis.
DEBUG: Doing syntax analysis on "drivers/gpu/drm/nouveau/nvkm/engine/disp/r535.c" to get constrained line ranges.
DEBUG: Syntax analysis for "drivers/gpu/drm/nouveau/nvkm/engine/disp/r535.c" found 2 unconstrained lines, 0 lines are remaining for presence condition analysis.
DEBUG: Doing syntax analysis on "drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c" to get constrained line ranges.
DEBUG: Syntax analysis for "drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c" found 405 unconstrained lines, 0 lines are remaining for presence condition analysis.
DEBUG: Doing syntax analysis on "drivers/gpu/drm/omapdrm/omap_dmm_tiler.c" to get constrained line ranges.
Traceback (most recent call last):
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/venv/bin/klocalizer", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/klocalizer", line 1740, in <module>
    klocalizerCLI()
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/klocalizer", line 815, in klocalizerCLI
    root_cb = SyntaxAnalysis.get_conditional_blocks_of_file(srcfile_fullpath)
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/superc.py", line 807, in get_conditional_blocks_of_file
    cb = SyntaxAnalysis.get_conditional_blocks(content, line_count)
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/superc.py", line 782, in get_conditional_blocks
    assert len(stack) == 1
AssertionError
(venv) alexei@turing:~/LinuxKernels/kmax_stress_testing/linux_rand500$
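The assertion at the bottom of the traceback can be illustrated with a small sketch. This is a simplified stand-in, not kmax's actual get_conditional_blocks(): the function name final_stack_depth and the (token, kind) input shape are assumptions made for the example. It shows why an #endif that is tagged as "c" leaves an extra entry on the stack.

```python
def final_stack_depth(tagged_tokens):
    """Return the stack depth after matching conditionals.

    Hypothetical stand-in for the stack handling in get_conditional_blocks():
    the stack starts with one root entry, opening directives push, and
    'endif' pops. Only tokens tagged "preprocessor" are considered.
    """
    stack = ["root"]
    for token, kind in tagged_tokens:
        if kind != "preprocessor":
            continue  # tokens tagged "c" or "comment" never touch the stack
        if token in ("if", "ifdef", "ifndef"):
            stack.append(token)
        elif token == "endif" and len(stack) > 1:
            stack.pop()
    return len(stack)


# Correct tagging: '#ifdef' and '#endif' are both preprocessor tokens.
good = [("#", "preprocessor"), ("ifdef", "preprocessor"),
        ("CONFIG_DEBUG_FS", "preprocessor"),
        ("#", "preprocessor"), ("endif", "preprocessor")]

# Tagging seen in this report: '#endif' is mistaken for C code.
bad = good[:3] + [("#", "c"), ("endif", "c")]

print(final_stack_depth(good))  # only the root entry remains
print(final_stack_depth(bad))   # the 'ifdef' entry is still on the stack
```

With the mis-tagged input the depth stays above one, which is the situation that assert len(stack) == 1 rejects.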
Additional information
While debugging, I added print statements to analyze_c_tokens() and focused on when in_quotes changes. The instrumented function is shown below:
def analyze_c_tokens(tokens_w_line_nums):
    """
    TODO: document
    determines what kind of code each token is (c, comment, or preprocessor).
    returns a map between line number and a list of (token mapped to type of code)
    """
    analyzed_tokens = {}
    in_quotes = False
    in_single_line_comment = False
    in_preprocessor = False
    prev_line_num = 0
    in_comment = False
    continued_preprocessor = False
    for token, line_num in tokens_w_line_nums:
        if len(token) < 1:
            continue
        print(f"\nDEBUG processing tkn '{token}', line {line_num}")
        print(f" in_quotes current state: {in_quotes}")
        if line_num == prev_line_num:
            pass
        else:
            print(f"\nDEBUG state @ line {line_num}:")
            print(f" prev preprocessor state: {in_preprocessor}")
            print(f" contd preprocessor: {continued_preprocessor}")
            if not continued_preprocessor:
                in_preprocessor = False
                print(f" resetting preprocessor state to {in_preprocessor}")
            if in_single_line_comment:
                in_comment = False
                in_single_line_comment = False
            analyzed_tokens[line_num] = []
        if token[0] == '#':
            print(f"DEBUG # check @ line {line_num}:")
            print(f" in_comment: {in_comment}")
            print(f" in_quotes: {in_quotes}")
            print(f" token[0] == '#': {token[0] == '#'}")
        # preprocessor check
        if (not in_comment) and (not in_quotes) and token[0] == '#':
            print(f"\nDEBUG found # @ line {line_num}:")
            print(f" prev state: {in_preprocessor}")
            in_preprocessor = True
            print(f" new state: {in_preprocessor}")
        print(f" before quote check, in_quotes: {in_quotes}")
        if (not in_preprocessor) and (not in_comment) and ("\"" in token):
            print(f"DEBUG quote found in token '{token}' @ line {line_num}")
            print(f" current in_quotes: {in_quotes}")
            in_quotes = not in_quotes
            print(f" new in_quotes: {in_quotes}")
        print(f" after quote check, in_quotes: {in_quotes}")
        if (not in_quotes) and (not in_comment) and ("//" in token):
            in_single_line_comment = True
            in_comment = True
        if (not in_quotes) and (not in_comment) and ("/*" in token):
            in_comment = True
        # add the token with code type
        current_type = "preprocessor" if in_preprocessor else ("comment" if in_comment else "c")
        print(f" token: {token}, type: {current_type}")
        if in_comment:
            analyzed_tokens[line_num].append({token: "comment"})
        elif in_preprocessor:
            # handle case where no space between directive and parenthesis
            found_directive = False  # track if directive found
            for directive in ['if', 'ifdef', 'ifndef', 'elif', 'else', 'endif']:
                if token.startswith(directive + '('):
                    directive_token = directive
                    remaining_token = token[len(directive):]  # capture remaining part by slicing at length of directive
                    analyzed_tokens[line_num].append({directive_token: "preprocessor"})  # add directive
                    if remaining_token:
                        analyzed_tokens[line_num].append({remaining_token: "preprocessor"})  # add remaining token
                    found_directive = True
                    break
            # if no directive found: just add token as whole
            if not found_directive:
                analyzed_tokens[line_num].append({token: "preprocessor"})
        else:
            analyzed_tokens[line_num].append({token: "c"})
        if in_comment and ("*/" in token):
            in_comment = False
        if token == '\\':
            continued_preprocessor = True
            print(f" continuation found! setting continued_preprocessor: {continued_preprocessor}")
        else:
            continued_preprocessor = False
        print(f"DEBUG loop end. in_quotes: {in_quotes}")
        prev_line_num = line_num
    return analyzed_tokens
For context, I had also inserted debug print statements elsewhere, especially in get_conditional_blocks(), because I initially believed the fault was in how that function closes preprocessor conditionals. That turned out not to be where the problem was, so I investigated related functions such as analyze_c_tokens(). The actual problem stems from a string containing many special characters, including an escaped double quote, in drivers/gpu/drm/omapdrm/omap_dmm_tiler.c:
...
/*
* debugfs support
*/
#ifdef CONFIG_DEBUG_FS //line 991
static const char *alphabet = "abcdefghijklmnopqrstuvwxyz"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
static const char *special = ".,:;'\"`~!^-+"; //this is the problem
static void fill_map(char **map, int xdiv, int ydiv, struct tcm_area *a,
char c, bool ovw)
...
}
error:
kfree(map);
kfree(global_map);
return 0;
}
#endif //associated endif at line 1159
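For comparison with the excerpt above, here is a minimal sketch of a tokenizer that keeps each string literal as a single token. It is not kmax's get_tokens(); TOKEN_RE and tokenize are hypothetical names, and the pattern covers only enough of C to handle this one line.

```python
import re

# One alternative per token class. String and character literals come first so
# that an escaped quote (\") stays inside the literal instead of ending it.
TOKEN_RE = re.compile(r'''
    "(?:\\.|[^"\\\n])*"      # string literal with backslash escapes
  | '(?:\\.|[^'\\\n])*'      # character literal
  | [A-Za-z_]\w*             # identifier or keyword
  | \d+                      # integer
  | \S                       # any other single non-space character
''', re.VERBOSE)


def tokenize(line):
    """Split one line of C source into tokens (illustrative only)."""
    return TOKEN_RE.findall(line)


# The problematic declaration from omap_dmm_tiler.c.
line = r'''static const char *special = ".,:;'\"`~!^-+";'''
tokens = tokenize(line)

print(len(tokens))   # number of tokens on the line
print(tokens[-2])    # the whole literal, escaped quote included
print(tokens[-1])    # the terminating semicolon
```

Because the literal arrives as one token that both starts and ends with a double quote, a downstream tagger would not need to track quote state across tokens for this line. Comments, multi-line constructs, and preprocessor continuations are deliberately out of scope here.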
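The debug output from analyze_c_tokens() is shown below. Note that the *alphabet strings are tokenized as whole strings, while *special is split into several pieces. This is what breaks analyze_c_tokens(): the escaped quote (\") arrives as separate \ and " tokens, the " flips in_quotes to False in the middle of the string, and the real closing quote then flips it back to True. Finally, note that the #endif on line 1159 is processed while in_quotes is still True and is therefore tagged as C code.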
DEBUG processing tkn 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', line 994 # the tokenizer properly accounts for a FULL string
in_quotes current state: True
before quote check, in_quotes: True
after quote check, in_quotes: True
token: ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789, type: c
DEBUG loop end. in_quotes: True
DEBUG processing tkn '"', line 994
in_quotes current state: True
before quote check, in_quotes: True
DEBUG quote found in token '"' @ line 994
current in_quotes: True
new in_quotes: False
after quote check, in_quotes: False
token: ", type: c
DEBUG loop end. in_quotes: False
... [continued] ...
DEBUG processing tkn '=', line 995
in_quotes current state: False
before quote check, in_quotes: False
after quote check, in_quotes: False
token: =, type: c
DEBUG loop end. in_quotes: False
DEBUG processing tkn '"', line 995
in_quotes current state: False
before quote check, in_quotes: False
DEBUG quote found in token '"' @ line 995
current in_quotes: False
new in_quotes: True
after quote check, in_quotes: True
token: ", type: c
DEBUG loop end. in_quotes: True
DEBUG processing tkn '.,:;', line 995 # one part of the string
in_quotes current state: True
before quote check, in_quotes: True
after quote check, in_quotes: True
token: .,:;, type: c
DEBUG loop end. in_quotes: True
DEBUG processing tkn ''', line 995
in_quotes current state: True
before quote check, in_quotes: True
after quote check, in_quotes: True
token: ', type: c
DEBUG loop end. in_quotes: True
DEBUG processing tkn '\', line 995
in_quotes current state: True
before quote check, in_quotes: True
after quote check, in_quotes: True
token: \, type: c
continuation found! setting continued_preprocessor: True # another issue, likely caused by this bug?
DEBUG loop end. in_quotes: True
DEBUG processing tkn '"', line 995
in_quotes current state: True
before quote check, in_quotes: True
DEBUG quote found in token '"' @ line 995
current in_quotes: True
new in_quotes: False
after quote check, in_quotes: False
token: ", type: c
DEBUG loop end. in_quotes: False
DEBUG processing tkn '`~!^-+', line 995 # another part of the string
in_quotes current state: False
before quote check, in_quotes: False
after quote check, in_quotes: False
token: `~!^-+, type: c
DEBUG loop end. in_quotes: False
DEBUG processing tkn '"', line 995
in_quotes current state: False
before quote check, in_quotes: False
DEBUG quote found in token '"' @ line 995
current in_quotes: False
new in_quotes: True
after quote check, in_quotes: True
token: ", type: c
DEBUG loop end. in_quotes: True
DEBUG processing tkn ';', line 995
in_quotes current state: True
before quote check, in_quotes: True
after quote check, in_quotes: True
token: ;, type: c
DEBUG loop end. in_quotes: True # NOTICE: still believed to be "in_quotes"
... [continued] ...
DEBUG processing tkn '#', line 1159
in_quotes current state: True
DEBUG state @ line 1159:
prev preprocessor state: False
contd preprocessor: False
resetting preprocessor state to False
DEBUG # check @ line 1159:
in_comment: False
in_quotes: True
token[0] == '#': True
before quote check, in_quotes: True
after quote check, in_quotes: True
token: #, type: c
DEBUG loop end. in_quotes: True
DEBUG processing tkn 'endif', line 1159
in_quotes current state: True
before quote check, in_quotes: True # should NOT be "in_quotes"
after quote check, in_quotes: True
token: endif, type: c # should NOT be "C code"
DEBUG loop end. in_quotes: True
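The trace above can be reproduced in isolation. The sketch below is an illustration under assumptions, not a patch: naive_in_quotes copies only the quote toggle from analyze_c_tokens(), escape_aware_in_quotes is a hypothetical variant that skips a " preceded by a lone backslash token, and the token list is transcribed from the line 995 debug output.

```python
# Tokens for the string initializer on line 995, as shown in the debug output.
TOKENS = ['=', '"', '.,:;', "'", '\\', '"', '`~!^-+', '"', ';']


def naive_in_quotes(tokens):
    """Toggle on every token containing '"', like the check in analyze_c_tokens()."""
    in_quotes = False
    for token in tokens:
        if '"' in token:
            in_quotes = not in_quotes
    return in_quotes


def escape_aware_in_quotes(tokens):
    """Hypothetical variant: ignore a '"' that directly follows a backslash token."""
    in_quotes = False
    escaped = False
    for token in tokens:
        if token == '"' and not (in_quotes and escaped):
            in_quotes = not in_quotes
        # A lone backslash inside a string escapes the next token; two in a row
        # cancel out (assuming each backslash arrives as its own token).
        escaped = in_quotes and token == '\\' and not escaped
    return in_quotes


print(naive_in_quotes(TOKENS))         # state after the ';' with the current toggle
print(escape_aware_in_quotes(TOKENS))  # state after the ';' when \" is skipped
```

The naive toggle ends the statement still "in quotes", matching the trace, while the escape-aware variant ends it outside the string. This variant does not handle a double quote inside a character literal ('"'), so it is only a starting point for discussion.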