Closed DavidKorczynski closed 4 years ago
Hi @DavidKorczynski and thanks for the thorough explanation.
Just to make sure I understand the report correctly: You are passing an incomplete regex-pattern right?
"\\\x01[^\\\xff][\\\x00"
looks like it's missing a closing bracket at the end ]
?
Yes, that's correct! These are an invalid regex patterns - I was only reporting this because of them being memory corruption bugs. If you define your library such that it expects valid regex patterns and otherwise have undefined behavior then the bugs are not in scope as such.
Thanks for the update.
Please don't mistake my response for disregard. I really appreciate that you've taken the time to document this defect so thoroughly.
This code does expect valid regex patterns, but still it is not very nice to segfault or corrupt memory. I have always loathed the missing validation but I left it out because I thought implementing it would require too much energy - and code! this project aims for small footprint.
However I don't think it would require much code nor energy to fix this problem for character-classes, so I'll give it some thought and see if I can think of a good fix.
If you have any good ideas for a fix, don't hesitate to share your thought or make a PR.
Sounds good @kokke - your design choices make sense to me and thanks for the kind and quick replies.
I filed an issue of the bugs because I am making some educational videos on how to use program analysis tools (symbolic execution) in order to analyse the properties of a code base, and your project is a nice project for demonstration purposes because, as you write, it has a small footprint. If you are interested I could share a link to the videos (they are freely available on Youtube).
Thanks for the kind words @DavidKorczynski :)
Most (all?) of my public projects are designed to be runnable on embedded platforms, and usually that means I make some harsh compromises.
What a coincidence, I love static analysis! I would love to see whatever material you have available :)
I dabble with static analysis myself, mostly data-flow analysis and abstract interpretation, but I am not picky :D
I am loving the videos of Christopher Reichenbach, e.g. this one https://www.youtube.com/watch?v=NVBQSR_HdL0
... and if I may ask, how do you find projects well suited for static analysis yourself? I have stumbled into this problem just recently.
I've made a tool that clones a bunch of repos with C-code and runs my analysis-tool on them. Currently, I am just finding repos manually through Github's webpage and copy/pasting the repo-links, which gets tedious...
Awesome :)! I will add a link here when I upload the video - which should happen in one of the coming days.
That's great with data-flow analysis and abstract interpretation. I have used both extensively, and my PhD thesis involved a lot of (dynamic) taint analysis to analyse malware (you can find it on the Oxford University Research Archive here: https://ora.ox.ac.uk/objects/uuid:ef3faf5e-7733-4460-bba4-9e18893dec1e). Program analysis and software verification is my kind of jam too!
I haven't seen those videos before, but by skimming through the one you link to it seems to cover the topics nicely!
About target selection it really varies depending on the properties of the tool I am testing and the specific aspects of the tool that I am testing (edge-case/large-scale/educational demonstration/user interface/ etc.). In general I think your approach is largely the same approach as I do and also how other researchers who test tools. It can be quite a tedious process for sure, and doing a full empirical-analysis of your tool can take months of effort too. In general it's probably good to have a collection of repositories in your sleeve that you can pull out in different contexts - that will carry you a long way I think and often give you quick insights about the properties of your tool(s). Is your tool available somewhere :)?
Awesome :)! I will add a link here when I upload the video - which should happen in one of the coming days.
Nice! I'm looking forward to seeing the video. There are too few videos on static analysis in my opinion. Also thanks for taking the time to answer my questions.
That's great with data-flow analysis and abstract interpretation. I have used both extensively, and my PhD thesis involved a lot of (dynamic) taint analysis to analyse malware (you can find it on the Oxford University Research Archive here: https://ora.ox.ac.uk/objects/uuid:ef3faf5e-7733-4460-bba4-9e18893dec1e). Program analysis and software verification is my kind of jam too!
I'll take a look at your thesis and see how much I understand :) I majored in EE/CS and only did introductory coursework on formal methods and static analysis, so there is a lot of notation/syntax that I don't understand and a lot of branches of math that I am not familiar with. Often times I struggle to read current published research, and need to look much up on wikipedia alongside. Some of the older stuff is more forgiving :D
About target selection it really varies depending on the properties of the tool I am testing and the specific aspects of the tool that I am testing (edge-case/large-scale/educational demonstration/user interface/ etc.). In general I think your approach is largely the same approach as I do and also how other researchers who test tools. It can be quite a tedious process for sure, and doing a full empirical-analysis of your tool can take months of effort too. In general it's probably good to have a collection of repositories in your sleeve that you can pull out in different contexts - that will carry you a long way I think and often give you quick insights about the properties of your tool(s).
Okay - I was hoping that the Github API could be of some help in automating the process, but haven't RTFM yet. I have often used the Toyota ITC benchmark suite as a first target to get an idea of coverage of very basic errors. I know the suite has flaws, but it also does test a pretty wide range of defects so I find it useful.
Is your tool available somewhere :)?
No. I don't have a well-enough working tool at the moment. I only have toy prototypes and a tool that I have donated to my workplace and worked on there (so not allowed to share the code). It is basically just a (much) more developed version of tiny-lint working with pattern-matching at the syntax level, i.e. not very technologically advanced. That level of sophistication can go a long way though! cppcheck works great on embedded code with static allocation. It also scales well, which none of my prototype tools do :(
Currently I mostly make toy scripts to test various implementation strategies for the abstract domain. I am trying to understand Anders Møller's course notes but I am still struggling a bit with both theory and practice :D Otherwise I write scripts for pycparser to manually roll-out loops to get branchless code, insert assertions, adjust error-output by abusing/spamming the #line directive and feeding it to FOSS tools :D it's a mess, but it has worked fantastically for me.
I have somewhat emulated the approach taken by CBMC. I want to get closer, which is why I have begun studying abstract interpretation and have begun trying to emit constraints to an SMT solver. In my testing the standard integer interval domain is a bit too coarse, and finite sets scale horribly.. I am struggling with encoding the maths of more advanced theories in the SMT solver..
Hi @DavidKorczynski - I think I have a solution.
It's not the best solution, but it's the easiest for now.
https://github.com/kokke/tiny-regex-c/commit/1a279e04014b70b0695fba559a7c05d55e6ee90b
The better solution would be to pass a length of the pattern as well, and replace all null-checks with checks against the length of the string. That would enable handling of regex patterns over binary data.
I've added two test-cases for the patterns you've reported, see test_compile.c.
If you can come up with more examples, please let me know.
Hi @kokke I came across dbde6da49707a6008d33fb4ccc7d84cac79ae1f5 in my recent notifications and there are few code style related questions that I'm too tempted to ask:
pattern1
/pattern2
? My personal rule of thumb is that everything should be const
except those things that can't be constants due to some obvious reasons. Do you share this idea?re_compile
's argument type is const char *
, so it's possible to write just re_compile("expression")
or const char *pattern1 = "..."; void *p1 = re_compile(pattern1)
if it provides any additional readability.0
instead of NULL
when you're comparing pointers?
There's a similar question about '\0'
vs 0
in e.g. 1a279e04014b70b0695fba559a7c05d55e6ee90b (they say, explicit is better than implicit, heh).It doesn't change anything in tests but similar constructions may affect something in other parts of code and I'm thinking about code style consistency across all files in the project.
Probably I'm not the first one who asks these questions, I hope it doesn't annoy you.
Hi @Nable80 - you're always welcome to ask questions :)
1) There is no need for a mutable copy at all. A const char*
would do just fine. I just wanted tests for the failure-cases, to verify whether or not my proposed fix would work. I like the idea about const-strictness, but I am not too good myself at using const
I must admit. In C const has less
2) There is no need for variables either :) I made a few changes that takes some of your points into consideration: https://github.com/kokke/tiny-regex-c/commit/71eb437a5005038182283036463106e0be25671f -- any feedback is appreciated.
3) To me there is not much difference between \0
, NULL
, 0
, 0x00
etc. so it's just laziness on my part...
Consistency is a virtue and I will make efforts to get rc.c
cleaned up with a more consistent coding style :+1:
Your questions are valid and I appreciate all the feedback I get, because it often improves the code or the documentation :)
Thank you very much for your kind replies!
In C const has less
Are some words missing after this part in point 1? I think I got your idea anyway, const
is quite limited in C due to backward compatibility reasons and it's really sad. I often use C for firmware and sometimes I have to use static const
workaround, sometimes I just have to accept that compiler stores some constant array/structure as-is instead of tearing it into pieces and using this information to optimize-out all unused branches. It's still better nothing.
I made a few changes that takes some of your points into consideration: 71eb437
IMHO loop isn't better here (unless you add more code to print additional details). E.g. previous version could fail with something like "assertion failure at line 16" or "at line 20" (at least some information about failed test) but now it's always "line 25", no matter which expression fails. IMHO it's better to just shorten the previous version and add a one-line comment to each expression, something like this:
int main()
{
/* Test 1: inverted set without a closing ']' */
assert(re_compile("\\\x01[^\\\xff][^") == NULL);
/* Test 2: set with an incomplete escape sequence and without a closing ']' */
assert(re_compile("\\\x01[^\\\xff][\\") == NULL);
return 0;
}
It looks OK to me when there aren't much tests. When (if) their number grows to some degree, it would be time to start using some auto-test library with all its macro expressions, rules, fancy output, etc.
Btw, if you're interested in static analysis, here are the gcc/clang flags that I consider mandatory in my C projects (note: I don't use C++, that's why several well-known flags are absent here): -Wall -Wextra -Wnarrowing -Wwrite-strings -Wcast-qual -Werror
, they work best when combined with -O2 -fomit-frame-pointer -flto
because some analysis checks are parts of optimization passes, so they're not called if you don't enable optimizations, sometimes bugs can only be found with intermodular analysis, thus the -flto
flag.
It may also be interesting to try PVS-Studio. Warning: it's not free software at all (thus I can't recommend using it) but they offer licenses for free to all enthusiasts, AFAIR you don't even have to write them nowadays, one should just download it to an isolated VM and follow the instructions on their site. They often test their tool on different open-source projects and publish reports and explanations, you may use these reports at least as a source of interesting projects to test with your tools and compare results.
Are some words missing after this part in point 1?
Haha yes indeed - how did I miss that.. I was starting a paragraph about how const is difficult to enforce etc.- you got the gist of it.
Valid critique of the loop-construct - I will refactor back to your suggestion, which is clearer and yields better error messages (good point RE: "assertion failed on line 25").
Interesting, I did not know about -Wnarrowing and -Wqual-cast. I will check if our gcc 4.x toolchain at work supports them.
I mostly use the following set of warnings, along with -O2 or -Os, depending on need-for-speed or compactness:
-Wall -Wextra -Wundef -Wstrict-prototypes
I find -Wundef to be indispensable on a lot of C code bases: "Warn if an undefined identifier is evaluated in an #if directive. Such identifiers are replaced with zero."
I haven't tried PVS-Studio because of the non-free licensing. I think I've tried most other FOSS tools for static analysis on C code :D I will try and see if I can get my hand on a personal/research license of PVS-Studio. Good point checking which projects PVS-Studio are scanning :+1:
I will check if our gcc 4.x toolchain at work supports them.
Oh, it sounds like a somehow painful enterprise experience.
along with -O2 or -Os, depending on need-for-speed or compactness
Yes, I usually build with -O2 (desktop/server builds and sometimes I try to use it for firmware, just to compare size and check whether static analysis could find anything in addition to -Os level) or -Os (regular firmware builds) too.
Thank you for -Wundef and -Wstrict-prototypes, I thought they're part of -Wall or -Wextra but they aren't. I've added them to my Makefile and fixed several function definitions. While checking -Wundef documentation in man gcc
I found -Wunused-macros and -Wbad-function-cast, they're not enabled by default but they look promising too. There's also -Wconversion flag (and several related flags) but it may require a lot of changes to properly fix all warnings that it enables. Yet it's still worth trying.
I also found these lines today in my old notes: -pedantic -Wall -Wextra -Wunused-parameter -Wstrict-overflow=5 -Wundef -Wshadow -Wcast-qual -Wcast-align -Wconversion -Wsign-conversion -Wmissing-declarations -Wredundant-decls -Wdisabled-optimization -Woverloaded-virtual -Wctor-dtor-privacy -Wold-style-cast
-Werror -Wall -Wextra -Wpedantic -Wcast-align -Wcast-qual -Wconversion -Wctor-dtor-privacy -Wenum-compare -Wfloat-equal -Wnon-virtual-dtor -Wold-style-cast -Woverloaded-virtual -Wredundant-decls -Wsign-conversion -Wsign-promo
-ffast-math -ffp-contract=fast -fsingle-precision-constant -Wdouble-promotion
Most of these flags were already mentioned above or can be used with C++ only but it's still possible that you may find something interesting in this dump.
Just a quick update for now @kokke : The following link is the instructional video on how to use KLEE for analysis of open source projects, which is how I found the bugs reported in this issue : https://www.youtube.com/watch?v=XaYEmwVMRt4
Hi @DavidKorczynski and thanks for the update. I never got back to you, because I had a feeling the second null-check could be avoided, if using the last fix (the one only present in my comments). I wanted to test that out, so I tried getting KLEE to compile and in my quest for that, I ended up doing a full system upgrade.
It went mostly like this:
... but with a more happy ending.
I now have a current version of Debian, but I still can't compile KLEE :laughing:
Oh well, Docker is also nice :rofl: and I have a video-tutorial and everything :+1:
One small critique of the videos: The terminal/console font is almost too small for me. Even in full hd resolution, I have a hard time reading the text on my 14" laptop.
For now I have implemented all the null-checks discussed in https://github.com/kokke/tiny-regex-c/commit/1a279e04014b70b0695fba559a7c05d55e6ee90b -- see commit https://github.com/kokke/tiny-regex-c/commit/349ef187d00444349a38cfc409b9f023a3c31e5f
Agrhh @kokke - I feel bad for starting this, which ended up having you do a system upgrade :S
It can be a bit of a pain to compile KLEE from source - there are quite a few components in place and it's not obvious at first.
Thanks a lot for the critique of the videos and I totally agree and good work on fixing the bugs, thanks for being so responsive.
@DavidKorczynski !! I never gave you a proper reply... Thank you for taking the time to work with me and the project.
I've made the page below where I've begun some KLEE-checking of the code, but I've only done re_compile() for now. Do you have any test-drivers lying around with more coverage that I could use, now that I'm namedropping you anyways? ;)
https://github.com/kokke/tiny-regex-c/blob/master/formal_verification.md
I found two minor overflows in
re_compile
:First overflow
The first one is on line 121:
https://github.com/kokke/tiny-regex-c/blob/9d46276c48091a22ac4fa6472603fe4b17d483ce/re.c#L121
The bug is triggered with the following buffer input to the function:
"\\\x01[^\\\xff][^\x00"
The overflow happens because of the
[^
characters in the buffer, in particular if execution enters the condition: https://github.com/kokke/tiny-regex-c/blob/9d46276c48091a22ac4fa6472603fe4b17d483ce/re.c#L173Then the following code will execute: https://github.com/kokke/tiny-regex-c/blob/9d46276c48091a22ac4fa6472603fe4b17d483ce/re.c#L179-L190
where
i
is increased on line 181 and 190, and finally before the loop continuesi
will be increased again:https://github.com/kokke/tiny-regex-c/blob/9d46276c48091a22ac4fa6472603fe4b17d483ce/re.c#L226-L229
Which causes the overflow to happen on https://github.com/kokke/tiny-regex-c/blob/9d46276c48091a22ac4fa6472603fe4b17d483ce/re.c#L121 since
i
will now be 1-byte off in the buffer.Second overflow
The second overflow happens on line 190: https://github.com/kokke/tiny-regex-c/blob/9d46276c48091a22ac4fa6472603fe4b17d483ce/re.c#L190
This bug happens in case the following buffer is given to the function:
"\\\x01[^\\\xff][\\\x00"
The bug happens in the code:
https://github.com/kokke/tiny-regex-c/blob/9d46276c48091a22ac4fa6472603fe4b17d483ce/re.c#L173-L192
This happens because of the code:
https://github.com/kokke/tiny-regex-c/blob/9d46276c48091a22ac4fa6472603fe4b17d483ce/re.c#L190-L201
Where
i
is increased in size on lines: 190 and also line 200. Becausei
is increased on line 200 in case of a\
character,i
will be one out of bounds on line 190 when the while loop enters the second iteration.