Fix fnematch buffer use-after-free

caffe3 commented 7 months ago

The new fnematch simplifies the code base by using i, j, k pointers to track the parsing. Unfortunately, when adjbuf is called, the pointers are can be invalidated because adjbuf uses realloc internally and can move the passed buf to a new memory address.

This commit switches i, j, k to be indices relative to buf that is updated on the fly.

Fixes the use-after-free on long input:

printf "%8192s\n" | tr " " "=" | ./a.out 'BEGIN{RS="th[^h]+"}{}'

caffe3 commented 7 months ago

Thanks for reviewing, @millert - let me know if any other changes should be made.

caffe3 commented 7 months ago

@mpinjr could you have a look too :+1:

caffe3 commented 7 months ago

I discovered another existing use-after-free in the same function with patbeg and added another commit to resolve that.

millert commented 7 months ago

Another option would be to add "int patoff = 0" and just set "patoff = i" instead of "pb = buf+ i". Either way you need to update patbeg at the end. I think using offsets consistently may be easier to follow.

caffe3 commented 7 months ago

@millert - I've updated as you suggested: and I agree, patoff is easier to read and follow.

millert commented 7 months ago

@caffe3 Looks good to me.

mpinjr commented 7 months ago

@caffe3

The new fnematch simplifies the code base by using i, j, k pointers to track the parsing. Unfortunately, when adjbuf is called, the pointers are can be invalidated because adjbuf uses realloc internally and can move the passed buf to a new memory address.

That was certainly a facepalming blunder. Thank you for catching it.

This commit switches i, j, k to be indices relative to buf that is updated on the fly.

It's not my call, but I prefer to keep using pointers instead of ints. Eventually, it would be nice to not be limited to INT_MAX records. And, either way, a few pointers will need updating after buffer relocation and/or resizing. Here's a diff that moves all of those updates so that they immediately follow the adjbuf that necessitates them. It compiled cleanly, passed the test suite, and your test case under valgrind.

https://github.com/mpinjr/awk/commit/a2453087b5ae7d0a213f9679e14bf634f79ae3b6

@mpinjr could you have a look too 👍

Sorry it took me a while. Very busy with other work at the moment. Thanks again for catching and reporting this. I appreciate it.

Unrelatedly, we should change MAX_UTF_BYTES to awk_mb_cur_max. For those libc's that can only ungetc one char, when operating under a single-byte character locale, we can avoid reading ahead and unnecessarily triggering multiple ungetcs. This will alleviate at least some of the ungetc issues. A complete fix is more complicated because it requires stream handling changes outside of fnematch (because the next record may not necessarily be read by fnematch if RS changes).

I hope everyone is doing well, Miguel

plan9 commented 7 months ago

thanks for the discussion. I assume the changes are settled, with enough many eyes involved. @mpinjr thanks for the suggestion vis a vis MAX_UTF_BYTES. will do. thanks again.

caffe3 commented 7 months ago

I'm happy with this PR and I don't have a strong opinion for whether pointers (in @mpinjr 's version) or my PR's indices are used.

plan9 commented 6 months ago

thank you all, @mpinjr fix has been included. MAX_UTF_BYTES is now awk_mb_cur_max.

onetrueawk / awk

Fix fnematch buffer use-after-free #223