starwing / luautf8

a utf-8 support module for Lua and LuaJIT.
MIT License
412 stars 68 forks source link

Fix bugs in NFC normalization code #51

Closed alexdowad closed 3 months ago

alexdowad commented 3 months ago

Rather than writing out the whole story of how I discovered these bugs, let me point you to the blog post which I wrote about it: https://alexdowad.github.io/dont-fuzz-uninstrumented-binary/

I would love to contribute some of these new test cases to the official test suite from the Unicode Consortium; we'll see if they welcome submissions.

FYA @starwing

starwing commented 3 months ago

Maybe you could think of another of my C module lpath:

https://github.com/starwing/lpath/blob/cdc573e8bf3fabe67e67a41651de1b2da93f306c/lpath.c#L71

for the vector routines instead of raw malloc/free/realloc?

alexdowad commented 3 months ago

@starwing, I'm open to using those vector routines, but just one thing...

The code I am currently using in lua-utf8 normally uses a small, on-stack buffer for combining mark codepoints, and only switches to a dynamically allocated buffer if the on-stack one is too small. (The on-stack buffer can hold 8 codepoints, which is more than will realistically be needed for any natural language text. So basically 100% of the time, when lua-utf8 is processing natural language strings, it will never need to switch to a dynamically allocated buffer.)

Can that be done with the lpath routines?

starwing commented 3 months ago

@alexdowad you mean the buffer in lua-protobuf? https://github.com/starwing/lua-protobuf/blob/3f9ffd485262edbcde618296c07b862ba5a74b38/pb.h#L170

I just think you could make a clear interface for the buffer you use, makes it separate from the normalize algorithm, just like the one in lpath/lua-protobuf.

starwing commented 3 months ago

The changes looks good for me. Maybe I will break the complicate algorithm into same functions to make it more clear when I have time, but It indeed looks good for now.

alexdowad commented 3 months ago

@alexdowad you mean the buffer in lua-protobuf? https://github.com/starwing/lua-protobuf/blob/3f9ffd485262edbcde618296c07b862ba5a74b38/pb.h#L170

I just think you could make a clear interface for the buffer you use, makes it separate from the normalize algorithm, just like the one in lpath/lua-protobuf.

@starwing, the buffer implementation in lua-protobuf would indeed work for this use case.