starwing / luautf8

a utf-8 support module for Lua and LuaJIT.
MIT License
412 stars 68 forks source link

Add 'is_valid' and 'clean' functions #41

Closed alexdowad closed 1 year ago

alexdowad commented 1 year ago

As discussed in #40.

TODO: Fuzzing. TODO: Amend docs.

alexdowad commented 1 year ago

Added docs for new functions.

alexdowad commented 1 year ago

Just tried fuzzing is_valid with Clang+libFuzzer+ASAN+UBSAN for crashes. I only ran the fuzzer for less than a minute, but within that time it tried around 40,000,000 test strings without finding any crashes.

alexdowad commented 1 year ago

Tried fuzzing clean for about a minute; ~70,000,000 test inputs with no crashes found.

starwing commented 1 year ago

Thanks for contribution!

maybe we could think twice about the name "is_valid"? the Lua style do not use underscore, maybe "isvalid" better?

starwing commented 1 year ago

or, maybe a "utf8.invalidposition" is better? if the utf8 is valid, the function returns nil, otherwise it returns the position of the first ocurrs of the invalid bytes.

alexdowad commented 1 year ago

Thanks for contribution!

maybe we could think twice about the name "is_valid"? the Lua style do not use underscore, maybe "isvalid" better?

Thanks for the suggestion. I consulted this page when choosing the name: http://lua-users.org/wiki/LuaStyleGuide

Booleans - It can be helpful to prefix Boolean values or functions used as predicates with is, such as is_directory rather than directory (which might store a directory object itself).

Here is an example from the standard library, where is is used as a prefix, without an underscore: https://www.lua.org/manual/5.3/manual.html#pdf-coroutine.isyieldable

I agree with removing the underscore. Would that be satisfactory?

alexdowad commented 1 year ago

or, maybe a "utf8.invalidposition" is better? if the utf8 is valid, the function returns nil, otherwise it returns the position of the first ocurrs of the invalid bytes.

I think that is also a useful API function to provide, but I think it should be separate.

alexdowad commented 1 year ago

@starwing See the code I have just pushed. I've included docs and unit tests for invalidoffset, but haven't fuzzed it.

alexdowad commented 1 year ago

Fuzzing has just uncovered an interesting issue... the output of clean is not always valid UTF-8... if the replacement string itself is not valid UTF-8!

I don't know if that's something we should do something about... or if it is just up to the user to make sure they pass a valid replacement string.

alexdowad commented 1 year ago

I am thinking that clean should check if the replacement string is valid UTF-8 and raise an error if not.

The replacement string is likely to be very short, so checking it should have only a tiny performance cost.

starwing commented 1 year ago

I am thinking that clean should check if the replacement string is valid UTF-8 and raise an error if not.

The replacement string is likely to be very short, so checking it should have only a tiny performance cost.

I'm also thinking that check the replacement string is better.

alexdowad commented 1 year ago

OK, I've implemented that.

starwing commented 1 year ago

Merged, Thanks!

alexdowad commented 1 year ago

Thank you as well!

I added more assertions to my fuzzing code and ran the fuzzer a bit more, but didn't find any other problems.

Are you interested in committing the fuzzing code to the main repo, in case it is useful for others?

alexdowad commented 1 year ago

Oh, and the next question... when are you likely to cut a new release, so I can start using these new functions?

starwing commented 1 year ago

Are you interested in committing the fuzzing code to the main repo, in case it is useful for others?

That's good, you could open another PR for the fuzzing test. Thanks!

I'll make a new release asap.