Function to "clean" input string from any invalid UTF-8 sequences?

starwing / luautf8

a utf-8 support module for Lua and LuaJIT.

MIT License


Function to "clean" input string from any invalid UTF-8 sequences? #40

Closed alexdowad closed 1 year ago

alexdowad commented 1 year ago

There are cases where an application receives some arbitrary input from the 'outside world' (maybe from a network or read in from a file) and needs to pass it on to an upstream system which only accepts valid UTF-8. In such cases, it is useful to have a function which goes through the input and "cleans up" any invalid UTF-8 sequences, perhaps replacing them with an error marker like U+FFFD (REPLACEMENT CHARACTER).
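To make the request concrete, here is a minimal sketch of the idea in Python rather than Lua. It is not this library's API, and the name `clean_utf8` is hypothetical; it leans on Python's built-in `errors="replace"` decoding, which substitutes U+FFFD for each invalid sequence.

```python
def clean_utf8(data: bytes) -> bytes:
    """Return `data` with every invalid UTF-8 sequence replaced by U+FFFD.

    Hypothetical helper for illustration only; it is not part of luautf8.
    """
    # Decoding with errors="replace" swaps each undecodable sequence for
    # U+FFFD; re-encoding then yields bytes that are guaranteed valid UTF-8.
    return data.decode("utf-8", errors="replace").encode("utf-8")
```

Valid input passes through unchanged, so the function is safe to apply unconditionally before handing data to a system that rejects malformed UTF-8.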

Would you like to include something like that in this library? If so, I can send you a PR.

starwing commented 1 year ago

Thanks for the suggestion! PRs are always welcome :-)

alerque commented 1 year ago

This would be useful to me too. Maybe you can consider making the function return not just the cleaned string but either a boolean flag or perhaps a counter indicating that changes were made or how many. That way people wouldn't have to compare the original and the new using possibly expensive and even unsafe string comparisons to figure out if the input was munged or not.
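A rough Python sketch of that suggestion, assuming a hypothetical `clean_utf8_counted` that returns the cleaned bytes together with the number of replacements (a count of zero doubles as the "nothing changed" flag). It is an illustration of the return shape, not the code from any PR.

```python
def clean_utf8_counted(data: bytes, marker: str = "\ufffd"):
    """Return (cleaned_bytes, replacement_count).

    Hypothetical helper: the name and return shape are assumptions made
    for this illustration.
    """
    pieces = []
    count = 0
    pos = 0
    while pos < len(data):
        try:
            # The rest of the input is valid: keep it and stop.
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:].
            pieces.append(data[pos:pos + err.start].decode("utf-8"))
            pieces.append(marker)
            count += 1
            pos += err.end  # skip past the invalid bytes
    return "".join(pieces).encode("utf-8"), count
```

Counting during the scan avoids comparing the original and cleaned strings afterwards, and it stays correct even when the input already contains a legitimate U+FFFD.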

alexdowad commented 1 year ago

@alerque Good idea.

I drafted the code this morning, just testing it now.

alexdowad commented 1 year ago

@alerque Do you think a boolean flag or a count of replacements is a better API?

alerque commented 1 year ago

I'm actually not sure. For the use case I had in mind I'd actually be more interested in a table with start/end offsets for everything cleaned from the original string, but I am not suggesting that would be the best return value. Probably just a second return value with a boolean true/false would be the best start. That would also leave the door open to optional third/fourth return values to be added corresponding to function flags that asked for more detailed information.

alexdowad commented 1 year ago

If you look at the code in the PR I've just opened, there is a helper function called utf8_invalid_offset. Another option would be to expose that more directly; allow the user to pass a string (and optional start offset) and give them the offset of the first invalid sequence in that string.
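The shape of such a function might look like the following Python sketch. The name `first_invalid_offset` and the 0-based byte offsets are assumptions for illustration; the actual helper in the PR may behave differently.

```python
def first_invalid_offset(data: bytes, start: int = 0):
    """Return the byte offset of the first invalid UTF-8 sequence at or
    after `start`, or None if the rest of the input is valid.

    Hypothetical helper; offsets here are 0-based, unlike Lua's 1-based
    string positions.
    """
    try:
        data[start:].decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is relative to the slice, so shift it back by `start`.
        return start + err.start
    return None
```

Note that if `start` falls in the middle of a multi-byte character, the stray continuation byte at `start` is itself reported as invalid.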

alerque commented 1 year ago

That looks useful. Based on that, my thought is that just a boolean return value for the clean function would be most useful. Then, for the cases (as in my use case) where I want to return contextual error information in addition to working with the cleaned string, I could check the flag and, if appropriate, iterate over the invalid cases.
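The "iterate over invalid cases" step could be sketched as a generator of (start, end) byte offsets, again in Python and with the hypothetical name `invalid_spans`. A caller would run it only when the clean function's flag reports that something was replaced.

```python
def invalid_spans(data: bytes):
    """Yield (start, end) byte offsets of each invalid UTF-8 sequence.

    Hypothetical helper for illustration; offsets are 0-based and `end`
    is exclusive.
    """
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return  # no further invalid sequences
        except UnicodeDecodeError as err:
            # Offsets on the exception are relative to data[pos:].
            yield (pos + err.start, pos + err.end)
            pos += err.end  # resume scanning after the invalid bytes
```

Each yielded span can be used to build contextual error messages, for example by quoting a few bytes on either side of the offending range.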

alexdowad commented 1 year ago

Code was merged.