Closed: alexdowad closed this 1 year ago
Thanks for the suggestion! PRs are always welcome :-)
This would be useful to me too. Maybe you could make the function return not just the cleaned string but also either a boolean flag indicating whether changes were made or a counter of how many. That way people wouldn't have to compare the original and the new strings, using possibly expensive and even unsafe string comparisons, to figure out whether the input was munged.
@alerque Good idea.
I drafted the code this morning, just testing it now.
@alerque Do you think a boolean flag or a count of replacements is a better API?
I'm actually not sure. For the use case I had in mind I'd actually be more interested in a table with start/end offsets for everything cleaned from the original string, but I am not suggesting that would be the best return value. Probably just a second return value with a boolean true/false would be the best start. That would also leave the door open to optional third/fourth return values being added later, corresponding to function flags that ask for more detailed information.
If you look at the code in the PR I've just opened, there is a helper function called utf8_invalid_offset. Another option would be to expose that more directly; allow the user to pass a string (and optional start offset) and give them the offset of the first invalid sequence in that string.
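To make the "offset of the first invalid sequence" idea concrete, here is a minimal sketch in Python. It is not the code from the PR; the name first_invalid_offset and its signature are hypothetical, and it leans on Python's strict UTF-8 decoder to locate the error.

```python
from typing import Optional


def first_invalid_offset(data: bytes, start: int = 0) -> Optional[int]:
    """Hypothetical helper: byte offset of the first invalid UTF-8 sequence
    at or after `start`, or None if the rest of `data` is valid UTF-8."""
    try:
        # Strict decoding raises on the first malformed sequence.
        data[start:].decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is relative to the slice, so shift it back by `start`.
        return start + err.start
    return None
```

A caller could loop on this, passing the previous offset plus one as the new start, to visit every invalid position in turn.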
That looks useful. Based on that, my thought is that just a boolean return value for the clean function would be most useful. Then, for cases like mine where I want to return contextual error information in addition to working with the cleaned string, I could check the flag and, if appropriate, iterate over the invalid sequences.
Code was merged.
There are cases where an application receives some arbitrary input from the 'outside world' (maybe from a network or read in from a file) and needs to pass it on to an upstream system which only accepts valid UTF-8. In such cases, it is useful to have a function which goes through the input and "cleans up" any invalid UTF-8 sequences, perhaps replacing them with an error marker like U+FFFD (REPLACEMENT CHARACTER).
Would you like to include something like that in this library? If so, I can send you a PR.
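For illustration, a rough sketch of what such a cleaning function could look like, written in Python rather than against this library. The name clean_utf8 is hypothetical, and the second return value is the boolean "was anything changed" flag discussed in the comments; this only shows the shape of the API, not the merged implementation.

```python
from typing import Tuple


def clean_utf8(data: bytes) -> Tuple[bytes, bool]:
    """Hypothetical sketch: return (cleaned, changed), where invalid UTF-8
    sequences in `data` are replaced with U+FFFD (REPLACEMENT CHARACTER)."""
    try:
        data.decode("utf-8")  # strict check; raises if anything is malformed
    except UnicodeDecodeError:
        # errors="replace" substitutes U+FFFD for each malformed sequence.
        cleaned = data.decode("utf-8", errors="replace").encode("utf-8")
        return cleaned, True
    # Already valid: hand back the input untouched.
    return data, False
```

Because the flag comes from the strict validity check rather than from comparing strings, input that already contains a correctly encoded U+FFFD is reported as unchanged.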