sourcegraph / scip-clang

fix: Replace invalid UTF-8 characters in doc comments #453

Closed: varungandhi-src closed this 10 months ago

varungandhi-src commented 10 months ago

The doc comment fields in our Protobuf code are typed as strings, which means their contents must be valid UTF-8. However, we were not validating the contents before storing them.

This PR adds a validation step. If validation fails, we replace the invalid byte sequences with the standard Unicode replacement character (U+FFFD).

We chose the utfcpp library because it performs well in benchmarks and its API is very easy to use for our purposes.
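
To illustrate the validate-then-replace step described above, here is a minimal sketch using only the C++ standard library. It is not the code from this PR, which relies on utfcpp (that library exposes `utf8::is_valid` and `utf8::replace_invalid` for this kind of work). The function names `validSequenceLength` and `replaceInvalidUtf8` are hypothetical, and the sketch assumes a simple policy of emitting one U+FFFD per offending byte, which may differ from how utfcpp groups invalid bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length (1 to 4) of the well-formed UTF-8
// sequence starting at s[i], or 0 if the bytes there are not valid UTF-8.
inline std::size_t validSequenceLength(const std::string &s, std::size_t i) {
  auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
  auto isCont = [&](std::size_t k) {
    return k < s.size() && (byte(k) & 0xC0) == 0x80;
  };
  const unsigned char b0 = byte(i);
  if (b0 < 0x80) return 1;  // ASCII
  if (b0 >= 0xC2 && b0 <= 0xDF) {  // 2 bytes; 0xC0 and 0xC1 are always overlong
    return isCont(i + 1) ? 2 : 0;
  }
  if (b0 >= 0xE0 && b0 <= 0xEF) {  // 3 bytes
    if (!isCont(i + 1) || !isCont(i + 2)) return 0;
    const unsigned char b1 = byte(i + 1);
    if (b0 == 0xE0 && b1 < 0xA0) return 0;  // overlong encoding
    if (b0 == 0xED && b1 > 0x9F) return 0;  // UTF-16 surrogate range
    return 3;
  }
  if (b0 >= 0xF0 && b0 <= 0xF4) {  // 4 bytes
    if (!isCont(i + 1) || !isCont(i + 2) || !isCont(i + 3)) return 0;
    const unsigned char b1 = byte(i + 1);
    if (b0 == 0xF0 && b1 < 0x90) return 0;  // overlong encoding
    if (b0 == 0xF4 && b1 > 0x8F) return 0;  // beyond U+10FFFF
    return 4;
  }
  return 0;  // stray continuation byte or invalid lead byte
}

// Hypothetical helper: copies valid sequences through unchanged and emits
// U+FFFD (encoded as EF BF BD) for every byte that does not start one.
inline std::string replaceInvalidUtf8(const std::string &input) {
  std::string out;
  out.reserve(input.size());
  std::size_t i = 0;
  while (i < input.size()) {
    const std::size_t len = validSequenceLength(input, i);
    if (len == 0) {
      out.append("\xEF\xBF\xBD");
      ++i;
    } else {
      out.append(input, i, len);
      i += len;
    }
  }
  return out;
}
```

With this sketch, a doc comment that is already valid UTF-8 is returned unchanged, so the result is always safe to store in a Protobuf `string` field.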