Currently the parser takes source text as a `&str`. This is fine within Oxc, but it imposes a cost on the consumer in the typical case where they're reading the source text from a file. Typically one would use `let source_text = std::fs::read_to_string(path);`. This has a hidden cost, because it performs UTF-8 validation, which is not so cheap. `read_to_string` uses `std::str::from_utf8` internally, and it's not very efficient - it's not even SIMD-accelerated.
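For illustration, the consumer-side difference looks something like this (file name is hypothetical):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Today: `read_to_string` makes one full pass over the bytes purely
    // to validate UTF-8, before the parser ever sees them.
    let source_text: String = fs::read_to_string("input.js")?;

    // With this proposal: hand raw bytes straight to the parser, and let
    // the lexer validate UTF-8 as a side effect of tokenizing.
    let source_bytes: Vec<u8> = fs::read("input.js")?;

    let _ = (source_text, source_bytes);
    Ok(())
}
```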
Thanks to @lucab's efforts in https://github.com/oxc-project/oxc/pull/4298 and https://github.com/oxc-project/oxc/pull/4304, the lexer now (mostly) processes input on a byte-by-byte basis, rather than char-by-char. So now it would not be difficult to perform UTF-8 validation at the same time as lexing.

We already have separate code paths for handling Unicode chars, and ASCII text needs no validation at all, so I imagine adding UTF-8 validation would cost nothing on the fast ASCII path, and very little on the Unicode paths (which are very rarely taken anyway). And if we add support for UTF-16 spans (https://github.com/oxc-project/oxc/issues/959), we'd need logic to handle Unicode bytes anyway, so UTF-8 validation on top of that would be almost entirely free.
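To make the "almost free" part concrete, here's a minimal sketch (not Oxc's actual code) of what the Unicode path would have to check. It only inspects bytes the lexer is already consuming, and it elides the overlong-encoding and surrogate range checks (e.g. the special second-byte ranges after `0xE0`, `0xED`, `0xF0`, `0xF4`) that full validation also needs:

```rust
/// Simplified sketch of UTF-8 validation folded into lexing.
/// Returns the byte length of the char, or `None` if invalid.
fn validate_unicode_char(bytes: &[u8], pos: usize) -> Option<usize> {
    let first = bytes[pos];
    // The ASCII fast path never dispatches here, so it pays nothing.
    debug_assert!(!first.is_ascii());
    let len = match first {
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        // Stray continuation byte or invalid lead byte.
        _ => return None,
    };
    let seq = bytes.get(pos..pos + len)?;
    // All trailing bytes must be continuation bytes (0b10xx_xxxx).
    seq[1..].iter().all(|&b| b & 0xC0 == 0x80).then_some(len)
}
```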
Parser would take an `AsRef<[u8]>` instead of a `&str`. If source text passes UTF-8 validation, `ParserReturn` could contain the source text cast to a `&str`.
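A rough sketch of what that API change could look like; names and fields here are illustrative, not Oxc's actual signatures:

```rust
use std::str;

/// Illustrative stand-in for Oxc's `ParserReturn`; the real struct
/// also carries the program, errors, etc.
struct ParserReturn<'a> {
    /// `Some` only if the source passed UTF-8 validation.
    source_text: Option<&'a str>,
}

fn parse<S: AsRef<[u8]> + ?Sized>(source: &S) -> ParserReturn<'_> {
    let bytes = source.as_ref();
    // In the proposal, validation would happen incrementally during
    // lexing; `from_utf8` here just stands in for "validation passed,
    // so casting the bytes back to `&str` is free".
    ParserReturn { source_text: str::from_utf8(bytes).ok() }
}
```

Consumers could then pass a `Vec<u8>`, `&[u8]`, or `&str` unchanged, e.g. `parse("let x = 1;")`.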
We could have individual `ByteHandler`s for each group of Unicode start bytes (first byte of a 2-byte char, 3-byte char, or 4-byte char), rather than the single `UNI` handler we have for all of them now.
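Sketched as a table-dispatch fragment (handler names are made up; the real lexer indexes a 256-entry handler table by the current byte):

```rust
struct Lexer; // stand-in for the real lexer state

type ByteHandler = fn(&mut Lexer);

fn ascii(_: &mut Lexer) { /* existing fast path, one handler per token kind */ }
fn uni_2(_: &mut Lexer) { /* expect 1 continuation byte, validate it */ }
fn uni_3(_: &mut Lexer) { /* expect 2 continuation bytes */ }
fn uni_4(_: &mut Lexer) { /* expect 3 continuation bytes */ }
fn invalid(_: &mut Lexer) { /* can never start a UTF-8 char: report error */ }

fn handler_for(byte: u8) -> ByteHandler {
    match byte {
        0x00..=0x7F => ascii,  // ASCII: untouched fast path
        0xC2..=0xDF => uni_2,  // lead byte of a 2-byte char
        0xE0..=0xEF => uni_3,  // lead byte of a 3-byte char
        0xF0..=0xF4 => uni_4,  // lead byte of a 4-byte char
        _ => invalid,          // 0x80-0xC1 and 0xF5-0xFF are invalid lead bytes
    }
}
```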
NB: When UTF-8 validation fails, we'd need to sanitize the source text used in diagnostics, as the error printer relies on the source text being a valid `&str`.
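One possible way to do that sanitization (an assumption on my part, not a settled design) is `String::from_utf8_lossy`, which borrows without allocating when the input is valid and only allocates on the error path:

```rust
use std::borrow::Cow;

/// Hypothetical helper: give the error printer a guaranteed-valid `&str`
/// by replacing ill-formed byte sequences with U+FFFD.
fn sanitize_for_diagnostics(source: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(source)
}
```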