oxc-project / backlog

backlog for collaborators only

Lexer validate UTF-8 #93

Open overlookmotel opened 3 weeks ago

overlookmotel commented 3 weeks ago

Currently the parser takes source text as a &str.

This is fine within Oxc, but it imposes a cost on the consumer in the typical case where they're reading the source text from a file. Typically one would use let source_text = std::fs::read_to_string(path);. This has a hidden cost: read_to_string performs UTF-8 validation, which is not cheap. It uses std::str::from_utf8 internally, which is not very efficient - it's not even SIMD-accelerated.
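To make the hidden cost concrete, here is a minimal sketch of the two entry points: read_to_string makes an extra validation pass, while fs::read hands back raw bytes that the lexer could validate lazily. (The temp-file path is illustrative only.)

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Illustrative temp file; any source file works.
    let path = std::env::temp_dir().join("utf8_demo.js");
    fs::write(&path, b"let x = 1;")?;

    // read_to_string makes an extra pass over the bytes to validate UTF-8.
    let validated: String = fs::read_to_string(&path)?;

    // read hands back raw bytes with no validation; the lexer could
    // validate lazily while scanning instead.
    let raw: Vec<u8> = fs::read(&path)?;

    assert_eq!(validated.as_bytes(), raw.as_slice());
    fs::remove_file(&path)?;
    Ok(())
}
```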

Thanks to @lucab's efforts in https://github.com/oxc-project/oxc/pull/4298 and https://github.com/oxc-project/oxc/pull/4304, the lexer now (mostly) processes input on a byte-by-byte basis, rather than char-by-char. So now it would not be difficult to perform UTF-8 validation at the same time as lexing.

We already have separate code paths for handling Unicode chars, and ASCII text needs no validation at all, so I imagine adding UTF-8 validation would cost nothing on the fast ASCII path, and very little on the Unicode paths (which are very rarely taken anyway). And if we add support for UTF-16 spans (https://github.com/oxc-project/oxc/issues/959), we'd need logic to handle Unicode bytes anyway, so then UTF-8 validation on top of that would be almost entirely free.
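The fast/slow path split could look something like the following sketch (this is not Oxc's actual lexer; a full validator would also reject overlong encodings and surrogates, omitted here for brevity):

```rust
// Scan byte-by-byte with UTF-8 validation interleaved. ASCII bytes take the
// fast path with no checks; non-ASCII lead bytes fall into a slow path that
// validates the multi-byte sequence. Returns Err(offset) on invalid input.
fn scan(source: &[u8]) -> Result<(), usize> {
    let mut i = 0;
    while i < source.len() {
        let b = source[i];
        if b < 0x80 {
            // Fast path: ASCII needs no UTF-8 validation.
            i += 1;
        } else {
            // Slow path: determine sequence length from the lead byte.
            let len = match b {
                0xC2..=0xDF => 2,
                0xE0..=0xEF => 3,
                0xF0..=0xF4 => 4,
                _ => return Err(i), // invalid lead byte
            };
            if i + len > source.len() {
                return Err(i); // truncated sequence
            }
            // All bytes after the lead must be continuation bytes (0b10xxxxxx).
            if !source[i + 1..i + len].iter().all(|&c| c & 0xC0 == 0x80) {
                return Err(i);
            }
            i += len;
        }
    }
    Ok(())
}

fn main() {
    assert_eq!(scan("let café = 1;".as_bytes()), Ok(()));
    assert_eq!(scan(b"let \xFF = 1;"), Err(4));
}
```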

The parser would take an AsRef<[u8]> instead of a &str. If the source text passes UTF-8 validation, ParserReturn could contain it cast to a &str.
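A hypothetical shape of that API (names loosely follow the issue; this is not Oxc's actual signature, and from_utf8 here stands in for validation the lexer would perform incrementally):

```rust
use std::str;

struct ParserReturn<'a> {
    // Present only if the bytes passed UTF-8 validation during lexing.
    source_text: Option<&'a str>,
}

fn parse(source: &impl AsRef<[u8]>) -> ParserReturn<'_> {
    let bytes = source.as_ref();
    // Stand-in for validation the lexer would perform while scanning.
    let source_text = str::from_utf8(bytes).ok();
    ParserReturn { source_text }
}

fn main() {
    let from_disk: Vec<u8> = b"let x = 1;".to_vec(); // e.g. from std::fs::read
    let ret = parse(&from_disk);
    assert_eq!(ret.source_text, Some("let x = 1;"));

    let invalid: &[u8] = b"let \xFF;";
    assert_eq!(parse(&invalid).source_text, None);
}
```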

We could have individual ByteHandlers for each group of Unicode start bytes (first byte of 2-byte char, 3-byte char, 4-byte char), rather than the single UNI handler we have for all of them now.
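One way to sketch that grouping is a 256-entry dispatch table with a distinct handler per lead-byte class (names are illustrative; Oxc's actual table and handlers differ, and real handlers would validate the continuation bytes they consume):

```rust
type ByteHandler = fn(&[u8]) -> usize; // returns bytes consumed

fn ascii(_: &[u8]) -> usize { 1 }
fn uni2(_: &[u8]) -> usize { 2 }    // would validate 1 continuation byte
fn uni3(_: &[u8]) -> usize { 3 }    // would validate 2 continuation bytes
fn uni4(_: &[u8]) -> usize { 4 }    // would validate 3 continuation bytes
fn invalid(_: &[u8]) -> usize { 0 } // invalid lead byte: report an error

fn build_table() -> [ByteHandler; 256] {
    let mut table: [ByteHandler; 256] = [invalid; 256];
    for b in 0x00..=0x7F { table[b] = ascii; } // ASCII
    for b in 0xC2..=0xDF { table[b] = uni2; }  // 2-byte lead bytes
    for b in 0xE0..=0xEF { table[b] = uni3; }  // 3-byte lead bytes
    for b in 0xF0..=0xF4 { table[b] = uni4; }  // 4-byte lead bytes
    table // 0x80..=0xC1 and 0xF5..=0xFF are never valid lead bytes
}

fn main() {
    let table = build_table();
    assert_eq!(table[b'a' as usize](b"a"), 1);
    assert_eq!(table[0xC3](&[0xC3, 0xA9]), 2); // 'é'
    assert_eq!(table[0x80](&[0x80]), 0); // continuation byte as lead: invalid
}
```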

NB: When UTF-8 validation fails, we would need to sanitize the source text used in diagnostics, as the error printer relies on the source text being a valid &str.
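One possible sanitization strategy (an assumption, not necessarily what the issue intends) is lossy conversion, which replaces invalid sequences with U+FFFD so the error printer always receives a valid string:

```rust
fn main() {
    let bytes: &[u8] = b"let x = \xFF;"; // source containing invalid UTF-8
    // Invalid sequences become U+FFFD (the replacement character).
    let sanitized = String::from_utf8_lossy(bytes);
    assert_eq!(sanitized, "let x = \u{FFFD};");
    // `sanitized` is now safe to hand to the diagnostic printer.
}
```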

overlookmotel commented 2 weeks ago

See also: