Open lwz23 opened 5 days ago
The use of std::str::from_utf8_unchecked is limited to topiary-tree-sitter-facade. This is vendored code which is needed for the WASM-based playground. There are plans to move away from the WASM-based playground, so correcting these occurrences is deprioritised.
The problem does not affect the Topiary CLI:
$ echo -ne '\xff\xff' > invalid-utf8.txt
$ topiary format --language json < invalid-utf8.txt
[2024-11-25T09:57:58Z ERROR topiary] Failed to read input contents
[2024-11-25T09:57:58Z ERROR topiary] Cause: stream did not contain valid UTF-8
$ topiary format --language json --query invalid-utf8.txt <<< "{}"
[2024-11-25T09:58:43Z ERROR topiary] Could not read or write to file
[2024-11-25T09:58:43Z ERROR topiary] Cause: stream did not contain valid UTF-8
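The "stream did not contain valid UTF-8" cause above matches the error that Rust's standard Read::read_to_string reports for non-UTF-8 input. As a minimal sketch of that kind of validated read (read_input is a hypothetical helper for illustration, not Topiary's actual code):

```rust
use std::io::Read;

/// Hypothetical helper (not part of Topiary): read a whole input stream
/// into a `String`. `Read::read_to_string` validates the bytes as UTF-8
/// and returns an error of kind `ErrorKind::InvalidData` if they are not.
fn read_input<R: Read>(mut input: R) -> std::io::Result<String> {
    let mut contents = String::new();
    input.read_to_string(&mut contents)?;
    Ok(contents)
}
```

Because the bytes are validated while they are read, invalid input is rejected before any string is constructed from it.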
Describe the bug
The methods parse, child_by_field_name and field_id_for_name use std::str::from_utf8_unchecked to convert a byte slice to a string without validating whether the input is valid UTF-8. This can lead to undefined behavior (UB) if the input byte slice contains invalid UTF-8 sequences. Since AsRef<[u8]> does not enforce any validation on the byte slice, the caller can supply invalid UTF-8 data, violating the assumptions of from_utf8_unchecked.
To Reproduce
Steps to reproduce the behavior:
Call the child_by_field_name method with an invalid UTF-8 byte slice as input.
Expected behavior
The method should validate the input byte slice for UTF-8 compliance before using std::str::from_utf8_unchecked. Invalid UTF-8 inputs should result in an error instead of causing undefined behavior.
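A minimal sketch of the validation described above, assuming a method that accepts its field name as AsRef<[u8]>. The function field_name_as_str is a hypothetical stand-in, not the facade's actual signature. It uses the checked std::str::from_utf8, which returns a Utf8Error for invalid input:

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for the conversion step inside a facade method
/// such as `child_by_field_name`. The checked `std::str::from_utf8`
/// validates the bytes and returns `Err(Utf8Error)` on invalid input,
/// so no `unsafe` block is needed.
fn field_name_as_str<T: AsRef<[u8]> + ?Sized>(field_name: &T) -> Result<&str, Utf8Error> {
    std::str::from_utf8(field_name.as_ref())
}
```

A caller can then propagate the error or map it to the crate's own error type. The trade-off is a linear scan of the input on each call, which is usually negligible for short field names.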
The panic output when running the provided example:
The same applies to field_id_for_name and parse.