tweag / topiary

https://topiary.tweag.io/
MIT License
582 stars 29 forks source link

Unsoundness in 'parse','child_by_field_name' and 'field_id_for_name' #792

Open lwz23 opened 5 days ago

lwz23 commented 5 days ago

Describe the bug
The method 'parse', child_by_field_name and 'field_id_for_name'uses std::str::from_utf8_unchecked to convert a byte slice to a string without validating whether the input is valid UTF-8. This can lead to undefined behavior (UB) if the input byte slice contains invalid UTF-8 sequences. Since AsRef<[u8]> does not enforce any validation on the byte slice, the caller can supply invalid UTF-8 data, violating the assumptions of from_utf8_unchecked.

pub fn child_by_field_name(&self, field_name: impl AsRef<[u8]>) -> Option<Self> {
            let field_name = field_name.as_ref();
            let field_name = unsafe { std::str::from_utf8_unchecked(field_name) };
            self.inner.child_for_field_name(field_name).map(Into::into)
        }

pub fn field_id_for_name(&self, field_name: impl AsRef<[u8]>) -> Option<u16> {
            let field_name = field_name.as_ref();
            let field_name = unsafe { std::str::from_utf8_unchecked(field_name) };
            self.inner.field_id_for_name(field_name)
        }

pub fn parse(
            &mut self,
            text: impl AsRef<[u8]>,
            old_tree: Option<&Tree>,
        ) -> Result<Option<Tree>, ParserError> {
            let text = text.as_ref();
            let text = unsafe { std::str::from_utf8_unchecked(text) };
            let text = &text.into();
            let old_tree = old_tree.map(|tree| &tree.inner);
            let options = Some(&self.options);
            self.inner
                .parse_with_string(text, old_tree, options)
                .map(|ok| ok.map(Into::into))
                .map_err(Into::into)
        }

To Reproduce
Steps to reproduce the behavior:

  1. Create an instance of the structure that implements child_by_field_name.
  2. Call the child_by_field_name method with an invalid UTF-8 byte slice as input.
fn main() {
    // Invalid UTF-8 byte sequence
    let invalid_utf8: &[u8] = &[0xFF, 0xFF]; // Invalid UTF-8 bytes
    let obj = MyStruct { inner: InnerStruct };

    obj.child_by_field_name(invalid_utf8); // Call the unsafe method
}

Expected behavior The method should validate the input byte slice for UTF-8 compliance before using std::str::from_utf8_unchecked. Invalid UTF-8 inputs should result in an error instead of causing undefined behavior.

The panic output when running the provided example:

thread 'main' panicked at core\src\panicking.rs:221:5:
unsafe precondition(s) violated: hint::unreachable_unchecked must never be reached
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread caused non-unwinding panic. aborting.

same for 'field_id_for_name' and 'parse'

Xophmeister commented 5 days ago

The use of std::str::from_utf8_unchecked is limited to topiary-tree-sitter-facade. This is vendored code which is needed for the WASM-based playground. There are plans to move away from the WASM-based playground, so correcting these occurrences is deprioritised.

The problem does not affect the Topiary CLI:

$ echo -ne '\xff\xff' > invalid-utf8.txt
$ topiary format --language json < invalid-utf8.txt
[2024-11-25T09:57:58Z ERROR topiary] Failed to read input contents
[2024-11-25T09:57:58Z ERROR topiary] Cause: stream did not contain valid UTF-8
$ topiary format --language json --query invalid-utf8.txt <<< "{}"
[2024-11-25T09:58:43Z ERROR topiary] Could not read or write to file
[2024-11-25T09:58:43Z ERROR topiary] Cause: stream did not contain valid UTF-8