henriklovhaug / md-tui

Markdown renderer in the terminal
GNU Affero General Public License v3.0
205 stars 13 forks source link

Unicode handling issues #122

Closed cyqsimon closed 5 months ago

cyqsimon commented 5 months ago

Multi-char unicode characters inside code blocks seem to be improperly handled.

MRE

```sh
# πŸ™‚

Running `mdt file.md` results in a panic:

The application panicked (crashed). byte index 4 is not a char boundary; it is inside 'πŸ™‚' (bytes 3..7) of `

πŸ™‚

` in src/nodes/textcomponent.rs, line 379 thread: main



## The problem

https://github.com/henriklovhaug/md-tui/blob/046d3be0dc83663329458bac04f30006c6b526fd/src/nodes/textcomponent.rs#L375-L389

Here, `str::chars` iterates by unicode characters, which is the kind of index your `start` and `end` refer to. However, the string slice syntax is not UTF8-aware and instead indexes by bytes. A UTF8-encoded unicode character is very often not 1 byte, so `word.content()[start..end]` and `word.content()[start..]` are semantically incorrect.

I think the simplest way to fix this is to accumulate the number of bytes using [`char::len_utf8`](https://doc.rust-lang.org/std/primitive.char.html#method.len_utf8) and use that as `start` and `end`. Although there's probably a cleaner way to write the whole blob.

Also I'm not sure if there are other instances of similar mistakes within the codebase. Maybe it's worth double checking.

## Versions

0.7.3 (from ArchLinux repository) and 0.7.4 (from crates.io)
henriklovhaug commented 5 months ago

Hi. Thanks for the issue. Came across this yesterday myself. I naively thought tree-sitter would give back indexes on the char boundaries, but as you found out as well. It doesn't. Should not be an issue elsewhere, as I don't do many (any?) operations on single chars.

henriklovhaug commented 5 months ago

It's fixed. I want to fix the list alignment issue before I push out a new version