j39m opened this issue 4 years ago
I feel that the solution here should be to read `max_chars` UTF-8 characters instead of `max_chars` bytes.
Agreed. I'm thinking we `read()` in increments of `max_chars` bytes (assuming that pure ASCII is the most common case, which amounts to a single `read()`), do more `read()`s if necessary and possible, and collect up to `max_chars` UTF-8 characters but never read more than 4095 bytes. Does this seem like a reasonable approach?
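A minimal sketch of that loop, assuming a 4096-byte buffer and a hypothetical `count_utf8_chars()` helper like the one sketched further down; this is illustrative, not the eventual i3status code:

```c
#include <unistd.h>
#include <stddef.h>

#define BUF_BYTES 4096

/* Hypothetical helper, sketched later in this thread. */
size_t count_utf8_chars(const char *buf, size_t len);

/* Read until we have max_chars UTF-8 characters, hit EOF, or fill
 * 4095 bytes, whichever comes first. For pure-ASCII input the first
 * read() already yields max_chars characters, so the loop runs once. */
static ssize_t read_utf8_chars(int fd, char buf[BUF_BYTES], size_t max_chars) {
    size_t total = 0;
    while (total < BUF_BYTES - 1 && count_utf8_chars(buf, total) < max_chars) {
        /* Read in increments of max_chars bytes, capped by buffer space. */
        size_t want = max_chars;
        if (want > BUF_BYTES - 1 - total)
            want = BUF_BYTES - 1 - total;
        ssize_t n = read(fd, buf + total, want);
        if (n < 0)
            return -1; /* read error */
        if (n == 0)
            break;     /* EOF */
        total += (size_t)n;
    }
    buf[total] = '\0';
    return (ssize_t)total;
}
```

(A final pass would still have to cut the buffer back to exactly `max_chars` characters, since a `read()` increment can overshoot.)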
However, all of this sounds like too much to implement in i3status itself. If you want to see C code for this, there is glib: https://gitlab.gnome.org/GNOME/glib/-/blob/master/glib/gutf8.c. Specifically, we'd need `g_utf8_validate`, `g_utf8_strlen`, and their dependencies.
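For reference, those two calls compose roughly like this (a sketch only, given that glib isn't a dependency here; `utf8_len_glib` is a made-up name):

```c
#include <glib.h>

/* Count the UTF-8 characters in buf, ignoring any invalid or clipped
 * trailing bytes. g_utf8_validate() reports the first invalid byte
 * via `end`, so truncating there drops a clipped trailing sequence
 * before counting with g_utf8_strlen(). */
static glong utf8_len_glib(const gchar *buf, gssize nbytes) {
    const gchar *end;
    if (!g_utf8_validate(buf, nbytes, &end))
        nbytes = end - buf;
    return g_utf8_strlen(buf, nbytes);
}
```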
AFAIK, glib is not an i3status dependency, unlike in i3.
When I originally envisioned this, I was only thinking of peeking at the leading 4-5 bits of any byte (marginally improving on the status quo but not implementing proper UTF-8 support). Would you prefer a PR that creates a glib dependency, or are you saying this is infeasible for now?
Now that I think of it, the following is not that hard:
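Roughly: count characters by looking only at each byte's leading bits, since continuation bytes are exactly those of the form `10xxxxxx`. A sketch (the name `count_utf8_chars` is illustrative):

```c
#include <stddef.h>

/* Count UTF-8 characters in the first `len` bytes of `buf`.
 * Every byte that is NOT a continuation byte (10xxxxxx) starts a
 * new character, so checking the top two bits of each byte suffices.
 * A clipped trailing sequence still counts as one character here;
 * stripping it is a separate step. */
size_t count_utf8_chars(const char *buf, size_t len) {
    size_t chars = 0;
    for (size_t i = 0; i < len; i++)
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}
```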
Looks good to me, thank you!
Would you be willing to add this test case to your CL?
Well, our implementations look very similar; I didn't know you had already started writing this. Anyway, let's see what the other members say about this. I'll add you as a co-author on the final commit if it's approved.
I use the `read_file` block to print a file whose contents may contain non-ASCII characters encoded in UTF-8, with `max_characters` set to 120. `print_file_contents()` doesn't attempt to observe the logical boundaries of UTF-8 octet sequences, so there are cases where the printed line ends in gibberish. (Side effect: this messes with my Pango markup.) For example, a sequence of three bytes followed by the four-byte symbol "🈚" would reproduce the issue for me.
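A self-contained illustration of the effect (the byte values and the cutoff are my own, chosen so a byte-based truncation lands inside the symbol):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Three ASCII bytes followed by the 4-byte encoding of U+1F21A "🈚". */
    const char *text = "abc\xF0\x9F\x88\x9A";
    char out[8];
    size_t cutoff = 5; /* byte-count cutoff: two bytes into the symbol */
    memcpy(out, text, cutoff);
    out[cutoff] = '\0';
    /* The clipped sequence F0 9F renders as gibberish or replacement
     * characters in the terminal. */
    printf("%s\n", out);
    return 0;
}
```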
My naive reading of RFC 3629 suggests that it's easy to detect UTF-8 octet sequences. May I submit a pull request to further truncate file contents upon encountering clipped UTF-8?
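That detection could amount to walking back from the end of the buffer: skip up to three continuation bytes (`10xxxxxx`), then compare the lead byte's promised sequence length against the bytes actually present. A sketch (the function name is made up):

```c
#include <stddef.h>

/* Per RFC 3629, a lead byte encodes the sequence length:
 * 0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
 * Return a length <= len that drops a clipped trailing sequence. */
size_t trim_clipped_utf8(const char *buf, size_t len) {
    if (len == 0)
        return 0;
    size_t i = len;
    /* Step back over at most three continuation bytes (10xxxxxx). */
    while (i > 0 && ((unsigned char)buf[i - 1] & 0xC0) == 0x80 && len - i < 3)
        i--;
    if (i == 0)
        return len; /* no lead byte in range; leave as-is */
    unsigned char lead = (unsigned char)buf[i - 1];
    size_t expect;
    if (lead < 0x80)
        expect = 1;
    else if ((lead & 0xE0) == 0xC0)
        expect = 2;
    else if ((lead & 0xF0) == 0xE0)
        expect = 3;
    else if ((lead & 0xF8) == 0xF0)
        expect = 4;
    else
        return i - 1; /* invalid lead byte; drop it */
    size_t have = len - (i - 1);
    return (have < expect) ? i - 1 : len;
}
```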