j39m opened this issue 4 years ago
I feel that the solution here should be to read `max_chars` UTF-8 characters instead of `max_chars` bytes.
Agreed. I'm thinking we `read()` in increments of `max_chars` bytes (assuming that pure ASCII is the most common case, which amounts to a single `read()`), do more `read()`s if necessary and possible, and collect up to `max_chars` UTF-8 characters but never read more than 4095 bytes. Does this seem like a reasonable approach?
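A minimal sketch of that loop, assuming a 4096-byte buffer and a hypothetical `count_utf8_chars()` helper like the one sketched further down; this is illustrative, not the eventual i3status code:

```c
#include <unistd.h>
#include <stddef.h>

#define BUF_BYTES 4096

/* Hypothetical helper, sketched later in this thread. */
size_t count_utf8_chars(const char *buf, size_t len);

/* Read until we have max_chars UTF-8 characters, hit EOF, or fill
 * 4095 bytes, whichever comes first. For pure-ASCII input the first
 * read() already yields max_chars characters, so the loop runs once. */
static ssize_t read_utf8_chars(int fd, char buf[BUF_BYTES], size_t max_chars) {
    size_t total = 0;
    while (total < BUF_BYTES - 1 && count_utf8_chars(buf, total) < max_chars) {
        /* Read in increments of max_chars bytes, capped by buffer space. */
        size_t want = max_chars;
        if (want > BUF_BYTES - 1 - total)
            want = BUF_BYTES - 1 - total;
        ssize_t n = read(fd, buf + total, want);
        if (n < 0)
            return -1; /* read error */
        if (n == 0)
            break;     /* EOF */
        total += (size_t)n;
    }
    buf[total] = '\0';
    return (ssize_t)total;
}
```

(A final pass would still have to cut the buffer back to exactly `max_chars` characters, since a `read()` increment can overshoot.)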
However, all of this sounds like too much to implement in i3status itself. If you want to see C code for this, there is glib: https://gitlab.gnome.org/GNOME/glib/-/blob/master/glib/gutf8.c. Specifically, we'd need `g_utf8_validate`, `g_utf8_strlen`, and their dependencies.
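For reference, those two calls compose roughly like this (a sketch only, given that glib isn't a dependency here; `utf8_len_glib` is a made-up name):

```c
#include <glib.h>

/* Count the UTF-8 characters in buf, ignoring any invalid or clipped
 * trailing bytes. g_utf8_validate() reports the first invalid byte
 * via `end`, so truncating there drops a clipped trailing sequence
 * before counting with g_utf8_strlen(). */
static glong utf8_len_glib(const gchar *buf, gssize nbytes) {
    const gchar *end;
    if (!g_utf8_validate(buf, nbytes, &end))
        nbytes = end - buf;
    return g_utf8_strlen(buf, nbytes);
}
```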
AFAIK, glib is not an i3status dependency, unlike in i3.
When I originally envisioned this, I was only thinking of peeking at the leading 4-5 bits of any byte (marginally improving on the status quo but not implementing proper UTF-8 support). Would you prefer a PR that creates a glib dependency, or are you saying this is infeasible for now?
Now that I think of it, the following is not that hard:
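Roughly: count characters by looking only at each byte's leading bits, since continuation bytes are exactly those of the form `10xxxxxx`. A sketch (the name `count_utf8_chars` is illustrative):

```c
#include <stddef.h>

/* Count UTF-8 characters in the first `len` bytes of `buf`.
 * Every byte that is NOT a continuation byte (10xxxxxx) starts a
 * new character, so checking the top two bits of each byte suffices.
 * A clipped trailing sequence still counts as one character here;
 * stripping it is a separate step. */
size_t count_utf8_chars(const char *buf, size_t len) {
    size_t chars = 0;
    for (size_t i = 0; i < len; i++)
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}
```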
Looks good to me, thank you!
Would you be willing to add this test case to your CL?
Well, our implementations look very similar; I didn't know you had already started writing this. Anyway, let's see what the other members say about this. I'll add you as a co-author on the final commit if it's approved.
I use the `read_file` block to print a file whose contents may contain non-ASCII characters encoded in UTF-8, with `max_characters` set to 120. `print_file_contents()` doesn't attempt to observe the logical boundaries of UTF-8 octet sequences, so there are cases where the printed line ends in gibberish. (Side effect: this messes with my Pango markup.) For example, a sequence of three bytes followed by the four-byte symbol "🈚" would reproduce the issue for me.
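A self-contained illustration of the effect (the byte values and the cutoff are my own, chosen so a byte-based truncation lands inside the symbol):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Three ASCII bytes followed by the 4-byte encoding of U+1F21A "🈚". */
    const char *text = "abc\xF0\x9F\x88\x9A";
    char out[8];
    size_t cutoff = 5; /* byte-count cutoff: two bytes into the symbol */
    memcpy(out, text, cutoff);
    out[cutoff] = '\0';
    /* The clipped sequence F0 9F renders as gibberish or replacement
     * characters in the terminal. */
    printf("%s\n", out);
    return 0;
}
```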
My naive reading of RFC 3629 suggests that it's easy to detect UTF-8 octet sequences. May I submit a pull request to further truncate file contents upon encountering clipped UTF-8?
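That detection could amount to walking back from the end of the buffer: skip up to three continuation bytes (`10xxxxxx`), then compare the lead byte's promised sequence length against the bytes actually present. A sketch (the function name is made up):

```c
#include <stddef.h>

/* Per RFC 3629, a lead byte encodes the sequence length:
 * 0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
 * Return a length <= len that drops a clipped trailing sequence. */
size_t trim_clipped_utf8(const char *buf, size_t len) {
    if (len == 0)
        return 0;
    size_t i = len;
    /* Step back over at most three continuation bytes (10xxxxxx). */
    while (i > 0 && ((unsigned char)buf[i - 1] & 0xC0) == 0x80 && len - i < 3)
        i--;
    if (i == 0)
        return len; /* no lead byte in range; leave as-is */
    unsigned char lead = (unsigned char)buf[i - 1];
    size_t expect;
    if (lead < 0x80)
        expect = 1;
    else if ((lead & 0xE0) == 0xC0)
        expect = 2;
    else if ((lead & 0xF0) == 0xE0)
        expect = 3;
    else if ((lead & 0xF8) == 0xF0)
        expect = 4;
    else
        return i - 1; /* invalid lead byte; drop it */
    size_t have = len - (i - 1);
    return (have < expect) ? i - 1 : len;
}
```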