taocpp / PEGTL

Parsing Expression Grammar Template Library
Boost Software License 1.0
1.93k stars 229 forks source link

Input iterator's column data member doesn't account for multibyte characters (UTF-8) #369

Open catsidhe opened 1 month ago

catsidhe commented 1 month ago

Unless I'm mistaken, column currently counts bytes. I ended up re-calculating the column by walking back from data until I see a newline character or reach the beginning of the content, counting characters while ignoring non-first UTF-8 bytes. Would've been nicer if it was handled by PEGTL.

ColinH commented 1 month ago

Am I right to assume that what you are counting are codepoints? That might be possible to add for PEGTL 4.0, though we have been apprehensive to go down that road because the next step would be to take composed characters and grapheme clusters into consideration. That's a big can of worms to open.

catsidhe commented 1 month ago

Yes, just the code points, bytes outside 0x80..0xbf range. I figured it'd cover the majority of use cases, while being simple and cheap. But I understand your concern, that can of worms might be best left closed.

Another dimension to this is double-width and non-printable characters - those with wcwidth(c) values other than 1. Looks like most editors, gcc and clang, take wcwidth into account.

ColinH commented 1 month ago

Are you using lazy or eager position tracking?

catsidhe commented 1 month ago

Sorry, I don't quite know how to answer that. Maybe because I use it in a tree? Probably incorrectly, too!

auto from_utf8(std::string_view) -> std::u32string;
inline auto line_width(std::u32string_view sv) -> int {
  return std::accumulate(sv.begin(), sv.end(), 0, [](int w, char32_t uc) {
    return w + std::max(width(uc), 0);
  });
}
class node {
  // ...
  template<typename, typename Input>
  void start(Input const &in) noexcept {
    auto i = in.iterator();
    line = i.line;
    // Can I do this? Maybe not for all inputs.
    column = line_width(from_utf8({i.data - (i.column - 1), i.data})) + 1;
  }
};

Actually, I'm no longer sure how to best calculate columns. Consider this snippet:

int /*あいうえお*/ 123;

According to VSCode, the digit 1 is on column 15. It treats each character as occupying exactly one column.

Clang thinks it's on column 25 (counting bytes, like PEGTL):

test.cpp:1:25: error: expected unqualified-id
    1 | int /*あいうえお*/ 123;
      |                    ^

GCC says 20 (presumably using wcwidth):

test.cpp:1:20: error: expected unqualified-id before numeric constant
   1 | int /*あいうえお*/ 123;
     |                    ^~~

(Arrows point to 1 in both examples above in the terminal.)

Nano: 20 Emacs: 19 (it counts from 0) Vim shows two numbers: 25-20

I'm leaning towards using wcwidth. But I don't think PEGTL should (since it's not in the standard library).

For best flexibility, perhaps a user-supplied character traits class is needed, that gets to decide how many columns each code point takes.