tree-sitter / tree-sitter-haskell

Haskell grammar for tree-sitter.
MIT License

Combining characters in identifiers are not parsed correctly #101

Open expipiplus1 opened 1 year ago

expipiplus1 commented 1 year ago

An objectionable file and the resulting tree-sitter tree:

a = () -- single 'a'
â = () -- single 'a with circumflex' character
â = () -- single 'a' with combining circumflex u770
function [0, 0] - [0, 6]
  name: variable [0, 0] - [0, 1]
  rhs: exp_literal [0, 4] - [0, 6]
    con_unit [0, 4] - [0, 6]
comment [0, 7] - [0, 20]
function [1, 0] - [1, 7]
  name: variable [1, 0] - [1, 2]
  rhs: exp_literal [1, 5] - [1, 7]
    con_unit [1, 5] - [1, 7]
comment [1, 8] - [1, 47]
function [2, 0] - [2, 8]
  name: variable [2, 0] - [2, 1]
  ERROR [2, 1] - [2, 3]
    ERROR [2, 1] - [2, 3]
  rhs: exp_literal [2, 6] - [2, 8]
    con_unit [2, 6] - [2, 8]
comment [2, 9] - [2, 53]
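
For readers comparing the ranges above: they are consistent with columns counted in UTF-8 bytes, which is why the precomposed name spans [1, 0] - [1, 2] while the decomposed one is split into a one-byte variable and a two-byte ERROR. The sketch below is only an illustration in plain JavaScript (the helper names `codePoints` and `utf8Length` are made up for this example). It also shows that 770 is the decimal value of U+0302, which is presumably what "u770" in the third comment refers to.

```javascript
// The two spellings of "a with circumflex" used in the test file.
const precomposed = "\u00E2"; // one code point: LATIN SMALL LETTER A WITH CIRCUMFLEX
const decomposed = "a\u0302"; // 'a' followed by COMBINING CIRCUMFLEX ACCENT

// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(s) {
  return Array.from(s, (c) =>
    "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// Hypothetical helper: length of the string in UTF-8 bytes.
function utf8Length(s) {
  return new TextEncoder().encode(s).length;
}

console.log(codePoints(precomposed).join(" "), utf8Length(precomposed)); // U+00E2, 2 bytes
console.log(codePoints(decomposed).join(" "), utf8Length(decomposed));   // U+0061 U+0302, 3 bytes

// Both spellings are canonically equivalent: NFC composes the pair.
console.log(decomposed.normalize("NFC") === precomposed);

// 0x302 written in decimal is 770.
console.log(0x302);
```

Normalizing the input to NFC would sidestep the problem for this particular character, but not every base letter plus combining mark has a precomposed form, so that is not a general fix.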

Thank you for all the hard work maintaining this library btw!

tek commented 1 year ago

For varids, we use this regex:

varid_pattern = /[_\p{Ll}](\w|')*#?/u

The first character has to be in the Ll class of lowercase letters. It's unclear to me whether that would match the combined codepoints or just the a without the diacritic, but since it also fails when the combined character is at a later position, I would assume that the \w class is insufficient.
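
As a rough way to probe this, the pattern can be run in a plain JavaScript engine. This is only a sketch: tree-sitter compiles grammar regexes itself, so its treatment of \w may differ from what Node shows here. The anchors and the variant `varidWithMarks`, which additionally accepts the Unicode Mark category, are assumptions added for illustration and are not part of the grammar.

```javascript
// The pattern quoted above, anchored so that it must match a whole identifier.
const varid = /^[_\p{Ll}](\w|')*#?$/u;

// Hypothetical variant: also accept combining marks (general category M)
// after the first character.
const varidWithMarks = /^[_\p{Ll}](\w|\p{M}|')*#?$/u;

const combined = "a\u0302"; // 'a' + COMBINING CIRCUMFLEX ACCENT

// Regexes match code point by code point, so Ll only ever sees the bare 'a'.
console.log(/^\p{Ll}$/u.test("a"));      // lowercase letter
console.log(/^\p{Ll}$/u.test("\u00E2")); // the precomposed form is also Ll
console.log(/^\p{Ll}$/u.test("\u0302")); // the combining mark is not a letter
console.log(/^\p{Mn}$/u.test("\u0302")); // it is a nonspacing mark

// In JavaScript, \w stays ASCII-only even with the u flag,
// so the combining mark cannot be consumed by (\w|')*.
console.log(/^\w$/u.test("\u0302"));
console.log(varid.test(combined));
console.log(varidWithMarks.test(combined));
```

In this engine the original pattern rejects the decomposed name and the variant accepts it, which fits the guess that the trailing character class is the part that needs widening. Whether \p{M} is the right set for Haskell identifiers is a separate question for the language's lexical rules.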

Gonna investigate later, but if you have more useful insights, please let me know!