tree-sitter / tree-sitter-haskell

Haskell grammar for tree-sitter.
MIT License

UnicodeSyntax support #93

Open maralorn opened 1 year ago

maralorn commented 1 year ago

I may be holding it wrong, but at least some Unicode symbols are not supported as syntax:

e.g.:

processStateUpdater ∷
  ∀ a m.
  (NOMInput a, UpdateMonad m) ⇒
  Config →
  u  a →
  StateT (ProcessState a) m ([NOMError], ByteString)

gives me


(haskell [0, 0] - [5, 52]
  (top_splice [0, 0] - [5, 52]
    (exp_infix [0, 0] - [5, 52]
      (exp_apply [0, 0] - [1, 9]
        (exp_name [0, 0] - [0, 19]
          (variable [0, 0] - [0, 19]))
        (ERROR [0, 20] - [1, 5]
          (ERROR [0, 20] - [0, 23]))
        (exp_name [1, 6] - [1, 7]
          (variable [1, 6] - [1, 7]))
        (exp_name [1, 8] - [1, 9]
          (variable [1, 8] - [1, 9])))
      (operator [1, 9] - [1, 10])
      (exp_apply [2, 2] - [5, 52]
        (exp_tuple [2, 2] - [2, 29]
          (exp_apply [2, 3] - [2, 13]
            (exp_name [2, 3] - [2, 11]
              (constructor [2, 3] - [2, 11]))
            (exp_name [2, 12] - [2, 13]
              (variable [2, 12] - [2, 13])))
          (comma [2, 13] - [2, 14])
          (exp_apply [2, 15] - [2, 28]
            (exp_name [2, 15] - [2, 26]
              (constructor [2, 15] - [2, 26]))
            (exp_name [2, 27] - [2, 28]
              (variable [2, 27] - [2, 28]))))
        (ERROR [2, 30] - [2, 33]
          (ERROR [2, 30] - [2, 33]))
        (exp_name [3, 2] - [3, 8]
          (constructor [3, 2] - [3, 8]))
        (ERROR [3, 9] - [3, 12]
          (ERROR [3, 9] - [3, 12]))
        (exp_name [4, 2] - [4, 3]
          (variable [4, 2] - [4, 3]))
        (ERROR [4, 4] - [4, 7]
          (ERROR [4, 4] - [4, 7]))
        (exp_name [5, 2] - [5, 8]
          (constructor [5, 2] - [5, 8]))
        (exp_parens [5, 9] - [5, 25]
          (exp_apply [5, 10] - [5, 24]
            (exp_name [5, 10] - [5, 22]
              (constructor [5, 10] - [5, 22]))
            (exp_name [5, 23] - [5, 24]
              (variable [5, 23] - [5, 24]))))
        (exp_name [5, 26] - [5, 27]
          (variable [5, 26] - [5, 27]))
        (exp_tuple [5, 28] - [5, 52]
          (exp_list [5, 29] - [5, 39]
            (exp_name [5, 30] - [5, 38]
              (constructor [5, 30] - [5, 38])))
          (comma [5, 39] - [5, 40])
          (exp_name [5, 41] - [5, 51]
            (constructor [5, 41] - [5, 51])))))))

tek commented 1 year ago

right! I added some basics now, but there are some more missing.

maralorn commented 1 year ago

Thank you for the quick reaction. Yeah, those are probably the most important, nice.

Here is the list of all symbols, not so many are missing:

https://downloads.haskell.org/ghc/latest/docs/users_guide/exts/unicode_syntax.html
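
For reference, the symbols that appear in the example at the top of this thread can be written down as a small lookup from code point to ASCII spelling. This is only an illustrative sketch in C: the names `UnicodeSyntaxEntry` and `unicode_syntax_ascii` are made up here, the table is a subset of the GHC list linked above, and it is not how the grammar itself handles these tokens.

```c
#include <stddef.h>
#include <stdint.h>

/* One UnicodeSyntax symbol and the ASCII spelling it stands for. */
typedef struct {
  uint32_t code_point;
  const char *ascii;
} UnicodeSyntaxEntry;

/* Subset of the GHC UnicodeSyntax table (hypothetical layout, not the
 * grammar's own data structure). */
static const UnicodeSyntaxEntry unicode_syntax[] = {
  {0x2237, "::"},     /* ∷ PROPORTION */
  {0x21D2, "=>"},     /* ⇒ RIGHTWARDS DOUBLE ARROW */
  {0x2192, "->"},     /* → RIGHTWARDS ARROW */
  {0x2190, "<-"},     /* ← LEFTWARDS ARROW */
  {0x2200, "forall"}, /* ∀ FOR ALL */
};

/* Returns the ASCII spelling for a UnicodeSyntax code point,
 * or NULL if the code point is not in the table. */
static const char *unicode_syntax_ascii(uint32_t code_point) {
  size_t count = sizeof unicode_syntax / sizeof unicode_syntax[0];
  for (size_t i = 0; i < count; i++) {
    if (unicode_syntax[i].code_point == code_point) {
      return unicode_syntax[i].ascii;
    }
  }
  return NULL;
}
```

A linear scan is enough here because the full list has only a handful of entries.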

tek commented 1 year ago

yep, already have that tab open :wink:

timtro commented 1 year ago

Thanks in advance. I wish I could help solve and not merely report the issue. But I'm getting errors when I simply use Unicode characters in/as identifiers.

Example below, and please, don't judge me for the quality of this code. It's my first Haskell program, and it's fit for a very specific purpose which is not production. (It mirrors a theoretical construction in my PhD thesis in systems theory.)

{-# LANGUAGE InstanceSigs  #-}
{-# LANGUAGE UnicodeSyntax #-}

module PrAlgebra where

import           Data.Fix (Fix (Fix), foldFix, unFix)

(▽) :: (a → c) → (b → c) → Either a b → c
(▽) = either

(△) :: (b → c) → (b → c') → b → (c, c')
(△) f g x = (f x, g x)

newtype 𝘗ᵣ hd tl = Pᵣ (Maybe (tl, hd))

instance Functor (𝘗ᵣ hd) where
  fmap :: (a → b) → 𝘗ᵣ hd a → 𝘗ᵣ hd b
  fmap f (Pᵣ Nothing)         = Pᵣ Nothing
  fmap f (Pᵣ (Just (tl, hd))) = Pᵣ (Just (f tl, hd))

type 𝘗ᵣAlgebra state value =  𝘗ᵣ value state → state

type Snoc hd = Fix(𝘗ᵣ hd)

snoc :: Snoc a → a → Snoc a
snoc xs x = Fix (Pᵣ (Just (xs, x)))

In terms of syntax highlighting, everything is coloured as a type. Here is a screenshot where a constructor is being called a type.

[Screenshot from 2022-12-22 09-50-47]

The tree is listed below.

pragma [0, 0] - [0, 30]
pragma [1, 0] - [1, 30]
module: module [3, 7] - [3, 16]
where [3, 17] - [3, 22]
ERROR [5, 0] - [25, 37]
  import [5, 0] - [5, 53]
    qualified_module [5, 17] - [5, 25]
      module [5, 17] - [5, 21]
      module [5, 22] - [5, 25]
    import_list [5, 26] - [5, 53]
      import_item [5, 27] - [5, 36]
        type [5, 27] - [5, 30]
        import_con_names [5, 31] - [5, 36]
          constructor [5, 32] - [5, 35]
      comma [5, 36] - [5, 37]
      import_item [5, 38] - [5, 45]
        variable [5, 38] - [5, 45]
      comma [5, 45] - [5, 46]
      import_item [5, 47] - [5, 52]
        variable [5, 47] - [5, 52]
  pat_literal [7, 0] - [7, 5]
    con_unit [7, 0] - [7, 5]
      ERROR [7, 1] - [7, 4]
        ERROR [7, 1] - [7, 4]
  type_parens [7, 9] - [7, 18]
    fun [7, 10] - [7, 17]
      type_name [7, 10] - [7, 11]
        type_variable [7, 10] - [7, 11]
      type_name [7, 16] - [7, 17]
        type_variable [7, 16] - [7, 17]
  type_parens [7, 23] - [7, 32]
    fun [7, 24] - [7, 31]
      type_name [7, 24] - [7, 25]
        type_variable [7, 24] - [7, 25]
      type_name [7, 30] - [7, 31]
        type_variable [7, 30] - [7, 31]
  type_apply [7, 37] - [7, 47]
    type_name [7, 37] - [7, 43]
      type [7, 37] - [7, 43]
    type_name [7, 44] - [7, 45]
      type_variable [7, 44] - [7, 45]
    type_name [7, 46] - [7, 47]
      type_variable [7, 46] - [7, 47]
  constraint [7, 52] - [25, 37]
    class: class_name [7, 52] - [7, 53]
      type_variable [7, 52] - [7, 53]
    type_literal [8, 0] - [8, 5]
      con_unit [8, 0] - [8, 5]
        ERROR [8, 1] - [8, 4]
          ERROR [8, 1] - [8, 4]
    ERROR [8, 6] - [8, 7]
    type_name [8, 8] - [8, 14]
      type_variable [8, 8] - [8, 14]
    type_literal [10, 0] - [10, 5]
      con_unit [10, 0] - [10, 5]
        ERROR [10, 1] - [10, 4]
          ERROR [10, 1] - [10, 4]
    ERROR [10, 6] - [10, 8]
    type_parens [10, 9] - [10, 18]
      fun [10, 10] - [10, 17]
        type_name [10, 10] - [10, 11]
          type_variable [10, 10] - [10, 11]
        type_name [10, 16] - [10, 17]
          type_variable [10, 16] - [10, 17]
    ERROR [10, 19] - [10, 22]
    type_parens [10, 23] - [10, 33]
      fun [10, 24] - [10, 32]
        type_name [10, 24] - [10, 25]
          type_variable [10, 24] - [10, 25]
        type_name [10, 30] - [10, 32]
          type_variable [10, 30] - [10, 32]
    ERROR [10, 34] - [10, 37]
    type_name [10, 38] - [10, 39]
      type_variable [10, 38] - [10, 39]
    ERROR [10, 40] - [10, 43]
    type_tuple [10, 44] - [10, 51]
      type_name [10, 45] - [10, 46]
        type_variable [10, 45] - [10, 46]
      comma [10, 46] - [10, 47]
      type_name [10, 48] - [10, 50]
        type_variable [10, 48] - [10, 50]
    type_literal [11, 0] - [11, 5]
      con_unit [11, 0] - [11, 5]
        ERROR [11, 1] - [11, 4]
          ERROR [11, 1] - [11, 4]
    type_name [11, 6] - [11, 7]
      type_variable [11, 6] - [11, 7]
    type_name [11, 8] - [11, 9]
      type_variable [11, 8] - [11, 9]
    type_name [11, 10] - [11, 11]
      type_variable [11, 10] - [11, 11]
    ERROR [11, 12] - [11, 13]
    type_tuple [11, 14] - [11, 24]
      type_apply [11, 15] - [11, 18]
        type_name [11, 15] - [11, 16]
          type_variable [11, 15] - [11, 16]
        type_name [11, 17] - [11, 18]
          type_variable [11, 17] - [11, 18]
      comma [11, 18] - [11, 19]
      type_apply [11, 20] - [11, 23]
        type_name [11, 20] - [11, 21]
          type_variable [11, 20] - [11, 21]
        type_name [11, 22] - [11, 23]
          type_variable [11, 22] - [11, 23]
    type_name [13, 0] - [13, 7]
      type_variable [13, 0] - [13, 7]
    ERROR [13, 8] - [13, 15]
      ERROR [13, 8] - [13, 15]
    type_name [13, 16] - [13, 18]
      type_variable [13, 16] - [13, 18]
    type_name [13, 19] - [13, 21]
      type_variable [13, 19] - [13, 21]
    ERROR [13, 22] - [13, 23]
    type_name [13, 24] - [13, 25]
      type [13, 24] - [13, 25]
    ERROR [13, 25] - [13, 28]
      ERROR [13, 25] - [13, 28]
    type_parens [13, 29] - [13, 45]
      type_apply [13, 30] - [13, 44]
        type_name [13, 30] - [13, 35]
          type [13, 30] - [13, 35]
        type_tuple [13, 36] - [13, 44]
          type_name [13, 37] - [13, 39]
            type_variable [13, 37] - [13, 39]
          comma [13, 39] - [13, 40]
          type_name [13, 41] - [13, 43]
            type_variable [13, 41] - [13, 43]
    type_name [15, 0] - [15, 8]
      type_variable [15, 0] - [15, 8]
    type_name [15, 9] - [15, 16]
      type [15, 9] - [15, 16]
    type_parens [15, 17] - [15, 29]
      ERROR [15, 18] - [15, 25]
        ERROR [15, 18] - [15, 25]
      type_name [15, 26] - [15, 28]
        type_variable [15, 26] - [15, 28]
    type_name [15, 30] - [15, 35]
      type_variable [15, 30] - [15, 35]
    type_name [16, 2] - [16, 6]
      type_variable [16, 2] - [16, 6]
    ERROR [16, 7] - [16, 9]
    type_parens [16, 10] - [16, 19]
      fun [16, 11] - [16, 18]
        type_name [16, 11] - [16, 12]
          type_variable [16, 11] - [16, 12]
        type_name [16, 17] - [16, 18]
          type_variable [16, 17] - [16, 18]
    ERROR [16, 20] - [16, 31]
      ERROR [16, 24] - [16, 31]
    type_name [16, 32] - [16, 34]
      type_variable [16, 32] - [16, 34]
    type_name [16, 35] - [16, 36]
      type_variable [16, 35] - [16, 36]
    ERROR [16, 37] - [16, 48]
      ERROR [16, 41] - [16, 48]
    type_name [16, 49] - [16, 51]
      type_variable [16, 49] - [16, 51]
    type_name [16, 52] - [16, 53]
      type_variable [16, 52] - [16, 53]
    type_name [17, 2] - [17, 6]
      type_variable [17, 2] - [17, 6]
    type_name [17, 7] - [17, 8]
      type_variable [17, 7] - [17, 8]
    type_parens [17, 9] - [17, 23]
      type_apply [17, 10] - [17, 22]
        type_name [17, 10] - [17, 11]
          type [17, 10] - [17, 11]
        ERROR [17, 11] - [17, 14]
          ERROR [17, 11] - [17, 14]
        type_name [17, 15] - [17, 22]
          type [17, 15] - [17, 22]
    ERROR [17, 32] - [17, 33]
    type_name [17, 34] - [17, 35]
      type [17, 34] - [17, 35]
    ERROR [17, 35] - [17, 38]
      ERROR [17, 35] - [17, 38]
    type_name [17, 39] - [17, 46]
      type [17, 39] - [17, 46]
    type_name [18, 2] - [18, 6]
      type_variable [18, 2] - [18, 6]
    type_name [18, 7] - [18, 8]
      type_variable [18, 7] - [18, 8]
    type_parens [18, 9] - [18, 31]
      type_apply [18, 10] - [18, 30]
        type_name [18, 10] - [18, 11]
          type [18, 10] - [18, 11]
        ERROR [18, 11] - [18, 14]
          ERROR [18, 11] - [18, 14]
        type_parens [18, 15] - [18, 30]
          type_apply [18, 16] - [18, 29]
            type_name [18, 16] - [18, 20]
              type [18, 16] - [18, 20]
            type_tuple [18, 21] - [18, 29]
              type_name [18, 22] - [18, 24]
                type_variable [18, 22] - [18, 24]
              comma [18, 24] - [18, 25]
              type_name [18, 26] - [18, 28]
                type_variable [18, 26] - [18, 28]
    ERROR [18, 32] - [18, 33]
    type_name [18, 34] - [18, 35]
      type [18, 34] - [18, 35]
    ERROR [18, 35] - [18, 38]
      ERROR [18, 35] - [18, 38]
    type_parens [18, 39] - [18, 56]
      type_apply [18, 40] - [18, 55]
        type_name [18, 40] - [18, 44]
          type [18, 40] - [18, 44]
        type_tuple [18, 45] - [18, 55]
          type_apply [18, 46] - [18, 50]
            type_name [18, 46] - [18, 47]
              type_variable [18, 46] - [18, 47]
            type_name [18, 48] - [18, 50]
              type_variable [18, 48] - [18, 50]
          comma [18, 50] - [18, 51]
          type_name [18, 52] - [18, 54]
            type_variable [18, 52] - [18, 54]
    type_name [20, 0] - [20, 4]
      type_variable [20, 0] - [20, 4]
    ERROR [20, 5] - [20, 12]
      ERROR [20, 5] - [20, 12]
    type_name [20, 12] - [20, 19]
      type [20, 12] - [20, 19]
    type_name [20, 20] - [20, 25]
      type_variable [20, 20] - [20, 25]
    type_name [20, 26] - [20, 31]
      type_variable [20, 26] - [20, 31]
    ERROR [20, 32] - [20, 42]
      ERROR [20, 35] - [20, 42]
    type_name [20, 43] - [20, 48]
      type_variable [20, 43] - [20, 48]
    type_name [20, 49] - [20, 54]
      type_variable [20, 49] - [20, 54]
    ERROR [20, 55] - [20, 58]
    type_name [20, 59] - [20, 64]
      type_variable [20, 59] - [20, 64]
    type_name [22, 0] - [22, 4]
      type_variable [22, 0] - [22, 4]
    type_name [22, 5] - [22, 9]
      type [22, 5] - [22, 9]
    type_name [22, 10] - [22, 12]
      type_variable [22, 10] - [22, 12]
    ERROR [22, 13] - [22, 14]
    type_name [22, 15] - [22, 18]
      type [22, 15] - [22, 18]
    type_parens [22, 18] - [22, 30]
      ERROR [22, 19] - [22, 26]
        ERROR [22, 19] - [22, 26]
      type_name [22, 27] - [22, 29]
        type_variable [22, 27] - [22, 29]
    type_name [24, 0] - [24, 4]
      type_variable [24, 0] - [24, 4]
    ERROR [24, 5] - [24, 7]
    type_name [24, 8] - [24, 12]
      type [24, 8] - [24, 12]
    type_name [24, 13] - [24, 14]
      type_variable [24, 13] - [24, 14]
    ERROR [24, 15] - [24, 18]
    type_name [24, 19] - [24, 20]
      type_variable [24, 19] - [24, 20]
    ERROR [24, 21] - [24, 24]
    type_name [24, 25] - [24, 29]
      type [24, 25] - [24, 29]
    type_name [24, 30] - [24, 31]
      type_variable [24, 30] - [24, 31]
    type_name [25, 0] - [25, 4]
      type_variable [25, 0] - [25, 4]
    type_name [25, 5] - [25, 7]
      type_variable [25, 5] - [25, 7]
    type_name [25, 8] - [25, 9]
      type_variable [25, 8] - [25, 9]
    ERROR [25, 10] - [25, 11]
    type_name [25, 12] - [25, 15]
      type [25, 12] - [25, 15]
    type_parens [25, 16] - [25, 37]
      type_apply [25, 17] - [25, 36]
        type_name [25, 17] - [25, 18]
          type [25, 17] - [25, 18]
        ERROR [25, 18] - [25, 21]
          ERROR [25, 18] - [25, 21]
        type_parens [25, 22] - [25, 36]
          type_apply [25, 23] - [25, 35]
            type_name [25, 23] - [25, 27]
              type [25, 23] - [25, 27]
            type_tuple [25, 28] - [25, 35]
              type_name [25, 29] - [25, 31]
                type_variable [25, 29] - [25, 31]
              comma [25, 31] - [25, 32]
              type_name [25, 33] - [25, 34]
                type_variable [25, 33] - [25, 34]

tek commented 1 year ago

I added three more symbols for built-in syntax.

I also took a look at the symbolic operator situation, and it's a little bit more difficult. Legal characters for these varsyms are determined by membership in Unicode categories, which contain about 6000 code points in noncontiguous intervals.

We are parsing varsyms in the scanner, which means we don't have access to the Unicode category regex classes that are provided by tree-sitter. I couldn't find a method to do this in standard C, but maybe someone knows better? For what it's worth, I tried adding a switch with 6k cases and performance only degraded by about 1%.
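
One possible alternative to a switch with one case per code point is a sorted table of inclusive ranges searched with binary search, sketched below in plain C. This is a sketch under assumptions, not what the scanner does: the names `CodePointRange`, `symbol_ranges` and `is_symbol_code_point` are hypothetical, and the two ranges are an illustrative subset covering whole Unicode blocks that contain only symbol characters. A real table would have to be generated from the Unicode data for the relevant categories.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. */
typedef struct {
  uint32_t first;
  uint32_t last;
} CodePointRange;

/* Illustrative subset only; must stay sorted and non-overlapping.
 * A complete table would be generated from Unicode category data. */
static const CodePointRange symbol_ranges[] = {
  {0x2190, 0x22FF}, /* Arrows and Mathematical Operators blocks */
  {0x25A0, 0x25FF}, /* Geometric Shapes block, includes △ and ▽ */
};

#define SYMBOL_RANGE_COUNT (sizeof symbol_ranges / sizeof symbol_ranges[0])

/* The binary search below relies on the table being sorted and
 * non-overlapping; this check verifies that in debug builds. */
static void check_symbol_ranges(void) {
  for (size_t i = 0; i < SYMBOL_RANGE_COUNT; i++) {
    assert(symbol_ranges[i].first <= symbol_ranges[i].last);
    if (i > 0) {
      assert(symbol_ranges[i - 1].last < symbol_ranges[i].first);
    }
  }
}

/* Returns true if code_point falls inside one of the ranges. */
static bool is_symbol_code_point(uint32_t code_point) {
  size_t lo = 0;
  size_t hi = SYMBOL_RANGE_COUNT;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (code_point < symbol_ranges[mid].first) {
      hi = mid;
    } else if (code_point > symbol_ranges[mid].last) {
      lo = mid + 1;
    } else {
      return true;
    }
  }
  return false;
}
```

In an external scanner the code point to test would come from the lexer's `lookahead` field. Whether this beats the switch is an open question, since the compiler may already turn a large switch into something similar.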

maralorn commented 1 year ago

I am not sure what the rules here are, but would it be terrible to over-approximate? (I also don't know if it would simplify things.) I would assume that allowing a larger class of Unicode symbols, one that is maybe easier to check, would be unlikely to misparse valid Haskell.

tek commented 1 year ago

possibly, but I'm absolutely uncertain. 6k code points in a range of 130k seems quite disproportionate, and they are spaced out pretty wide. We could try > N for some value and test all smaller ones explicitly. But since performance doesn't take a significant hit, we could also just put the 6k cases in a separate file in a switch and be done with it 🙃

maralorn commented 1 year ago

Your call. I would also wonder a bit how much bigger the grammar would become …

tek commented 1 year ago

the haskell.so grows by 10kB. (total 3.6MB)

tek commented 1 year ago

the arrow notation operators appear not to be within the categories used for the PR we just merged. also unsure about those banana brackets, they would probably need special treatment.