grammar doesn't handle CR/LF line endings

gregnis commented 1 month ago

I'm not sure why but I cannot get a valid tree parsing the sample INI file you provided:

[section name]
some_key = some_value
another-key = another value

[another section]
# a comment
some_key = some_value
another-key = another value

I'm using the latest code for the tree-sitter library from https://github.com/tree-sitter/tree-sitter. The code looks like this:

TSParser* parser = ts_parser_new();
ts_parser_set_language(parser, tree_sitter_ini());
TSTree* tree = ts_parser_parse_string_encoding(parser, ...

This works for other languages (cpp, csharp, etc.), but when parsing the example above I get tons of errors, which I can see using the ts_tree_print_dot_graph function. It's a long output; here's the top:

digraph tree {
edge [arrowhead=none]
tree_0175EEC8 [label="document", tooltip="range: 0 - 308 state: 65535 error-cost: 4784 has-changes: 0 depends-on-column: 0 descendant-count: 39 repeat-depth: 0 lookahead-bytes: 1"]
tree_0180A748 [label="ERROR", fontcolor=gray, tooltip="range: 0 - 138 state: 0 error-cost: 2428 has-changes: 0 depends-on-column: 0 descendant-count: 17 repeat-depth: 0 lookahead-bytes: 3"]
tree_01807308 [label="[", shape=plaintext, tooltip="range: 0 - 2 state: 1 error-cost: 0 has-changes: 0 depends-on-column: 0 descendant-count: 0 repeat-depth: 0 lookahead-bytes: 1"]
tree_0180A748 -> tree_01807308 [tooltip=0]
tree_01807310 [label="text", shape=plaintext, tooltip="range: 2 - 26 state: 19 error-cost: 0 has-changes: 0 depends-on-column: 0 descendant-count: 0 repeat-depth: 0 lookahead-bytes: 1"]

I attached the whole output. Do you know what's going on and how to make it work?

Thanks!

tree_graph.txt

gregnis commented 1 month ago

Let me update this. I was able to use the CLI to parse this simple INI file. Here's what I got:

(document [0, 0] - [8, 0]
  (ERROR [0, 0] - [3, 1]
    (text [0, 1] - [0, 13])
    (ERROR [0, 14] - [0, 15])
    (setting_name [1, 0] - [1, 8])
    (setting_name [1, 11] - [1, 21])
    (ERROR [1, 21] - [1, 22])
    (setting_name [2, 0] - [2, 11])
    (setting_name [2, 14] - [2, 21])
    (setting_name [2, 22] - [2, 27])
    (ERROR [2, 27] - [2, 28])
    (ERROR [3, 0] - [3, 1]))
  (ERROR [4, 0] - [7, 28]
    (text [4, 1] - [4, 16])
    (ERROR [4, 17] - [4, 18])
    (comment [5, 0] - [6, 0]
      (text [5, 1] - [5, 12]))
    (setting_name [6, 0] - [6, 8])
    (setting_name [6, 11] - [6, 21])
    (ERROR [6, 21] - [6, 22])
    (setting_name [7, 0] - [7, 11])
    (setting_name [7, 14] - [7, 21])
    (setting_name [7, 22] - [7, 27])
    (ERROR [7, 27] - [7, 28])))

As you can see, there are ERRORs instead of sections and keys.

gregnis commented 1 month ago

Here's a mystery for me: when I parse the same file in my local playground, I get:

document [0, 0] - [9, 0]
  section [0, 0] - [3, 0]
    section_name [0, 0] - [1, 0]
      text [0, 1] - [0, 13]
    setting [1, 0] - [2, 0]
      setting_name [1, 0] - [1, 8]
      setting_value [1, 10] - [1, 21]
    setting [2, 0] - [3, 0]
      setting_name [2, 0] - [2, 11]
      setting_value [2, 13] - [2, 27]
  section [4, 0] - [8, 0]
    section_name [4, 0] - [5, 0]
      text [4, 1] - [4, 16]
    comment [5, 0] - [6, 0]
      text [5, 1] - [5, 11]
    setting [6, 0] - [7, 0]
      setting_name [6, 0] - [6, 8]
      setting_value [6, 10] - [6, 21]
    setting [7, 0] - [8, 0]
      setting_name [7, 0] - [7, 11]
      setting_value [7, 13] - [7, 27]

That's perfectly fine. So why can't I get the same result using my

TSParser* parser = ts_parser_new();
ts_parser_set_language(parser, tree_sitter_ini());
TSTree* tree = ts_parser_parse_string_encoding(parser, ...

or using the tree-sitter parse command?

gregnis commented 1 month ago

I think I know what's going on. The grammar doesn't handle CR/LF line endings, only LF. That's why the playground works but the parse command does not, for files with Windows line endings.

I changed the grammar to the one below, and it appears to work on both types of files.

module.exports = grammar({
  name: 'ini',

  extras: $ => [
    $.comment,
    $._blankLF,
    $._blankCRLF,
    /[\t ]/
  ],

  rules: {
    document: $ => seq(
      repeat($._blankLF),  // Eat blank lines at top of file.
      repeat($._blankCRLF),  // Eat blank lines at top of file.
      repeat($.section),
    ),

    // Section has:
    // - a title
    // - zero or more settings (name=value pairs)
    section: $ => prec.left(seq(
      $.section_name,
      repeat(seq(
        $.setting,
      )),
    )),

    section_name: $ => seq(
      '[',
      alias(/[^\[\]\r\n]+/, $.text),  // '?' is literal inside a character class, so it is dropped here
      ']',
      choice('\n','\r\n'),
    ),

    setting: $ => seq(
      alias(/[^;#=\s\[]+/, $.setting_name),
      '=',
      alias(/[^\r\n]+/, $.setting_value),  // exclude '\r' so it does not end up in the value
      choice('\n','\r\n'),
    ),

    // setting_name: () => /[^#=\s\[]+/,
    // setting_value: () => /[^#\n]+/,
    // Exclude '\r' from the comment text so that optional('\r') can match it.
    comment: $ => seq(/[;#]/, alias(/[^\r\n]*/, $.text), optional('\r'), '\n'),

    _blankLF: () => field('blank', '\n'),
    _blankCRLF: () => field('blank', '\r\n'),
  }
});

I'm sure it's naïve (my first attempt at changing a grammar) but it seems to work for me. Please let me know if there is a better way.
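
For anyone who cannot change the grammar, a caller-side workaround is to convert CR/LF pairs to LF before handing the buffer to the parser. The sketch below is only an illustration under that assumption: normalize_crlf is a hypothetical helper, not part of the tree-sitter API, and it assumes a single-byte-per-unit encoding such as UTF-8 (not UTF-16).

```c
#include <stddef.h>

/* Hypothetical helper (not part of the tree-sitter API).
 * Rewrites every "\r\n" pair in buf to "\n" in place and returns the
 * new length in bytes. A lone '\r' not followed by '\n' is kept.
 * Assumes a byte-oriented encoding such as UTF-8. */
static size_t normalize_crlf(char *buf, size_t len) {
  size_t out = 0;
  for (size_t in = 0; in < len; in++) {
    if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n') {
      continue; /* drop the CR; the LF is copied on the next iteration */
    }
    buf[out++] = buf[in];
  }
  return out;
}
```

The returned length is what should be passed to the parse call in place of the original length. Note that node byte offsets in the resulting tree then refer to the normalized buffer, not to the file on disk, so this is a stopgap rather than a replacement for fixing the grammar.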

justinmk commented 1 month ago

The grammar doesn't handle CR/LF line endings, only LF.

Yeah that's probably the case. Can you send a PR (with a test)?

gregnis commented 1 month ago

I don't have a setup to do this; perhaps you can use the code I provided to create a PR (assuming it's good).

justinmk / tree-sitter-ini

grammar doesn't handle CR/LF line endings #10