lark-parser / Lark.js

Live port of Lark's standalone parser to Javascript
MIT License

Incorrect column info for unexpected token exception #27

Closed jillyj closed 2 years ago

jillyj commented 2 years ago

Grammar file: https://github.com/opencybersecurityalliance/kestrel-lang/blob/release/src/kestrel/syntax/kestrel.lark
Generated parser: kestrelParser.js.zip

When parsing the statement var=get, the parser throws an UnexpectedToken exception with

e.line = 1
e.column = 5

However, the column should be 7.

[image]

The column info is also incorrect for the following test strings:

var=get file: e.column is 7, but should be 12.
var=get file from: e.column is 14, but should be 17.
var=get file from abc: e.column is 19, but should be 21.

jillyj commented 2 years ago

@erezsh would you please take a look at this issue? Thanks a lot!

erezsh commented 2 years ago

I tried the first example you gave, and I got

  token: Token {
    type: '$END',
    start_pos: 8,
    value: '',
    line: 1,
    column: 9,
    end_line: 1,
    end_column: 13,
    end_pos: 12
  },

This is the same answer you get from the Python version.

It's not the end of the file (you can find that easily on your own), but the last valid position the parser was able to reach.

We can argue if that's the right thing to return or not, but it seems like everything is working in order.

(I don't know why you got 7. Make sure you're using the latest commit)

jillyj commented 2 years ago

The version I use is 0.1.3. This is what I got for the statement var=get.

[image]

jillyj commented 2 years ago

I also tried installing lark-js again from the repo using the command pip3 install -e git+https://github.com/lark-parser/Lark.js.git#egg=lark-js, and the result is the same.

erezsh commented 2 years ago

Can you post a reproducing script? (a js file that, when run, reproduces the error. Plus the grammar file ofc)

jillyj commented 2 years ago

Sure. The grammar file and the generated parser JS file are attached in the description of this issue.

My parsing code looks like this:

const kestrel_parser = require('./parser/kestrelParser');
const {get_parser, UnexpectedCharacters, UnexpectedToken} = kestrel_parser;
const parser = get_parser({keep_all_tokens: true});
function App() {
  let treeData = null;
  let errorMsg = '';
  function handle_errors(e) {
    console.debug(e.line, e.column)
    if (e instanceof UnexpectedCharacters) {
      if (errorMsg.length === 0) errorMsg = `Unexpected characters "${e.char}" at position ${e.column}`;
    } else if (e instanceof UnexpectedToken) {
      // print the 1st encountered error
      if (errorMsg.length === 0) errorMsg = `Unexpected token "${e.token.value}" at ${e.token.type} position ${e.column}, expected ${[...e.expected].join(',')}`;
    } else if (e instanceof SyntaxError) {
      console.debug(e)
    } else {
      console.debug("unknown error:", e.constructor.name)
    }
    // return true to keep parsing
    return true;
  }

  try {
    treeData = parser.parse("var=get", null, handle_errors).children[0];
  } catch (e) {
    console.debug("uncaught error:", e)
  }
}

erezsh commented 2 years ago

I don't see the problem?

For var=get file it's 9

For var=get it's 5

Everything seems in order

jillyj commented 2 years ago

Okay, so the column means the token "start" position? Hm, then what I need is end_pos. Thanks.

erezsh commented 2 years ago

Yes, it's the start of the last valid position, which in this case is the start of the token that caused the error.

(to the best of my memory)
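
The distinction above (column marks where the offending token starts, while the expected values in the report point at where it ends) can be sketched as follows. This is only an illustration: lastColumnOf is a hypothetical helper, not part of Lark.js, and the token object is hand-built using the field names from the token dump earlier in the thread. It assumes end_column points one past the token's last character, as that dump suggests.

```javascript
// Hypothetical helper, not part of Lark.js: derive the 1-based column of the
// last character of a token from the position fields shown in the thread.
function lastColumnOf(token) {
  // Assumption: end_column is exclusive (one past the last character).
  if (typeof token.end_column === 'number') {
    return Math.max(token.column, token.end_column - 1);
  }
  // Fallback for single-line tokens when end_column is unavailable.
  return token.column + Math.max(token.value.length, 1) - 1;
}

// Hand-built token for `get` in "var=get"; the values are assumed for
// illustration and were not produced by running the parser.
const getToken = {
  value: 'get',
  line: 1,
  column: 5,      // start of the token, which is what e.column reports
  end_column: 8,
  start_pos: 4,
  end_pos: 7,
};

const startColumn = getToken.column;      // where the offending token begins
const endColumn = lastColumnOf(getToken); // where it ends
console.log(startColumn, endColumn);      // prints: 5 7
```

Inside an error handler like the one posted above, the same helper could be applied to e.token to report the end of the token instead of its start.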