goccmack / gocc

Parser / Scanner Generator
Other
622 stars 48 forks source link

Accessing source code line number from parsed AST nodes #14

Closed edwardmp closed 8 years ago

edwardmp commented 8 years ago

Hi,

Thank you for publishing this tool. I am using it for a university assignment. This is not an issue, but more of question/request.

I want to access source code information (line number, etc) from the AST nodes somehow. I want to display informative error messages when my later semantic analysis fails.

Is it possible to do this? Thanks in advance.

awalterschulze commented 8 years ago

Did you find the information in the token?

edwardmp commented 8 years ago

Yes, thank you it was already answered here: https://github.com/goccmack/gocc/issues/9

edwardmp commented 8 years ago

As a matter of fact, do have a question regarding this. I have the following file:

form taxOfficeExample { 
  "Did you sell a house in 2010?"
    hasSoldHouse: boolean
  "Did you buy a house in 2010?"
    hasBoughtHouse: boolean
  "Did you enter a loan?"
    hasMaintLoan: boolean

  if (hasSoldHouse) {
    "What was the selling price?"
      sellingPrice: integer
    "Private debts for the sold house:"
      privateDebt: integer
    "Value residue:" valueResidue: integer = (sellingPrice - privateDebt)
  }
}

Yet, for example "sellingPrice" is reported as being on line 25, while this file has only 16 lines (?)

awalterschulze commented 8 years ago

Yes the line numbers in gocc is a known issue, which we want a fix for.

edwardmp commented 8 years ago

@awalterschulze thanks. Are you planning on fixing this in the near short term (within 1 week)? Because I use it for a university assignment and if the position info is incorrect I might as well not include it.

awalterschulze commented 8 years ago

We want a fix, but we are both quite busy at the moment, so I wouldn't expect a fix in the next week.

edwardmp commented 8 years ago

@awalterschulze OK. Any clue in which files you expect the bug to be in? Maybe I can try to fix it myself.

awalterschulze commented 8 years ago

I am pretty sure, but not certain, that the lexer continues to count line numbers ahead of where possible errors can occur. I would start looking here https://github.com/goccmack/gocc/blob/master/lexer/gen/golang/lexer.go And focus on the line int field of the lexer struct in the lexerSrc string.

mewmew commented 8 years ago

As noted by Walter, we have two options for resolving this issue. Either fix the lexer so that it reports accurate line:col pairs, or recompute them from the original input using the byte offset of tokens.

The offset seems be the only accurate number. So we could probably use that to calculate the line and column by running through the input again. This is if newError has access to the original input buffer?

We may take inspiration from go/token.Pos which stores positional information in a single integer (thus using less memory per token), while still making it possible to translate from position to line:col pairs, using go/token.File.Position. Note that the Go lexer and parser uses a notion of file sets, which may, or may not be appropriate to add to Gocc, if this is the approach which after discussions seems most reasonable for resolving the original issue.

awalterschulze commented 8 years ago

Yes that all sounds great :)

I have never thought about filesets before, but maybe that is not as important right now?

mewmew commented 8 years ago

I have never thought about filesets before, but maybe that is not as important right now?

I agree, file sets would be an unnecessary abstraction at this point. It's better to focus our attention on fixing the inaccurate file:col rather than introducing further complexities which may bring about other sources of inaccuracy. We can always revisit this decision in the future.

edwardmp commented 8 years ago

OK, I think I've fixed it.

Diagnosis: I noticed that after newline is encountered, the position would be x, but the next character (after the newline) would start again at character x as well and not x+1.

This in turn occurs because this.pos gets set to the value of end here which is x - 1, which in the next call to Scan() would be incremented by the rune size (usually 1) and would equal x again:

if end > start {
    this.pos = end

This in turn led to this condition not being evaluated to true when it should: this.pos >= len(this.src)

To fix this, I set end = this.pos in the case block for newline:

case '\n':
    this.line++
    this.column = 1
    end = this.pos

I will submit a pull request in a few minutes!

sangisos commented 8 years ago

@edwardmp I seem to be getting an error where the new line character have become part of the last token on a line:

2016/09/23 14:12:47 path: ../testdata/minimal/a.ll
--- FAIL: TestParser (0.00s)
parser_test.go:237: "../testdata/minimal/a.ll": error parsing file; github.com/llir/llvm/asm/internal/parser_test.TestParser (parser_test.go:237): error: Error in S201: }(49,}
), Pos(offset=33, line=4, column=1)github.com/llir/llvm/asm/internal/irx.NewRetInst (irx.go:787): error: unable to parse integer constant "42\n"
FAIL
exit status 1
FAIL github.com/llir/llvm/asm/internal/parser 0.015s

parsing the minimal test case a.ll

the new line character is being lexed as part of the last token: "42\n". Also the line number indicated is still wrong (4 instead of 2 in a 3 line file).

Did this work for you? I'll have to look through both my code and the gocc lexer code.

EDIT: sorry, just saw the comments on the pull request 19. I'll try to investigate.

sangisos commented 8 years ago

Sorry, Please ignore my first commit, I forgot to remove extra stuff I tested.

Two new commits that hopefully fixes the line:column bug are https://github.com/sangisos/gocc/commit/e3f85c18a03d9da89bb5dc2209b3463e6df1a612 https://github.com/sangisos/gocc/commit/e6354caaf6581806c90f52cba14f7f7e8e2c9bd4 @mewmew Could you verify this fix before I make a pull request?

Also, I wanted to remove, fix and clean up some of the debug/commented out print outs in the file lexer/gen/golang/lexer.go, for example the broken lines 198 and 201, but didn't dare to.

I also have a problem with tabs being hard coded as 4 columns instead of 1, Also, though I can't find any sources for this, it seems terminal emulators tend to use tabspace 8 instead of 4. This means that if you try to use the column number to point at or underline some text, you most likely will miss. That is a bad way to do it anyway, as you can use tab stops in VT100. Anyhow, both gcc and clang uses 1 column for each tab, which seems more predictable.

mewmew commented 8 years ago

@mewmew Could you verify this fix before I make a pull request?

Hej! The week will be a busy one, but I may get time this weekend. Glad to see you working on this :) Will be great with accurate line tracking information.

Edit: The weekend has come, but no time for hacking in sight. @sangisos I'll let you know, once I've had a chance to take a look, hopefully it won't take too long.

mewmew commented 8 years ago

@sangisos, I looked through https://github.com/sangisos/gocc/commit/e3f85c18a03d9da89bb5dc2209b3463e6df1a612 the other day, and it looks good. Basically, it only updates the line, column position if the active state is not NoState.

I also tried to look through https://github.com/sangisos/gocc/commit/e6354caaf6581806c90f52cba14f7f7e8e2c9bd4 but lacking context, I failed to understand exactly why startLine and startColumn needed to be saved and restored. Why was the original code broken? Which part updated this.line and this.column needlessly? The thing I can see is that the new code now mirrors the behaviour of storing start from this.pos, thus making it more consistent.

I'd say the debug output is fine.

Also, I wanted to remove, fix and clean up some of the debug/commented out print outs in the file lexer/gen/golang/lexer.go, for example the broken lines 198 and 201, but didn't dare to.

I tried to understand what the specific issue was at line 198 and 201, but couldn't quite figure it out. Do you mean they are broken as in, they don't compile, or how are they broken?

I also have a problem with tabs being hard coded as 4 columns instead of 1

Tab size is always an issue, as users can change it to anything they desire. For what it's worth, the default tab size of gofmt is 8.

After merging commit https://github.com/sangisos/gocc/commit/e3f85c18a03d9da89bb5dc2209b3463e6df1a612 and https://github.com/sangisos/gocc/commit/e6354caaf6581806c90f52cba14f7f7e8e2c9bd4, lines and columns are correctly accounted for as far as I can tell. Have been running a few test cases using uc.

So @sangisos, LGTM :) Create a pull request.

awalterschulze commented 8 years ago

Is there a way we could add a small test, maybe using one of the already existing example grammars? The test can as simple as possible. It just has to breaking with the current code and be fixed with the pull request.

sangisos commented 8 years ago

@mewmew Without startLine and startColumn the line and column reported for the token would be the end of the token, as line and column is continually updated as the token is read.

@goccmack This fix will render at least the internal use of the new GetPosition function in https://github.com/goccmack/gocc/commit/a149e4afad13338a1a1f10db3ce385bb02b1ee62 obsolete.

Also in commit https://github.com/goccmack/gocc/commit/b1f832899b91977f13aeea0612d186f3a897e2df, the handling off tabs are now variable, Defaulting to 4.

@mewmew The lines mentioned are broken, as in will not compile, if uncommented. The used variables, types or functions are no longer available. EDIT: sorry, I had linked to master, so the lines changed with the latest commits, the links are now fixed.

@awalterschulze I will learn how to rebase I guess, and try to add a test case during the week, when I get access to my computer.

mewmew commented 8 years ago

@mewmew Without startLine and startColumn the line and column reported for the token would be the end of the token, as line and column is continually updated as the token is read.

Ah, that makes sense. Thanks for clarifying.

@mewmew The lines mentioned are broken, as in will not compile, if uncommented. The used variables, types or functions are no longer available. EDIT: sorry, I had linked to master, so the lines changed with the latest commits, the links are now fixed.

Ah, I see. I was staring at the line } and trying to figure out what you saw that I did not.

for example the broken lines 198 and 201, but didn't dare to.

I'd say go for it!

goccmack commented 8 years ago

@sangisos I am happy if my code is obsoleted by an improvement. Thank you.

awalterschulze commented 8 years ago

I am really looking forward to this :)

sangisos commented 8 years ago

@awalterschulze sorry, I've had trouble with how to do this and how much of @goccmack's code I should remove.

As I have read, a rebase after a repo is pushed online is bad form, is a merge okay? Or can I do a pull request from the old stuff that is up now and you could try to do a "rebase and merge" in github?

Should I use DEFAULT_TABSIZE in the original code at L145, and remove most of the code @goccmack added or are there more uses for @goccmack's added functionality?

And as previously mentioned, not sure about the whole case '\t':. I think it could be removed all together, as it will mostly cause confusion if set to anything else then 1. Try echo -e "\ttest" and echo -e "␣␣␣\ttest" in a console. Thanks to tabstops, test will be printed after 1 tabstop (column 8 usually) in either case. In a source, gocc will report something indented like test as starting in col 5 and 8 respectively, while gcc and clang will report it at 2 and 5 respectively, even though their graphic representation differs from each other. I therefore ask to at least set DEFAULT_TABSIZE to 1 for consistency.

Sorry, I'm not very skilled at git or open source or any kind of production development. If someone feels they have an easy way to incorporate my changes I'd say go for it, I do not need the credit =), but I'll try to fix it.

mewmew commented 8 years ago

As I have read, a rebase after a repo is pushed online is bad form, is a merge okay? Or can I do a pull request from the old stuff that is up now and you could try to do a "rebase and merge" in github?

GitHub now has support for squishing commits when merging, which may alleviate the need to rebase https://github.com/blog/2141-squash-your-commits.

Someone else probably has more insight into if DEFAULT_TABSIZE is to be used.

Sorry, I'm not very skilled at git or open source or any kind of production development. If someone feels they have an easy way to incorporate my changes I'd say go for it, I do not need the credit =), but I'll try to fix it.

Create a pull request from sangisos/gocc and the changes may be squished and merged.

mewmew commented 8 years ago

@sangisos There is a first for anything in life really, glad to see you getting more involved with open source projects :)

awalterschulze commented 8 years ago

Yes just create a pull request, no need to learn about rebasing and such. Although its useful, its not applicable in this case. The github web ui can do all the hard work for you :)

@goccmack will have to answer about DEFAULT_TABSIZE

goccmack commented 8 years ago

@sangisos I made a quick fix to the error reporting problem while using gocc to generate a compiler for an unrelated project. Please feel free to remove/replace any/all of that code that you think necessary to implement a proper fix.

sangisos commented 8 years ago

Okay, I will then revert most of those commits, and change DEFAULT_TABSIZE to defaultTabsize and set it to 1.

As I was trying to add a test case, I stumbled upon another question. Should all the generated code in test/t1 really be checked in? Isn't it better to have it generated before the test every time? Because of older changes to gocc, most files in test/t1 are changed when regenerating the test/t1 with gocc.

The same is also true for all examples, their generated parts should probably be regenerated or removed.

mewmew commented 8 years ago

@sangisos May we close this issue now, since #30 has been merged?

sangisos commented 8 years ago

@mewmew The specific question in the issue title was actually answered in the first comment, so you can close this i would say.