node.endIndex incorrect?

sirthias / pegdown

A pure-Java Markdown processor based on a parboiled PEG parser supporting a number of extensions

http://pegdown.org

Apache License 2.0


node.endIndex incorrect? #199

Closed radicaled closed 8 years ago

radicaled commented 8 years ago

For the input string #1234567, pegdown generates an AST that tells me that this HeaderNode starts at index 0 and ends at index 9. This dump is taken directly from the debugger:

result = {HeaderNode@3837} "HeaderNode [0-9]"
 level = 1
 isSetext = false
 children = {ArrayList@3844}  size = 1
  0 = {TextNode@3848} "TextNode [1-8] '1234567'"
   sb = {StringBuilder@3850} "1234567"
   startIndex = 1
   endIndex = 8
 startIndex = 0
 endIndex = 9

The actual length of #1234567 is 8 characters, of course, so I would expect the header node's startIndex to be 0 and its endIndex to be 7, not 9.

I'm using pegdown 1.6.0

vsch commented 8 years ago

endIndex is one character past the last character of the range, i.e. the range is [startIndex, endIndex): start is included, end is not.

TextNode range is [1,8) which is 7 characters.

The header node includes the \n at the end of the line. Hence its text range is [0,9).
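To make the half-open convention concrete, here is a minimal sketch in plain Java. It does not call pegdown; the class and method names (`HalfOpenRange`, `length`, `slice`) are hypothetical helpers, and the text `"#1234567\n"` is assumed to be what the header node covers, based on the ranges in the dump above.

```java
// Hypothetical helper, not part of pegdown: illustrates [start, end) ranges.
final class HalfOpenRange {
    private HalfOpenRange() {}

    // Number of characters in the half-open range [start, end).
    static int length(int start, int end) {
        return end - start;
    }

    // Text covered by [start, end). String.substring uses the same
    // convention: the start index is inclusive, the end index is exclusive.
    static String slice(String text, int start, int end) {
        return text.substring(start, end);
    }
}
```

With `text = "#1234567\n"`, `HalfOpenRange.slice(text, 1, 8)` gives the 7 characters of the TextNode, and `HalfOpenRange.slice(text, 0, 9)` gives the whole header line including the trailing newline.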

radicaled commented 8 years ago

> endIndex is one character past the last character of the range, i.e. the range is [startIndex, endIndex): start is included, end is not.

Thanks.

> The header node includes the \n at the end of the line. Hence its text range is [0,9).

So, my sample string doesn't actually have a \n in it -- does this mean that \n is assumed to be present by the parser? In other words, I should account for the parser assuming an extra character at the end of every string input?

vsch commented 8 years ago

If you look at the code of the pegdown processor, you will see that it appends \n\n to all input text. Otherwise the parser will not recognize the blocks, since they must end with an end of line and be followed by a blank line. Adding \n\n guarantees that the last block will be properly recognized.

So if it is critical that end indices never fall outside your original string, make sure you truncate all endIndex values returned by the AST to the length of the original string you passed to pegdown.

I agree, it is a pain, but I do it in my plugin because IDEA really does not like indices being outside the text range. Modifying the parser seemed like it might break backward compatibility, so I did not venture there.
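The truncation described above could be sketched roughly as follows. This is an illustration only, not code from pegdown or from the plugin: `IndexClamp` and `clamp` are hypothetical names, and the sketch assumes you already have the raw index value reported by a node (for example from its endIndex) and the original string you passed in.

```java
// Hypothetical helper, not part of pegdown: keeps AST indices inside the
// bounds of the original input, before the parser's extra "\n\n".
final class IndexClamp {
    private IndexClamp() {}

    // Clamp an index reported by the AST to [0, originalLength].
    // originalLength is the length of the string the caller supplied.
    static int clamp(int index, int originalLength) {
        return Math.max(0, Math.min(index, originalLength));
    }
}
```

For the example in this issue, `IndexClamp.clamp(9, "#1234567".length())` returns 8, so the header range becomes [0,8), which covers exactly the original string. Indices already inside the string, such as the TextNode's 1 and 8, are unchanged.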

radicaled commented 8 years ago

> If you look at the code of the pegdown processor, you will see that it appends \n\n to all input text.

Ahh, OK, OK. I was only looking at my source string, assuming the input wouldn't be changed once it got into the processor...

> So if it is critical that end indices never fall outside your original string, make sure you truncate all endIndex values returned by the AST to the length of the original string you passed to pegdown.

Thanks. This will help me avoid some bugs in the future.