sirthias / pegdown

A pure-Java Markdown processor based on a parboiled PEG parser supporting a number of extensions
http://pegdown.org
Apache License 2.0
1.29k stars 218 forks source link

White spaces in links reference are not supported #216

Open tmortagne opened 8 years ago

tmortagne commented 8 years ago

We have links reference that can contain white spaces but the following

[labal](Space.Space - NiceName)

is not even interpreted as a link.

vsch commented 8 years ago

@tmortagne, it isn't a link. A link would be [labal](Space.Space - NiceName)

The parentheses and brackets should be reversed.

Reference links are only interpreted as such if the reference link resolves to a real reference. Otherwise, it is treated as a "fake" reference link and is output as is during HTML serialization.

Can you share a sample file of your markdown that you think is not being interpreted correctly so I can take a look at it? Maybe there is a bug in the parsing.

tmortagne commented 8 years ago

Yes I meant

[labal](Space.Space - NiceName)

it was a bad copy/paste.

Trefex commented 8 years ago

@vsch This was my fault. I reported the link wrongly. The syntax has been updated by @tmortagne in the meantime.

With the correct syntax, it doesn't work :)

tmortagne commented 8 years ago

Reference links are only interpreted as such if the reference link resolves to a real reference. Otherwise, it is treated as a "fake" reference link and is output as is during HTML serialization.

We don't use HTML serialization, we use pegdown as parser only and have our own reference sytax which does not seems to collide with link syntax in this case so I was expecting pegdown to just extract it and give it.

vsch commented 8 years ago

@tmortagne , that too is correct because the link should provide a URL and spaces are not part of acceptable URL characters. You need to URL encode your link, spaces are either + or %20 the complete link syntax for markdown is:

[label](url_address "title")

as you can see after the space pegdown expects to see a ) or a double quoted string. Anything else it interprets as not a link.

vsch commented 8 years ago

BTW, you can use spaces for links but only in wiki links [[wiki link]] which will link to a file named wiki-link.html, but only if you have the WIKILINKS extension enabled.

tmortagne commented 8 years ago

the link should provide a URL

Ok the part I was missing is the fact that the reference is URL syntax so indeed it's a bug on our side and we should pass the reference in a URL decoder before dealing with it then. Thanks for the explanations.

tmortagne commented 8 years ago

BTW, you can use spaces for links but only in wiki links [[wiki link]] which will link to a file named wiki-link.html, but only if you have the WIKILINKS extension enabled.

hmm actually I did not know there was several kinds of links, but is there a way to define a label with wiki links ?

vsch commented 8 years ago

@tmortagne, I fixed an issue with link parsing in my fork: https://github.com/vsch/pegdown in the develop branch that deals with not using the serializer but working with the AST directly.

The current implementation makes it impossible to tell the difference between [reference] and [reference][] with a dummy reference because it sets node.referenceKey to null for the RefLinkNode and RefImageNode in both cases. I modified the parser to use a non-null dummy reference node for the latter syntax but this requires that you test node.referenceKey != null && node.referenceKey.getChildren() != 0 to detect if the node.getChildren() represents a label for the ref link or text and the reference is in the node.referenceKey.

I haven't had the chance to prepare and post a PR to have these changes merged into the next pegdown release. I don't know if that will help your use case, but thought I'd mention it.

I too use the AST directly to generate a parse tree for my Intellij plugin idea-multimarkdown and pegdown has a few limitations in that department that I am fixing as I find them.

vsch commented 8 years ago

@tmortagne, yes: [[Wiki Link|Wiki Label]] the order of the two is implemented in the LinkRenderer. The parser does not care and just gives you the text between [[ and ]]. I use a modified LinkRenderer to apply GitHub Wiki Link syntax which is the reverse of the standard 'Creole' syntax pegdown uses: [[Wiki Label|Wiki Link]]

If you have the flexibility wiki links are probably the option for you since it doesn't care about the contents between delimiters.

vsch commented 8 years ago

@tmortagne, if you want wiki links to be recognized just remember to pass Extensions.WIKILINKS as an option to the parser, otherwise it will ignore the wiki links in the input.

tmortagne commented 8 years ago

Actually we do support wiki links but I'm not the one who worked on that so I'm discovering it. But looks like @Trefex need for label is not covered by this syntax.

tmortagne commented 8 years ago

The parser does not care and just gives you the text between [[ and ]].

It's not exactly what I would call label support if I do the parsing myself but Ok I see what you mean :) At least the standard syntax support it so I should create an issue about properly support wiki link label syntax on our side.

vsch commented 8 years ago

@tmortagne, I agree but pegdown is geared more towards html generation and takes liberties in the parser that are compensated in the ToHtmlSerializer.

You really need to look at the processing done for each node type in the ToHtmlSerializer() to make sure you handle all the gotchas in the parser. There are quite a few that I tripped up on, so the suggestion is a voice of experience.

vsch commented 8 years ago

@tmortagne in my plugin that has to build a parse tree for IntelliJ I go through hoops and wind up doing secondary lexing of the node text to handle its limitations. I pain, I must admit. Still a lot less work than rewriting and testing the parser from scratch. I just fix the bugs and add a few extensions here and there to deal with the edge cases.

tmortagne commented 8 years ago

Still a lot less work than rewriting and testing the parser from scratch

I'm not considering getting rid of pegdown, it's still great so far :)

tmortagne commented 8 years ago

I guess this issue is more or less a duplicate of #92.

vsch commented 8 years ago

@tmortagne, when I added wiki link label support I did not modify the parser to avoid backward compatibility but it would be easy to add to the parser so that Wiki link nodes will have a label and address fields. This would also allow adding an extension flag to select between Creole and GitHub syntax differences that currently is not possible because LinkRenderer does get the parser options.

It is an interesting option.

vsch commented 8 years ago

@tmortagne, I think that that issue was addressed with the latest release since labels are now handled and properly generated in HTML if you want Creole syntax. However, if you want it in the AST that part needs pegdown parser modification or a parser plugin.

tmortagne commented 8 years ago

Ok #92 might be seen from HTML rendering point of view. But that would certainly be nice in the AST :)

vsch commented 8 years ago

@tmortagne, I just assume that most people must be using it to generate HTML. Otherwise they would have encountered and reported the bugs in the AST.

vsch commented 7 years ago

@tmortagne, for my project the effort of maintaining and supporting pegdown related issues became unbearable.

I wound up rewriting commonmark-java to replace pegdown in my Markdown Navigator plugin for IntelliJ IDEs: https://github.com/vsch/idea-multimarkdown.

The parser project is https://github.com/vsch/flexmark-java has very detailed source based AST with source offset for every part of the element. I need that for syntax highlighting and other plugin source reliant features.

The AST is very regular, part of all the tests not just ones geared for AST testing and source offsets are free of idiosyncrasies and easily adjusted. It is fully modifiable unlike pegdown's with next, prev and parent links that you can walk, unlink and move around if needed. There are many ways to extend the parser including turning off core parsing features. I geared it for extensibility and made sure handling of AST nodes is uniform whether it is part of the core or part of an extension. So extensions can and do add their own nodes to the AST.

It is CommonMark 0.27 (GitHub Comments) compliant but has parser configuration options to emulate list indentation rules used by: markdown.pl, MultiMarkdown (like pegdown 4 space indents) and kramdown (GitHub Docs). The only extensions that pegdown has that I did not yet implement are: typographic quotes, smarts and definition lists. The rest of the extensions are available, with some extra ones that pegdown does not have.

As an added bonus and what motivated me to write my own parser: the parsing is 30-50x faster than pegdown on average documents and several thousand times faster on pegdown's pathological input like [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[.

Maintaining it is a walk in the park compared to what I had to do with pegdown. The tests are in a modified markdown format, actually CommonMark test spec format extended to include the AST. The AST generation is tested on every test not just some chosen few like it was on pegdown. I literally lost weeks debugging AST source offset errors in pegdown for cases that were not part of its AST tests.

I also have code to convert from pegdown Extensions flags to my uniform configuration API that I used in the initial migration from pegdown to flexmark-java. So if you are ever on the lookout for a pegdown replacement let me know and I will give you a hand in your migration trial.

tmortagne commented 7 years ago

@vsch this does look very interesting indeed. Markdown is not in my TODO list these days but I created http://jira.xwiki.org/browse/MARKDOWN-16 to remember about it if someone want to looks at it in more details