ANCHORLINKS doesn't work for headers containing some special characters

kevin-lee commented 9 years ago

Pegdown version	JDK Version
1.5.0	8

There are two issues.

When a header contains certain special characters, ANCHORLINKS does not generate a link for it.
Headers with the same words but different special chars all get exactly the same link.
First Issue (No link for some special chars)

For the first issue, I've tested with all the special chars from my keyboard.

(O): Link (working)
(X): No Link (not working)

# Some ! Heading (X)
# Some @ Heading (O)
# Some # Heading (X)
# Some $ Heading (O)
# Some % Heading (O)
# Some ^ Heading (O)
# Some & Heading (X)
# Some * Heading (O)
# Some ( Heading (O)
# Some ) Heading (O)
# Some - Heading (O)
# Some _ Heading (O)
# Some + Heading (O)
# Some ; Heading (O)
# Some : Heading (O) <= (X): When `DEFINITIONS` is enabled.
# Some ' Heading (O)
# Some " Heading (O)
# Some [ Heading (X)
# Some ] Heading (X)
# Some { Heading (O)
# Some } Heading (O)
# Some | Heading (O)
# Some \ Heading (X)
# Some < Heading (X)
# Some > Heading (X)
# Some , Heading (O)
# Some . Heading (O)
# Some / Heading (O)
# Some ? Heading (O)
# Some ` Heading (X)
# Some ~ Heading (O)

new PegDownProcessor(ANCHORLINKS)

When DEFINITIONS is set as well, # Some : Heading has no link (doesn't work for : colon).

new PegDownProcessor(DEFINITIONS | ANCHORLINKS)

Second Issue

All the links generated for all the headers I put above are exactly the same. They are #some-heading.

What GitHub Does

On GitHub, it looks like this. It all works. GitHub markdown replaces a space char with - so the link for a case like # Some - Heading becomes some---heading (triple -s) and # Some _ Heading becomes some-_-heading. If a char can't be used in a URL, it is removed, and then a sequential number is added to the end so that each case has a different link name.

e.g.)

# Some ) Heading  => some--heading-1
# Some - Heading  => some---heading
# Some _ Heading  => some-_-heading
# Some + Heading  => some--heading-2

For testing, you can use the following cases.

# Some ! Heading
# Some @ Heading
# Some # Heading
# Some $ Heading
# Some % Heading
# Some ^ Heading
# Some & Heading
# Some * Heading
# Some ( Heading
# Some ) Heading
# Some - Heading
# Some _ Heading
# Some + Heading
# Some ; Heading
# Some : Heading
# Some ' Heading
# Some " Heading
# Some [ Heading
# Some ] Heading
# Some { Heading
# Some } Heading
# Some | Heading
# Some \ Heading
# Some < Heading
# Some > Heading
# Some , Heading
# Some . Heading
# Some / Heading
# Some ? Heading
# Some ` Heading
# Some ~ Heading

stavytskyi commented 9 years ago

I have the same problem. For example:

# Some Header (TBD)
# Some :

kevin-lee commented 9 years ago

@stavytskyi ~~That's fixed #162. It's just not released yet. I hope it will happen soon.~~ <= WRONG! I'm sorry.

stavytskyi commented 9 years ago

@Kevin-Lee thank you for the information. I hope the release will be soon.

kevin-lee commented 9 years ago

@stavytskyi Oops! Sorry, I was wrong. I thought this issue was #159. This is not fixed yet, unfortunately.

Deraen commented 8 years ago

I believe this mostly works with the new EXTANCHORLINKS option, though all non-alphanumeric characters in the header are just ignored for the anchor name (unlike what GitHub does).

vsch commented 8 years ago

@Kevin-Lee, the question I have is how do you know what the generated anchor link is like so you can link to it in markdown, without looking at the generated HTML?

If you can't easily predict the generated anchor then it is quite useless except for making the header go to top of page when you click on the header. Seems like not much use out of that.

However, these anchor links are useful for creating your own links within the page, but that means you need to know what that anchor will be based on the header.

I for one would not care for the fancy algo that I can't predict without looking at the HTML.

kevin-lee commented 8 years ago

@vsch You can use a preview feature to get the link. Sometimes you do need to check what it actually is, for instance when a header contains chars which are invalid in a URL. I think remembering the rules for which chars are valid in a header is harder than using the preview to check what the actual link is.

vsch commented 8 years ago

I am working on this right now. As far as I can tell from experimenting with GitHub, the logic is simple:

map: ' ' -> '-', '-' -> '-', '_' -> '_', 0..9 -> 0..9, a..z -> a..z, A..Z -> a..z, all other characters are ignored.
take resulting text as the base reference id
If another base of equal value was encountered before, append -# where # is 1, 2, ... for subsequent references.

No attempt is made to eliminate conflicts with future headers that may clash. For example for headers appearing in the order given with text: abcd, abcd, abcd, abcd-1

Will generate references to: abcd, abcd-1, abcd-2, abcd-1
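
The three steps above can be sketched in plain Java, without pegdown on the classpath. This is only an illustration of the rules as described in this thread, not pegdown's or GitHub's actual code. The class and method names (`AnchorIdSketch`, `baseId`, `nextId`) are made up for the example, and the underscore mapping follows the rule as it is quoted further down the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a sketch of the mapping described above, not a pegdown class.
final class AnchorIdSketch {
    // How many times each base id has been seen so far in the current document.
    private final Map<String, Integer> seen = new HashMap<>();

    // Step 1 and 2: map characters and return the base reference id.
    static String baseId(String headerText) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < headerText.length(); i++) {
            char c = headerText.charAt(i);
            if (c == ' ' || c == '-') {
                sb.append('-');                      // ' ' -> '-', '-' -> '-'
            } else if (c == '_' || (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z')) {
                sb.append(c);                        // kept as-is
            } else if (c >= 'A' && c <= 'Z') {
                sb.append(Character.toLowerCase(c)); // A..Z -> a..z
            }
            // all other characters are ignored
        }
        return sb.toString();
    }

    // Step 3: the first use of a base id is returned unchanged, later uses get -1, -2, ...
    // No attempt is made to avoid clashing with a later header such as "abcd-1".
    String nextId(String headerText) {
        String base = baseId(headerText);
        int count = seen.merge(base, 1, Integer::sum);
        return count == 1 ? base : base + "-" + (count - 1);
    }

    public static void main(String[] args) {
        AnchorIdSketch ids = new AnchorIdSketch();
        for (String header : new String[] {"abcd", "abcd", "abcd", "abcd-1"}) {
            System.out.println(header + " => " + ids.nextId(header));
        }
    }
}
```

One instance is meant to be used per document, since the counters carry the "seen before" state. Running `main` on the four headers from the example reproduces the clash on abcd-1.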

I already made changes to pegdown to make it easier to extend the ToHtmlSerializer without needing to re-implement a big chunk of it. The ref ids get computed at the top most RootNode visit by calling a member function you can override in your implementation. If not overridden you get current behaviour: i.e. regression tests pass. This way, out of the box pegdown can be customized without needing to play with the parser or the HtmlSerializer.

Here are the notes I am adding to CHANGELOG since I won't have time to update java docs, I hope someone else will be up to it, or users can refer to the source which is always the most reliable reference. ;)

The following are added to allow easy customization of ToHtmlSerializer without rewriting it.

NOTE: these changes are backwards compatible. If you don't override the new functions you get the old behaviour; override them only where you want to customize the HTML output.

Added preview(Node node, String tag, Attributes attributes, boolean tocGenerationVisit) to ToHtmlSerializer, called before every node that has a tag so that derived classes can modify attributes output in the HTML, returned attributes will be output in the order they were added to Attributes. Re-use the passed in parameter or create a new one.
Added Printer.preview(node, String tag, attributes, boolean tocGenerationVisit) to Printer so that the serializer's preview() can be accessed by plugins and verbatim serializers. Together with Attributes methods you can change classes, add attributes based on tags and node parameters without rewriting the whole serializer.

Note: tocGenerationVisit will be true if the output is for TOC rendering. You will get the same nodes once with tocGenerationVisit true and once false for nodes that are part of TOC headers. In all cases the passed in attributes contain the default attributes as they are now rendered by ToHtmlSerializer. If you return them unmolested, you will get output as it is now.
Added ToHtmlSerializer.printTaskListItemMarker(Printer printer, TaskListNode node, boolean isParaWrapped) that prints the task list item marker, default prints an input checkbox, isParaWrapped is true when the contents of li tag are wrapped in <p></p> just in case it makes a difference to what you want to output.
Changed DefaultVerbatimSerializer to also print attributes returned by the call to preview() before closing the <code tag. Passed in attributes will contain the class of the node.getType() value.
Added String computeHeaderId(HeaderNode node, AnchorLinkNode anchorLinkNode, String headerText) to ToHtmlSerializer called before generating any HTML in the top most RootNode processing for all headers, depth first traversal. Returning an empty string will output Header without id attribute. Returning any other value will output header with that id and if there is an anchor link it will also change its reference and name attribute to match the returned value. Use this to override how anchor link references are generated. Additional benefit if you override this function is that you will always know what id to expect for the header and can generate the right reference.

This way you can create your own logic for generating the reference link ids to match your requirements by overriding only a single member of ToHtmlSerializer. Similarly, to change the way task lists are generated it is also a single override.
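
As a rough sketch of what such a single override could look like: the block below declares empty stand-ins for `HeaderNode`, `AnchorLinkNode` and `ToHtmlSerializer` only so that it compiles on its own. With pegdown on the classpath you would delete the stand-ins and extend the real class. The `computeHeaderId` signature is copied from the note above; its visibility, the stand-in bodies, and the `PredictableIdSerializer` name are assumptions made for the example.

```java
// Stand-ins for the pegdown types named above (hypothetical bodies, for compilation only).
class HeaderNode {}
class AnchorLinkNode {}

class ToHtmlSerializer {
    // Placeholder default, NOT pegdown's real implementation; "protected" is an assumption.
    protected String computeHeaderId(HeaderNode node, AnchorLinkNode anchorLinkNode, String headerText) {
        return headerText;
    }
}

// The part you would actually write: one override that makes header ids predictable.
class PredictableIdSerializer extends ToHtmlSerializer {
    @Override
    protected String computeHeaderId(HeaderNode node, AnchorLinkNode anchorLinkNode, String headerText) {
        StringBuilder id = new StringBuilder();
        for (char c : headerText.toCharArray()) {
            if (c == ' ') {
                id.append('-');                      // each space becomes a dash
            } else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
                id.append(c);                        // keep lower-case letters and digits
            } else if (c >= 'A' && c <= 'Z') {
                id.append(Character.toLowerCase(c)); // lower-case upper-case letters
            }
            // everything else is dropped
        }
        // Per the note above, an empty string means the header is output without an id.
        return id.toString();
    }
}
```

Because the rule lives in your own code, you know in advance which id a header such as Some Header (TBD) will get and can write the matching reference by hand.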

In addition, another member override: preview(...) that passes the node, the node's tag and current attribute set will allow you to add/remove/append class or any other attribute. By default the class that implements Attributes can handle multiple calls to add(name, value) with the same name. It will append value as a space delimited list. So out of the box you can just add("class", whateverClass) to attributes for nodes of your choice to add extra classes.

This function is also called for VerbatimNode from the DefaultVerbatimSerializer so you can add/remove/change the class assigned to <code>.

matthova commented 2 years ago

Apologies if this is a bit off topic

1. map: ' ' -> '-', '-' -> '-', '_' -> '_', 0..9 -> 0..9, a..z -> a..z, A..Z -> a..z, all other characters are ignored.

I found this issue on Google when trying to make an anchor tag link. The "all other characters are ignored" part helped me fix my issue. [@my-package/foo](#my-packagefoo) -> # @my-package/foo was the correct way to map it. Thanks @vsch!
