sirthias / pegdown

A pure-Java Markdown processor based on a parboiled PEG parser supporting a number of extensions
http://pegdown.org
Apache License 2.0

ANCHORLINKS doesn't work for headers containing some special characters #160

Open kevin-lee opened 9 years ago

kevin-lee commented 9 years ago
| Pegdown version | JDK Version |
| --- | --- |
| 1.5.0 | 8 |

There are two issues.

  1. When a header contains some special character, ANCHORLINKS does not work.
  2. All headers with the same words but different special chars get exactly the same link.

First Issue (No link for some special chars)

For the first issue, I've tested with all the special chars from my keyboard.

(O): Link (working)
(X): No Link (not working)

# Some ! Heading (X)
# Some @ Heading (O)
# Some # Heading (X)
# Some $ Heading (O)
# Some % Heading (O)
# Some ^ Heading (O)
# Some & Heading (X)
# Some * Heading (O)
# Some ( Heading (O)
# Some ) Heading (O)
# Some - Heading (O)
# Some _ Heading (O)
# Some + Heading (O)
# Some ; Heading (O)
# Some : Heading (O) <= (X): When `DEFINITIONS` is enabled.
# Some ' Heading (O)
# Some " Heading (O)
# Some [ Heading (X)
# Some ] Heading (X)
# Some { Heading (O)
# Some } Heading (O)
# Some | Heading (O)
# Some \ Heading (X)
# Some < Heading (X)
# Some > Heading (X)
# Some , Heading (O)
# Some . Heading (O)
# Some / Heading (O)
# Some ? Heading (O)
# Some ` Heading (X)
# Some ~ Heading (O)

new PegDownProcessor(ANCHORLINKS)
new PegDownProcessor(DEFINITIONS | ANCHORLINKS)

Second Issue

All the links generated for all the headers I put above are exactly the same. They are #some-heading.

What GitHub Does

On GitHub, it looks like this. It all works. GitHub markdown replaces a space char with - so the link for a case like # Some - Heading becomes some---heading (triple -s) and # Some _ Heading becomes some-_-heading. If the char can't be used in the URL, it is removed, then a sequential number is added to the end so each case has a different link name.

e.g.)

# Some ) Heading  => some--heading-1
# Some - Heading  => some---heading
# Some _ Heading  => some-_-heading
# Some + Heading  => some--heading-2

# Some ! Heading
# Some @ Heading
# Some # Heading
# Some $ Heading
# Some % Heading
# Some ^ Heading
# Some & Heading
# Some * Heading
# Some ( Heading
# Some ) Heading
# Some - Heading
# Some _ Heading
# Some + Heading
# Some ; Heading
# Some : Heading
# Some ' Heading
# Some " Heading
# Some [ Heading
# Some ] Heading
# Some { Heading
# Some } Heading
# Some | Heading
# Some \ Heading
# Some < Heading
# Some > Heading
# Some , Heading
# Some . Heading
# Some / Heading
# Some ? Heading
# Some ` Heading
# Some ~ Heading

stavytskyi commented 9 years ago

I have the same problem. For example:

# Some Header (TBD)
# Some :

kevin-lee commented 9 years ago

@stavytskyi That's fixed in #162. It's just not released yet. I hope it will happen soon. <= WRONG! I'm sorry.

stavytskyi commented 9 years ago

@Kevin-Lee thank you for the information. I hope the release will be soon.

kevin-lee commented 9 years ago

@stavytskyi Oops! Sorry, I was wrong. I thought this issue was #159. This one is not fixed yet, unfortunately.

Deraen commented 8 years ago

I believe this mostly works with the new EXTANCHORLINKS option, though all non-alphanumeric characters in the header are just ignored for the anchor name (unlike what GitHub does).

vsch commented 8 years ago

@Kevin-Lee, the question I have is how do you know what the generated anchor link is like so you can link to it in markdown, without looking at the generated HTML?

If you can't easily predict the generated anchor then it is quite useless except for making the header go to top of page when you click on the header. Seems like not much use out of that.

However, these anchor links are useful for creating your own links within the page, but that means you need to know what that anchor will be based on the header.

I for one would not care for the fancy algo that I can't predict without looking at the HTML.

kevin-lee commented 8 years ago

@vsch You can use a preview feature to get the link. You sometimes need to check what it actually is, for instance if a header contains characters that are invalid in a URL. I think remembering the rules for which characters are valid in a header is more difficult than using the preview to check what the actual link is.

vsch commented 8 years ago

I am working on this right now. As far as I can tell from experimenting with GitHub, the logic is simple:

  1. map: ' ' -> '-', '-' -> '-', '_' -> '_', 0..9 -> 0..9, a..z -> a..z, A..Z -> a..z, all other characters are ignored.
  2. take resulting text as the base reference id
  3. If another base of equal value was encountered before, append -# where # is 1...., for subsequent references.

No attempt is made to eliminate conflicts with future headers that may clash. For example for headers appearing in the order given with text: abcd, abcd, abcd, abcd-1

Will generate references to: abcd, abcd-1, abcd-2, abcd-1
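The three steps above can be sketched in a few lines of plain Java. This is only an illustration of the rules as described in this comment, not pegdown's or GitHub's actual code; the class name `AnchorIdSketch` and its methods `baseId`, `nextId` and `idsFor` are hypothetical names made up for the example, and only ASCII letters and digits are handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of pegdown: illustrates the three steps listed above.
class AnchorIdSketch {
    // How many times each base id has been produced so far.
    private final Map<String, Integer> seen = new HashMap<>();

    // Step 1: ' ' becomes '-'; '-', '_', digits and lowercase letters are kept;
    // uppercase letters are lowercased; every other character is dropped.
    static String baseId(String headerText) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < headerText.length(); i++) {
            char c = headerText.charAt(i);
            if (c == ' ') {
                sb.append('-');
            } else if (c == '-' || c == '_' || (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z')) {
                sb.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Steps 2 and 3: use the base id as is the first time it appears,
    // then append -1, -2, ... for later headers with the same base id.
    String nextId(String headerText) {
        String base = baseId(headerText);
        int earlier = seen.merge(base, 1, Integer::sum) - 1;
        return earlier == 0 ? base : base + "-" + earlier;
    }

    // Convenience: ids for a whole document, in header order.
    static List<String> idsFor(List<String> headers) {
        AnchorIdSketch ids = new AnchorIdSketch();
        List<String> result = new ArrayList<>();
        for (String header : headers) {
            result.add(ids.nextId(header));
        }
        return result;
    }

    public static void main(String[] args) {
        // The clash example from above: the last id collides with the second one.
        System.out.println(idsFor(Arrays.asList("abcd", "abcd", "abcd", "abcd-1")));
        // prints [abcd, abcd-1, abcd-2, abcd-1]
    }
}
```

Because the counter is keyed on the base id only, the sketch reproduces the clash noted above: no attempt is made to avoid an id that an earlier duplicate already received.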

I already made changes to pegdown to make it easier to extend the ToHtmlSerializer without needing to re-implement a big chunk of it. The ref ids get computed at the topmost RootNode visit by calling a member function you can override in your implementation. If not overridden you get current behaviour: i.e. regression tests pass. This way, out of the box pegdown can be customized without needing to play with the parser or the HtmlSerializer.

Here are the notes I am adding to CHANGELOG since I won't have time to update java docs, I hope someone else will be up to it, or users can refer to the source which is always the most reliable reference. ;)

The following are added to allow easy customization of ToHtmlSerializer without rewriting it.

NOTE: these changes are backwards compatible. If you don't override the new functions you get the old behaviour; override them only if you want to customize the HTML output.

This way you can create your own logic for generating the reference link ids to match your requirements by overriding only a single member of ToHtmlSerializer. Similarly, to change the way task lists are generated it is also a single override.

In addition, another member override: preview(...) that passes the node, the node's tag and current attribute set will allow you to add/remove/append class or any other attribute. By default the class that implements Attributes can handle multiple calls to add(name, value) with the same name. It will append value as a space delimited list. So out of the box you can just add("class", whateverClass) to attributes for nodes of your choice to add extra classes.

This function is also called for VerbatimNode from the DefaultVerbatimSerializer so you can add/remove/change the class assigned to <code>.
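The space-delimited appending described above can be pictured with a small stand-in class. This is a sketch only: `AttributesSketch` is a hypothetical class written for this example and is not pegdown's Attributes implementation, whose actual methods should be checked in the source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, NOT pegdown's Attributes class: it only mimics the
// described behaviour of add(name, value) when called repeatedly with one name.
class AttributesSketch {
    // Attribute values by name, kept in insertion order.
    private final Map<String, String> values = new LinkedHashMap<>();

    // First call stores the value; later calls with the same name
    // append the new value after a single space.
    AttributesSketch add(String name, String value) {
        values.merge(name, value, (oldValue, newValue) -> oldValue + " " + newValue);
        return this;
    }

    // Returns the accumulated value, or null if the name was never added.
    String get(String name) {
        return values.get(name);
    }
}
```

With this stand-in, `new AttributesSketch().add("class", "task-list").add("class", "done").get("class")` yields `task-list done`; the class names used here are made up for the illustration.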

matthova commented 2 years ago

Apologies if this is a bit off topic

1. map: ' ' -> '-', '-' -> '-', '_' -> '_', 0..9 -> 0..9, a..z -> a..z, A..Z -> a..z, all other characters are ignored.

I found this issue on Google when trying to make an anchor tag link. The "all other characters are ignored" part helped me fix my issue. [@my-package/foo](#my-packagefoo) -> # @my-package/foo was the correct way to map it. Thanks @vsch!