vmg / redcarpet

The safe Markdown parser, reloaded.
MIT License

Character encoding issue with autolinking #388

Open whatupdave opened 10 years ago

whatupdave commented 10 years ago

Not sure what's causing this:

› ruby -e "require 'redcarpet'; puts Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: true).render('d@example.coü')"
<p><a href="mailto:d@example.co%C3">d@example.co�</a>�</p>

› ruby -e "require 'redcarpet'; puts Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: true).render('d@example.coü').inspect"
"<p><a href=\"mailto:d@example.co%C3\">d@example.co\xC3</a>\xBC</p>\n"

It's fine without autolinking:

› ruby -e "require 'redcarpet'; puts Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: false).render('d@example.coü')"
<p>d@example.coü</p>

neilmiddleton commented 9 years ago

Yup, I've just hit this same issue.

david50407 commented 9 years ago

I've hit this same issue, too.

It splits my UTF-8 character: the first byte ends up inside the link and the remaining bytes are left outside it.

For example, hi@www.com\u300D is turned into <a href="mailto:hi@www.com%E3">mailto:hi@www.com\xE3</a>\x80\x8D
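
The split can be reproduced in plain Ruby, without redcarpet. This is only a minimal sketch: it assumes the link boundary falls one byte into the three-byte character, as the output above suggests, and the variable names are made up for illustration.

```ruby
# Illustration only: cut a UTF-8 string one byte into a multibyte character,
# the way the autolinked output above appears to have been cut.
text = "hi@www.com\u300D"

# U+300D is encoded as three bytes in UTF-8.
tail_bytes = "\u300D".bytes.map { |b| format("%02X", b) }
puts tail_bytes.join(" ") # E3 80 8D

# Assumed cut position: end of the ASCII address plus one stray byte.
cut = "hi@www.com".bytesize + 1

link = text.byteslice(0, cut)                   # ends with the lone lead byte 0xE3
rest = text.byteslice(cut, text.bytesize - cut) # the two orphaned continuation bytes

puts link.inspect
puts rest.inspect
puts link.valid_encoding? # false
puts rest.valid_encoding? # false
```

Neither half is valid UTF-8 on its own, which matches the "invalid byte sequence" symptoms reported here.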

ericgoodwin commented 9 years ago

I'm having the same issue as well. Any ideas of a fix for this @vmg?

david50407 commented 9 years ago

I think the problem is the same as https://github.com/vmg/redcarpet/pull/358

But why can a UTF-8 character be split...

david50407 commented 9 years ago

I traced the code and extracted the sd_autolink__email function into my own test code, but there it works correctly.

It's so weird, because after sd_autolink__email copies the link into the buffer, the autolink callback is called with that link.

So if sd_autolink__email were working correctly, the callback wouldn't receive a broken link.

david50407 commented 9 years ago

BTW, Rinku has the same issue. https://github.com/vmg/rinku

david50407 commented 9 years ago

I found the point here: https://github.com/vmg/redcarpet/blob/master/ext/redcarpet/autolink.c#L227

    for (link_end = 0; link_end < size; ++link_end) {
        uint8_t c = data[link_end];

        if (isalnum(c)) /* HERE */
            continue;

        if (c == '@')
            nb++;
        else if (c == '.' && link_end < size - 1)
            np++;
        else if (c != '-' && c != '_')
            break;
    }

When the input contains \xE3\x80\x8D, isalnum(0xE3) returns TRUE, so the first byte of the character is treated as part of the address.

When I changed the if statement to if (isalnum(c) && c < 0x7f), it works fine.
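
To see why the extra guard helps, here is a rough Ruby model of the loop above. It is not the actual C code. `LATIN1_ALNUM` is a hypothetical stand-in for an `isalnum` that accepts high bytes (assumed here to behave as in a Latin-1 locale), `ASCII_ALNUM` is a strict ASCII-only check, and `scan_link_end` is an invented name. The `nb`/`np` counters are left out because they don't affect where the loop stops.

```ruby
# Strict check: only ASCII digits and letters count as alphanumeric.
ASCII_ALNUM = lambda do |c|
  (0x30..0x39).cover?(c) || (0x41..0x5A).cover?(c) || (0x61..0x7A).cover?(c)
end

# Hypothetical stand-in for a locale-dependent isalnum(): assumes Latin-1,
# where bytes 0xC0..0xFF (except 0xD7 and 0xF7) are letters.
LATIN1_ALNUM = lambda do |c|
  ASCII_ALNUM.call(c) || ((0xC0..0xFF).cover?(c) && c != 0xD7 && c != 0xF7)
end

# Model of the forward scan: returns how many bytes would go into the link.
# `data` starts at the '@', as in the C loop.
def scan_link_end(data, alnum)
  bytes = data.bytes
  link_end = 0
  while link_end < bytes.size
    c = bytes[link_end]
    unless alnum.call(c)
      at  = (c == 0x40)                                # '@'
      dot = (c == 0x2E && link_end < bytes.size - 1)   # '.' but not as last byte
      break unless at || dot || c == 0x2D || c == 0x5F # '-' or '_'
    end
    link_end += 1
  end
  link_end
end

data = "@example.co\u00FC"             # ü is 0xC3 0xBC in UTF-8
puts scan_link_end(data, LATIN1_ALNUM) # 12: swallows the lead byte 0xC3
puts scan_link_end(data, ASCII_ALNUM)  # 11: stops before the ü
```

With the permissive check the scan takes the lead byte 0xC3 and then stops at 0xBC, which is exactly the split seen in the rendered output. The strict check plays the role of the `c < 0x7f` guard.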

ryrych commented 8 years ago

Not sure if it is redcarpet related (or upstream-kramdown), but I have the same problem when a header contains a UTF-8 character:

# dupa
## dópa
redcarpet --render with_toc_data test.md
<h1 id="dupa">dupa</h1>
<h2 id="d�pa">dópa</h2>

When jekyll makes a build I get the following exception:

Liquid Exception: invalid byte sequence in UTF-8 in feed.xml
jekyll 2.4.0 | Error:  invalid byte sequence in UTF-8

Normally I'd use a urlify implementation like this one: https://github.com/beastaugh/urlify, but it seems that the escaping is done in C… well, I don't have the slightest idea how to debug it ;)

@vmg hope it helps in some way :)

MadPositron commented 6 years ago

I'm getting invalid byte sequence in UTF-8, trying to render markdown w/ redcarpet on the following char, but only if it's in the (bash) code block. Outside of the codeblock it works fine. The char is on the first line of the code block.

¢

mdchaney commented 4 years ago

I'm still getting this issue when using autolinking. UTF-8 characters are being split apart when they appear after a piece of text that will be autolinked. For instance:

Email me at “someone@somewhere.com”

Is going to cause problems. Is there a fix for this?

david50407 commented 4 years ago

@mdchaney patch is already here... https://github.com/vmg/redcarpet/pull/463

mdchaney commented 4 years ago

Okay, I'll just pull from the repo then. Are there plans for another release?

david50407 commented 4 years ago

I have no idea whether this repo is going to merge the patch or not. So just apply the patch yourself. lol

mdchaney commented 4 years ago

Yeah, I realized that. Ugh. Looks like redcarpet has been abandoned - one of us should probably fork it and apply the outstanding merge requests. This particular one is a biggie.

jstewart commented 3 years ago

@vmg - Any chance of a fix for this? This one is biting me as well. This bug can be easily reproduced like this:

renderer = Redcarpet::Render::HTML.new(with_toc_data: true)
md = Redcarpet::Markdown.new(renderer, no_intra_emphasis: true, tables: true, autolink: true, quote: true)
md.render("“foo@example.com“")

# => "<p>“<a href=\"mailto:foo@example.com%E2\">foo@example.com\xE2</a>\x80\x9C</p>\n"
# irb(main):008:0> md.render("“foo@example.com“").valid_encoding?
# => false

fwolfst commented 1 month ago

Just checked why we are maintaining our own fork as well. @robin850 thanks for your last merges and releases. Do you see any chance of merging this one? Do you need any help?