Open whatupdave opened 10 years ago
Yup, I've just hit this same issue.
I've hit this same issue, too.
It splits my UTF-8 character: the first byte ends up inside the link and the remaining bytes are left outside it.
Example: hi@www.com\u300D
turns into <a href="mailto:hi@www.com%E3">mailto:hi@www.com\xE3</a>\x80\x8D
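To make the split concrete, here is a small Ruby sketch that needs no Redcarpet at all. It cuts the example string at the byte offset where the broken link ends. The helper name `split_at_byte` is made up for illustration; it is not part of Redcarpet or Rinku.

```ruby
# Hypothetical helper: split a string at a byte offset, ignoring
# character boundaries, the way the broken link boundary does.
def split_at_byte(text, offset)
  [text.byteslice(0, offset), text.byteslice(offset, text.bytesize - offset)]
end

text = "hi@www.com\u300D" # U+300D is one character, three bytes in UTF-8
puts text.bytes.last(3).map { |b| format("%02X", b) }.join(" ")

# Cut one byte past the end of the address, inside the 3-byte character.
link, rest = split_at_byte(text, "hi@www.com".bytesize + 1)
puts link.valid_encoding?    # false: ends with a lone lead byte 0xE3
puts rest.valid_encoding?    # false: only continuation bytes 0x80 0x8D
puts((link + rest) == text)  # true: no bytes are lost, only the boundary is wrong
```

Neither half is valid UTF-8 on its own, which is why anything downstream that checks the encoding rejects the output.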
I'm having the same issue as well. Any ideas of a fix for this @vmg?
I think the problem is the same as https://github.com/vmg/redcarpet/pull/358
But why can a UTF-8 character be split...
I traced the code and extracted the function sd_autolink__email into my own test code, where it works fine.
That's weird, because after copying the link into the buffer, sd_autolink__email calls the autolink callback and passes it the link.
If sd_autolink__email were working correctly, the callback wouldn't receive the wrong link.
BTW, Rinku (https://github.com/vmg/rinku) has the same issue.
I found the culprit here: https://github.com/vmg/redcarpet/blob/master/ext/redcarpet/autolink.c#L227
    for (link_end = 0; link_end < size; ++link_end) {
        uint8_t c = data[link_end];
        if (isalnum(c)) /* HERE */
            continue;
        if (c == '@')
            nb++;
        else if (c == '.' && link_end < size - 1)
            np++;
        else if (c != '-' && c != '_')
            break;
    }
When it is passed 」 (\xE3\x80\x8D), isalnum(0xE3) returns true, so the first byte of the character is treated as part of the address.
When I changed the if statement to if (isalnum(c) && c < 0x7f), it works fine.
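Here is a rough byte-level model of that scan loop in Ruby, just to show the effect of the ASCII guard. It is not Redcarpet's code. `email_link_end`, `ASCII_ALNUM` and `LATIN1_LIKE_ALNUM` are names I made up, and `LATIN1_LIKE_ALNUM` rests on an assumption: that isalnum() behaved like an ISO-8859-1 style classification, where 0xE3 counts as a letter and 0x80 does not. The thread only establishes that isalnum(0xE3) returned true.

```ruby
# ASCII-only test, equivalent in effect to `isalnum(c) && c < 0x7f`.
ASCII_ALNUM = lambda do |c|
  (0x30..0x39).cover?(c) || (0x41..0x5A).cover?(c) || (0x61..0x7A).cover?(c)
end

# ASSUMPTION: stand-in for an isalnum() that also accepts Latin-1 letters
# (0xC0..0xFF except the multiplication and division signs).
LATIN1_LIKE_ALNUM = lambda do |c|
  ASCII_ALNUM.call(c) || (c >= 0xC0 && c != 0xD7 && c != 0xF7)
end

# Model of the loop: advance while the byte is alphanumeric, '@', '-', '_',
# or a '.' that is not the last byte. Returns the byte offset of the link end.
def email_link_end(bytes, alnum)
  link_end = 0
  while link_end < bytes.size
    c = bytes[link_end]
    allowed = alnum.call(c) ||
              c == 0x40 || c == 0x2D || c == 0x5F ||
              (c == 0x2E && link_end < bytes.size - 1)
    break unless allowed
    link_end += 1
  end
  link_end
end

bytes = "hi@www.com\u300D".bytes
puts email_link_end(bytes, LATIN1_LIKE_ALNUM) # 11: swallows the lead byte 0xE3
puts email_link_end(bytes, ASCII_ALNUM)       # 10: stops at the end of the address
```

With the permissive predicate the link ends one byte into the character, which matches the %E3 seen in the href above. With the ASCII-only predicate it ends on a character boundary.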
Not sure if it is redcarpet related (or upstream-kramdown), but I have the same problem when header contains a UTF-8 character:
# dupa
## dópa
redcarpet --render with_toc_data test.md
<h1 id="dupa">dupa</h1>
<h2 id="d�pa">dópa</h2>
When jekyll makes a build I get the following exception:
Liquid Exception: invalid byte sequence in UTF-8 in feed.xml
jekyll 2.4.0 | Error: invalid byte sequence in UTF-8
Normally I'd use a urlify implementation like this one: https://github.com/beastaugh/urlify, but it seems that the escaping is done in C… well, I don't have the slightest idea how to debug it ;)
@vmg hope it helps somehow :)
I'm getting invalid byte sequence in UTF-8, trying to render markdown w/ redcarpet on the following char, but only if it's in the (bash) code block. Outside of the codeblock it works fine. The char is on the first line of the code block.
¢
I'm still getting this issue when using autolinking. UTF-8 characters are being split apart when they appear right after a piece of text that will be autolinked. For instance, this input causes problems:
Email me at “someone@somewhere.com”
Is there a fix for this?
@mdchaney a patch is already here: https://github.com/vmg/redcarpet/pull/463
Okay, I'll just pull from repo then. Are there plans of another release?
I have no idea whether this repo is going to merge the patch or not. So just apply the patch yourself. lol
Yeah, I realized that. Ugh. Looks like redcarpet has been abandoned - one of us should probably fork it and apply the outstanding merge requests. This particular one is a biggie.
@vmg - Any chance of a fix for this? This one is biting me as well. This bug can be easily reproduced like this:
require "redcarpet"

renderer = Redcarpet::Render::HTML.new(with_toc_data: true)
md = Redcarpet::Markdown.new(renderer, no_intra_emphasis: true, tables: true, autolink: true, quote: true)
md.render("“foo@example.com“")
# => "<p>“<a href=\"mailto:foo@example.com%E2\">foo@example.com\xE2</a>\x80\x9C</p>\n"
md.render("“foo@example.com“").valid_encoding?
# => false
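If you want to see exactly where the renderer output goes wrong, something like the following can list the stray bytes. This is only a diagnostic sketch in plain Ruby. `invalid_bytes` is a hypothetical helper, not a Redcarpet API, and it relies on `String#each_char` yielding each invalid byte as its own one-byte string.

```ruby
# Hypothetical diagnostic: return [byte_offset, bytes] pairs for every
# piece of the string that is not valid in its encoding.
def invalid_bytes(str)
  offset = 0
  found = []
  str.each_char do |ch|
    found << [offset, ch.bytes] unless ch.valid_encoding?
    offset += ch.bytesize
  end
  found
end

# A fragment of the broken output shown above.
broken = "foo@example.com\xE2</a>\x80\x9C"
invalid_bytes(broken).each do |pos, bytes|
  puts format("offset %d: %s", pos, bytes.map { |b| format("%02X", b) }.join(" "))
end
```

It does not fix anything, but it shows that the lead byte sits inside the link and the continuation bytes sit after the closing tag.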
Just checked why we are maintaining our own fork as well. @robin850 thanks for your last merges and releases. Do you see any chance to merge this one? Do you need any help?
Not sure what's causing this:
It's fine without autolinking: