html entities in filter output

tg-x commented 14 years ago

if there are html entities in the filter output, like &ndash; or &hellip; (although &amp; does not seem to cause a problem) the following happens:

in preview mode: no content is displayed
after saving: these html entities are removed

(I tried this with an emacs orgmode filter which generates html from org documents using emacs)

tg-x commented 14 years ago

seems like this works now, but there's a new problem with displaying utf8 characters, they appear as 2 latin1 characters

tg-x commented 14 years ago

ok this is because of the change in helper.rb:

Nokogiri::HTML produces wrong encoding, whereas with Nokogiri::XML the HTML entities do not work
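For context: an XML parser only predefines `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;`, so named HTML entities such as `&ndash;` are unknown to it. One possible workaround, sketched below, is to rewrite named entities as numeric character references before the text reaches the XML parser. This is not what the commits in this thread do; the `numeric_entities` helper and its small entity table are hypothetical and only cover the entities listed.

```ruby
# Hypothetical helper, not part of olelo or Nokogiri.
# Maps a few named HTML entities to their Unicode code points.
NAMED_ENTITIES = {
  'ndash'  => 0x2013,
  'hellip' => 0x2026,
  'nbsp'   => 0x00A0
}.freeze

# Replace known named entities with hexadecimal numeric references.
# Anything not in the table (including the XML-predefined &amp;, &lt;, ...)
# is left untouched.
def numeric_entities(html)
  html.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    code_point = NAMED_ENTITIES[Regexp.last_match(1)]
    code_point ? format('&#x%X;', code_point) : match
  end
end

puts numeric_entities('a &ndash; b &hellip; c &amp; d')
# prints: a &#x2013; b &#x2026; c &amp; d
```

Numeric references are valid in both HTML and XML, which is why this kind of preprocessing sidesteps the entity lookup entirely.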

tg-x commented 14 years ago

fixed in tgbit/olelo/@34143fc10e3807f7a29602538973fbdb93dfdc37 & tgbit/olelo/@5c276996f582cd105f49f682cdb7c588ce5a5244

minad commented 14 years ago

why is the html encoding wrong? Can this be fixed?

minad commented 14 years ago

see also issue #28.

tg-x commented 14 years ago

see previous comment, it works if fragments are added with Nokogiri::HTML::DocumentFragment.parse()

otherwise two latin1 characters are displayed instead of multi-byte utf8 characters
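The "two latin1 characters per utf8 character" symptom can be reproduced without Nokogiri. The sketch below assumes a modern Ruby (1.9 or later, with UTF-8 source files) rather than the 1.8.7 used in this thread, and the helper names are made up for illustration. It shows what happens when UTF-8 bytes are labelled as ISO-8859-1 and then transcoded, and how that specific mistake can be undone.

```ruby
# Simulate the bug: treat the UTF-8 bytes as if they were ISO-8859-1 (Latin-1)
# and transcode the result to UTF-8. Every multi-byte character becomes
# one Latin-1 character per byte.
def misread_as_latin1(utf8_text)
  utf8_text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Undo the mistake: transcode back to Latin-1 bytes and relabel them as UTF-8.
# Raises Encoding::UndefinedConversionError if the text contains characters
# outside Latin-1, i.e. if it was not garbled in this particular way.
def undo_misread(garbled)
  garbled.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
end

original = 'hëlló'
garbled  = misread_as_latin1(original)

puts garbled                 # hÃ«llÃ³
puts original.length         # 5
puts garbled.length          # 7
puts undo_misread(garbled)   # hëlló
```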

minad commented 14 years ago

hmm this is kind of a hack

tg-x commented 14 years ago

well, i don't know nokogiri that well, maybe it's a bug in nokogiri? does this happen to you as well? try some accented characters in the sidebar or preview. in the content area it works because the text is already in the document when it's passed to nokogiri, whereas the sidebar and preview content are added later in the layout hooks

minad commented 14 years ago

It would be nice if we could share a git repository with test pages here on github for example. On my installation everything seems to be good.

minad commented 14 years ago

please try the xmlentries branch

minad commented 14 years ago

seems to be a broken libxml version. please confirm:

http://github.com/minad/olelo/commit/6ee25db8d754e75de04781377d835986b54b30f8

tg-x commented 14 years ago

well you still need to use that patch I posted above to get properly encoded characters, even with the newer libxml. I updated it to use XMLFragment: tgbit/olelo/@280ae6dba407f7aa17b1fc731092eb400089404b

minad commented 14 years ago

html entities work now? but encoding is wrong? you are on ruby 1.8.7?

tg-x commented 14 years ago

html entities have been working since the change from Nokogiri::XML to Nokogiri::HTML, but the encoding has not been working properly after that change, only with these fixes. and yes, ruby 1.8.7

minad commented 14 years ago

can you create a separate test case using nokogiri?

require 'nokogiri'
doc = Nokogiri::HTML::Document.parse('<html></html>', nil, 'UTF-8')
doc.before '<tag/>'
...

something like that.

tg-x commented 14 years ago

I just noticed that the sidebar is ok now without any change, only the preview has problems with encoding

tg-x commented 14 years ago

i ran the following test, as you can see when DocumentFragment is not used there are two characters printed for each accented character instead of one:

code:

require 'nokogiri'

content = '<html><head></head><body><div class="content">hëlló &ndash; wörld!</div></body></html>'
preview = '<div class="preview">hëlló &ndash; wörld</div>'

doc = Nokogiri::HTML(content, nil, 'UTF-8')
doc.css('.content').after preview
doc.css('.preview').after Nokogiri::HTML::DocumentFragment.parse preview

print doc.to_xhtml

output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
  <body>
    <div class="content">h&#xEB;ll&#xF3; &#x2013; w&#xF6;rld!</div>
    <div class="preview">h&#xC3;&#xAB;ll&#xC3;&#xB3; &#x2013; w&#xC3;&#xB6;rld</div>
    <div class="preview">h&#xEB;ll&#xF3; &#x2013; w&#xF6;rld</div>
  </body>
</html>
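
The numeric references in the first `preview` div of the output above are exactly the UTF-8 bytes of each accented character, written out one byte at a time. The sketch below reproduces that pattern in plain Ruby, assuming a modern Ruby with UTF-8 source files; `char_refs` is a hypothetical helper that only imitates how the serializer writes non-ASCII characters.

```ruby
# Hypothetical helper for illustration: write every non-ASCII character as a
# hexadecimal numeric character reference, similar to the serialized output.
def char_refs(text)
  text.each_char.map { |c| c.ord < 128 ? c : format('&#x%X;', c.ord) }.join
end

word = 'wörld'
# Same bytes, wrongly labelled as ISO-8859-1 and then transcoded to UTF-8.
misread = word.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts char_refs(word)     # w&#xF6;rld
puts char_refs(misread)  # w&#xC3;&#xB6;rld
```

So `&#xC3;&#xB6;` is not a different character from `&#xF6;`; it is the two-byte UTF-8 encoding of `ö` read as two separate Latin-1 characters.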

minad commented 14 years ago

I cannot reproduce. Please post this on the nokogiri issue tracker. It works on 1.8 and 1.9 for me:

<div class="content">h&#xEB;ll&#xF3; &#x2013; w&#xF6;rld!</div>
<div class="preview">h&#xEB;ll&#xF3; &#x2013; w&#xF6;rld</div>
<div class="preview">h&#xEB;ll&#xF3; &#x2013; w&#xF6;rld</div>

tg-x commented 14 years ago

hm, interesting, so it's just me, i'll post it there then

tg-x commented 14 years ago

ok i just realized that even though i upgraded libxml2 it's still not the latest, i'll try to upgrade more :)

tg-x commented 14 years ago

ok, works now with 2.7.7
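
Since the problem went away with a newer libxml2, an application could warn when it runs against an older one. The sketch below only shows the version comparison. It assumes the version string is obtained elsewhere (for example from what `nokogiri -v` reports), and it treats 2.7.7 as "known good" only because that is the version reported to work here, not because it is a documented minimum. The helper name and the older version number in the usage lines are made up.

```ruby
require 'rubygems' # provides Gem::Version; loaded by default on modern Ruby

# Assumption: 2.7.7 is the version that worked in this thread,
# not a documented lower bound.
KNOWN_GOOD_LIBXML = Gem::Version.new('2.7.7')

# Hypothetical helper: true if the given libxml2 version string is at least
# the version that was reported to work.
def libxml_at_least_known_good?(version_string)
  Gem::Version.new(version_string) >= KNOWN_GOOD_LIBXML
end

puts libxml_at_least_known_good?('2.7.7')   # true
puts libxml_at_least_known_good?('2.6.32')  # false (made-up older version)
```

Using `Gem::Version` avoids the pitfalls of comparing version strings lexically, where '2.10.0' would sort before '2.7.7'.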

minad / olelo

html entities in filter output #25