Unicode and html entities

ikirudennis commented 8 years ago

I've recently spent some time updating python-textile to resolve some bugs. Most of the bugs were related to link handling, and I stumbled upon something that leaves me wondering what the best course of action is.

When php-textile handles links, it uses its own HTML tag class to generate the a tags. Python has a few built-in modules which make generating HTML tags easier, so it seemed like a good idea to make use of them (why reinvent the wheel?). The catch is that they turn non-ASCII innerText into HTML entities. Both versions have tests which include input like: "Übermensch":http://de/wikipedia.org/wiki/Übermensch. php-textile generates <a href="http://de/wikipedia.org/wiki/%C3%9Cbermensch">Übermensch</a>, which is reasonably correct. I've tweaked the python version to generate the same, but without the tweaking, it turns the innerText of the link into &#220;bermensch.
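To make the difference concrete, here is a minimal sketch of how a standard-library serializer can produce the entity form. It assumes the "built-in modules" are along the lines of `xml.etree.ElementTree`; the thread does not name the module, and `make_link` is a hypothetical helper, not python-textile's actual code.

```python
from urllib.parse import quote
from xml.etree import ElementTree as ET


def make_link(text, url):
    """Build an <a> element (hypothetical helper, not python-textile's API)."""
    # Percent-encode non-ASCII characters in the URL, keeping ':' and '/' as-is.
    a = ET.Element("a", {"href": quote(url, safe=":/")})
    a.text = text
    return a


link = make_link("Übermensch", "http://de/wikipedia.org/wiki/Übermensch")

# Serializing to an ASCII byte string replaces non-ASCII text with
# numeric character references.
ascii_html = ET.tostring(link, encoding="us-ascii", method="html").decode("ascii")

# Asking for encoding="unicode" returns a str and leaves the text verbatim.
unicode_html = ET.tostring(link, encoding="unicode", method="html")

print(ascii_html)
print(unicode_html)
```

Under these assumptions the only difference between the two results is the link text: the href is identical because it was percent-encoded before serialization.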

I realize that this may be a tall order to handle in the tag class. I try to make the python version's output match the php version's exactly (it seems like the right thing to do, and hopefully means fewer version-specific bugs). Is there any interest in adding this unicode-to-HTML-entity translation to php-textile? How does it strike you if this is one way the two versions deviate? Is this something that's more of a problem for the spec? If you're interested, I could lend a hand translating the python tag module into a php class.

netcarver commented 8 years ago

@ikirudennis

Thanks for the report and for your kind offer regarding putting work into the php class if needed in this case! Before we go down that road, is there anything you can point me to that indicates that leaving the inner text verbatim with the actual utf-8 encoded character is incorrect for HTML5 - or why making the substitution to an HTML entity might be preferred in these cases? If there is clear evidence that entity encoding would be beneficial then it should be easy enough to do.

ikirudennis commented 8 years ago

I feel it should be noted that this would apply to innerText everywhere, not just within links. I think I see the subtext in your question (if it ain't broke, don't fix it), but allow me to play the devil's advocate for a moment...

The HTTP 1.1 Protocol says that in the absence of a specified charset, the default charset is "ISO-8859-1". If a user of the textile library hasn't configured the server with a default charset, and hasn't specified a charset in the header of the HTML document, the unicode will not display properly. However, the html-entity will display properly regardless of the charset specified by either the server or the document.

I know, I know, step number one of setting up both the server and the code serving the html document includes setting the charset to utf-8. The second sentence in the above paragraph is a BIG if.
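The charset mismatch described above can be reproduced in a few lines. This is only an illustration of the decoding step, assuming the client falls back to ISO-8859-1 exactly as stated; the variable names are mine.

```python
import html

original = "Übermensch"

# Case 1: the page is sent as UTF-8 bytes but read as ISO-8859-1.
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")  # the two UTF-8 bytes of "Ü" become two separate characters

# Case 2: the page contains a numeric entity, so every byte is plain ASCII.
entity_bytes = original.encode("ascii", "xmlcharrefreplace")
entity_misread = entity_bytes.decode("iso-8859-1")  # ASCII bytes decode the same way in both charsets

# An HTML parser resolves the entity back to the intended character.
recovered = html.unescape(entity_misread)

print(repr(misread))
print(entity_misread)
print(recovered)
```

The first case is garbled because one character was stored as two bytes; the second survives because nothing outside ASCII ever reaches the wire.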

Another argument to make: we already encode necessary things (less than and greater than symbols because it's necessary) and some extended characters (because it's helpful). It seems like being more consistent would be both advantageous and more helpful. Also, it looks like the following will do the heavy lifting:

<?php
$a = "Übermensch";
// convmap is array(start, end, offset, mask); this one covers U+0080 to U+00FF only
print(mb_encode_numericentity($a, array(0x80, 0xff, 0, 0xff), 'UTF-8'));
// &#220;bermensch
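For readers more at home in Python, here is a rough emulation of that call. It rests on my reading of the convmap as (start, end, offset, mask), which should be checked against the PHP manual; `encode_numeric_entities` is a hypothetical name. The point it illustrates is that the range in the map decides which characters get encoded.

```python
def encode_numeric_entities(text, start=0x80, end=0xFF, offset=0, mask=0xFF):
    """Replace code points in [start, end] with decimal character references.

    Assumed semantics: the emitted number is (code point + offset) & mask.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp <= end:
            out.append("&#%d;" % ((cp + offset) & mask))
        else:
            out.append(ch)
    return "".join(out)


# Same range as the PHP snippet: only U+0080..U+00FF is touched,
# so the euro sign (U+20AC) passes through unchanged.
latin1_only = encode_numeric_entities("Übermensch €5")

# Widening the range and the mask covers every non-ASCII code point.
everything = encode_numeric_entities("Übermensch €5", end=0x10FFFF, mask=0x1FFFFF)

print(latin1_only)
print(everything)
```

If the aim is to encode all non-ASCII inner text, the wider map is the one to compare against.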

I have no firmly-held beliefs one way or the other, but I think the debate is worth having. I think these sorts of questions are an indication that textile is complete enough that we can consider these deeper-level issues. I'm bringing this up as a curiosity which didn't occur to me until python provided me an alternative.

netcarver commented 8 years ago

Thanks for the feedback. I agree the debate is worth having so I'll leave this open for more feedback.

netcarver commented 8 years ago

@ikirudennis Looks like it's just going to be you and me discussing this, Dennis.

I certainly see your point regarding the default encoding situation, so I started a new branch locally this evening and applied your suggestion to the inner text of links (just links, for now). Whilst the implementation was fairly straightforward, doing it has brought some thoughts to the fore. Specifically:

- Substituting additional HTML entities for utf-8 characters will make reading the page source code more troublesome for most users.
- What's the size of the user group that isn't going to configure a correct stack vs the size of the group that will? Are most people who are going to need utf-8 characters in their inner text likely to know it and configure for it?
- Having the page display non-readable characters can act as a forcing function to get people to learn about, and correctly configure, their stack.
- Having the page display non-readable characters can make people think textile is broken.
- For php-textile, at least, it will take some time to cover all the cases where inner text needs to be encoded.
- More work for textile to do => slower performance.
- Doing the encoding is, undoubtedly, more tolerant of an incorrect config.

There's probably more that hasn't appeared to my monkey brain yet, but it's gone midnight here - and I wanted to get something out before I retire; this issue's been left alone for too long.

drewm commented 8 years ago

We use php-textile in @PerchCMS, and would be against the unnecessary HTML-encoding of UTF-8 characters.

If a user has a case where they've not configured a charset in an application handling non-ASCII text, then it is not a service to the user to attempt to mask their mistake.

If this were to be implemented, I'd like to see it behind a configuration option. Personally I see it as an unnecessary option and therefore an undue burden on the codebase.

gocom commented 5 years ago

Introducing a feature to encode unicode characters as HTML entities is rather overkill, and out of the scope of the project. A website or application should be serving its content properly, with the encoding it wishes to use, and we don't exactly force you to use UTF-8 if you don't want to.

If one wants to encode their output like that, they can do it separately on the document. It doesn't really relate to parsing Textile syntax, especially when it would apply to the whole document rather than just links, which are currently encoded according to the spec.
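Doing it "separately on the document" could be as small as a post-processing step on the finished output. A sketch in Python, where `entity_encode_document` is a hypothetical helper and `rendered` is a hand-written stand-in for whatever the Textile parser returned:

```python
def entity_encode_document(html_text):
    """Replace every non-ASCII character with a numeric character reference."""
    # ASCII markup (tags, attributes, existing entities) is left untouched.
    return html_text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Stand-in for parser output; no Textile library is called here.
rendered = '<p><a href="http://de/wikipedia.org/wiki/%C3%9Cbermensch">Übermensch</a> costs €5</p>'

encoded = entity_encode_document(rendered)
print(encoded)
```

This keeps the choice with the application serving the page, and the parser itself needs no extra option.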

textile / php-textile

Unicode and html entities #159