ASCII vs UTF-8

lassik commented 4 years ago

@johnwcowan I think XML and HTML can both always be written as ASCII, using &#xABCD; numeric character references for any characters that aren't ASCII graphic. Is this right?

If so, I think the procedures in this SRFI should always write ASCII. It's just much simpler to not have to rely on Content-Type. An ASCII file will work fine no matter what the Content-Type is.
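A minimal sketch of the all-ASCII idea, in Python rather than Scheme. The function name `to_ascii_markup` is made up for illustration. It relies on Python's built-in `xmlcharrefreplace` error handler, which emits decimal references rather than the hex form mentioned above, and it assumes markup-significant characters such as `<` and `&` were already escaped. ASCII control characters are passed through untouched.

```python
def to_ascii_markup(text: str) -> bytes:
    """Encode text as ASCII, replacing every non-ASCII character
    with a decimal numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace")


print(to_ascii_markup("<p>漢字 café</p>"))
# b'<p>&#28450;&#23383; caf&#233;</p>'
```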

johnwcowan commented 4 years ago

It's correct that any character can be escaped in an attribute value or text content. (Not so in tag and attribute names, but all HTML names and the vast majority of XML ones are in ASCII.)

However, escaping a Chinese document makes it go from 3 bytes per Chinese character in UTF-8 to 8 bytes per character (for example, &#x6F22;). That cuts transmission rates by more than half (though not so much if there is plenty of markup).
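The per-character cost can be checked directly. The sketch below (Python, with a hypothetical helper name) counts bytes for one character as raw UTF-8, as a hex reference, and as a decimal reference. For 漢 both reference forms come to 8 bytes against 3 bytes of UTF-8; characters with shorter code points produce shorter references.

```python
def reference_sizes(ch: str) -> tuple[int, int, int]:
    """Return byte counts for one character:
    raw UTF-8, hex character reference, decimal character reference."""
    utf8_len = len(ch.encode("utf-8"))
    hex_ref = f"&#x{ord(ch):X};"   # e.g. &#x6F22;
    dec_ref = f"&#{ord(ch)};"      # e.g. &#28450;
    return utf8_len, len(hex_ref), len(dec_ref)


print(reference_sizes("漢"))  # (3, 8, 8)
```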

Always sending UTF-8 and automatically appending ";charset=utf-8" to "text/*" types seems perfectly straightforward to me.
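A rough sketch of what "automatically appending" could look like, in Python. The function name and the exact matching rules are assumptions for illustration, not part of any proposed SRFI API: it adds the parameter only to `text/*` types that do not already carry a charset.

```python
def with_utf8_charset(content_type: str) -> str:
    """Append a UTF-8 charset parameter to text/* media types that lack one."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("text/") and "charset=" not in content_type.lower():
        return f"{content_type}; charset=utf-8"
    return content_type


print(with_utf8_charset("text/html"))   # text/html; charset=utf-8
print(with_utf8_charset("image/png"))   # image/png
```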

https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types lists the MIME types that hyperserver should be seeded with.

lassik commented 4 years ago

It's not straightforward. I keep running into charset problems; they keep cropping up year after year. I'd prefer the all-ASCII ones because they work everywhere.

Since this is meant to be a minimal SRFI, we could make a more complex one that allows setting the charset.

johnwcowan commented 4 years ago

If the encoding is always UTF-8, as the WHATWG and the W3C strongly recommend, then it is always safe to specify the charset (although it is also recommended to put it into a meta element).

lassik commented 4 years ago

I regularly run into software that doesn't follow these strong recommendations. Have never run into anything where ASCII doesn't work.

johnwcowan commented 4 years ago

What kind of software? Servers? Browsers? Offline document processors?

As I understand it, our use case is "write Scheme and/or S-expressions, generate XML and/or CSS". That means we control how it comes out, either as strings or texts or bytevectors.

lassik commented 4 years ago

> What kind of software? Servers? Browsers? Offline document processors?

Last case was nginx (i.e. server) a few months back; it was serving text or HTML files with a default charset other than UTF-8 and it was so hard to change that I gave up.

I recall having some problems on the browser end as well, but none with command-line filters or such.

Most of these problems are due to bad or missing defaults for HTTP content-type and HTML meta charset. There's so much software out there that it's hard to have everything configured right by default. HTML is great in that we have the option of encoding everything as ASCII, which not all formats give us.

> As I understand it, our use case is "write Scheme and/or S-expressions, generate XML and/or CSS".

That's right.

> That means we control how it comes out, either as strings or texts or bytevectors.

Yes. We have string->utf8 and transcoded ports so on our end there's no problem. The problems tend to start when the files leave the computer on which they were generated.

johnwcowan commented 4 years ago

On Mon, Sep 21, 2020 at 4:12 PM Lassi Kortela notifications@github.com wrote:

> Last case was nginx (i.e. server) a few months back; it was serving text or HTML files with a default charset other than UTF-8 and it was so hard to change that I gave up.

Yes, there do seem to be a lot of SO questions on that point. But there is also a recipe for solving it: make sure that nginx.conf has "http { charset utf-8; }" in it (which forces the Content-Type header to be correct), make sure that each HTML file has a proper <meta charset="utf-8"/>, and make sure that the content actually is in UTF-8, in increasing order of importance.
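The two parts of that recipe that the generator controls (the bytes really being UTF-8, and the meta element being present) can be checked mechanically. Below is a small Python sketch; the function name and messages are hypothetical, the regular expression is a deliberately loose approximation rather than a real HTML parser, and the 1024-byte window reflects the HTML requirement that the encoding declaration appear early in the document.

```python
import re

# Loose match for <meta charset="utf-8"> in its common spellings.
META_CHARSET = re.compile(rb'<meta\s+charset\s*=\s*["\']?utf-8', re.IGNORECASE)


def check_html_bytes(data: bytes) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("content is not valid UTF-8")
    if not META_CHARSET.search(data[:1024]):
        problems.append("no utf-8 meta charset in the first 1024 bytes")
    return problems


page = '<!DOCTYPE html><meta charset="utf-8"/><p>漢</p>'.encode("utf-8")
print(check_html_bytes(page))  # []
```

The Content-Type header is not covered here, since it is set by the server and not by the generated file.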

> Most of these problems are due to bad or missing defaults for HTTP content-type and HTML meta charset.

We can make sure that when we construct HTML it always has the meta element in the right place and with the right value. There is no need to be flexible about it. The recipe for getting the header right is server-dependent, of course.

lassik commented 4 years ago

I for one wasn't smart enough to get nginx to work with reasonable effort.

The meta tag makes things more complex:

- Users have to remember to insert it.
- Or we have to splice it into user SXML; that's complicated.
- If the web server and the meta tag disagree on the content-type, who wins? There's a rule about it somewhere, but why are we thinking about this stuff?

What problems are there with all-ASCII other than some wasted space for CJK text? Modern web servers can use gzip.
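One way to get a feel for how much gzip narrows the gap is to measure it. This Python sketch (hypothetical helper name, made-up sample text) reports raw and gzipped sizes for the same markup encoded as UTF-8 and as ASCII with character references. The sample is highly repetitive, so real documents will compress less well than this; the numbers are only meaningful on representative input.

```python
import gzip


def compare_sizes(text: str) -> dict[str, int]:
    """Byte sizes of text as UTF-8 and as ASCII with numeric
    character references, both raw and gzip-compressed."""
    utf8 = text.encode("utf-8")
    ascii_escaped = text.encode("ascii", errors="xmlcharrefreplace")
    return {
        "utf8_raw": len(utf8),
        "ascii_raw": len(ascii_escaped),
        "utf8_gzip": len(gzip.compress(utf8)),
        "ascii_gzip": len(gzip.compress(ascii_escaped)),
    }


sample = "<p>漢字仮名交じり文の例です。</p>" * 50
print(compare_sizes(sample))
```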

If you want to edit the CJK in a text editor, the current SRFI also doesn't indent the HTML so editing is hard. People don't agree on how it should be indented. If you give HTML Tidy some character references like &#x6F22; it decodes them (to 漢 in this case). (For some reason, it doesn't insert a meta tag to say what character encoding it uses to do that, although it inserts other tags such as missing <html> and <body>. The modern version I have seems to use UTF-8.)

dpk commented 3 years ago

As I wrote in #4, &xyz; escaping doesn’t work in <script>, <style>, etc. elements, but Unicode characters can occur freely inside them. So ‘escape everything to try to render as ASCII’ doesn’t work for HTML, either.
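To illustrate the point, here is a Python sketch of a text serializer that escapes ordinary element content but has to leave script and style content alone. The function and constant names are invented for this example and do not come from the SRFI draft.

```python
import html

# Elements whose text content is not subject to character-reference decoding.
RAW_TEXT_ELEMENTS = {"script", "style"}


def serialize_text(parent_tag: str, text: str) -> str:
    """Serialize a text node given the name of its parent element."""
    if parent_tag.lower() in RAW_TEXT_ELEMENTS:
        # References would be taken literally here, so emit the text as-is.
        return text
    escaped = html.escape(text, quote=False)  # escapes &, < and >
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(serialize_text("p", "a < 漢"))                          # a &lt; &#28450;
print(serialize_text("style", 'p::before { content: "漢" }'))  # unchanged
```

Getting ASCII-only output inside those elements would require the embedded language's own escapes (for example `\u6F22` in JavaScript), which is outside what an HTML writer can do generically.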

In any case, doesn’t this library return Scheme strings, which are arrays of codepoints, not of bytes? How is encoding a concern of this library? The only way this could cause problems at this level is if the <meta charset> gets set to something that doesn’t match the eventual on-disk/on-the-wire encoding the HTML is actually sent with. But the content of <meta charset> isn’t our concern either, anyway, since it’s part of the SXML input tree — it’s the programmer’s responsibility to make sure the two match.

lassik commented 3 years ago

Returning the HTML as a Scheme string is ill-advised if the caller is supposed to manually escape characters in it later, since it's harder to figure out at that stage whether you're in a script or style tag. If &-escaping characters inside those tags doesn't work right, we shouldn't do that.

Is there a way we can detect a script or style tag with contents that confuse HTML's rules for detecting where the end tag is? That's a similar problem.
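A conservative check is possible, sketched here in Python with a hypothetical function name: reject any script or style content containing `</` followed by the element's own name, compared case-insensitively. This flags some content a browser would tolerate, and it does not cover the additional `<!--` handling that HTML applies inside script elements.

```python
def has_premature_end_tag(tag: str, content: str) -> bool:
    """True if content contains text that could be read as the
    end tag of the enclosing raw text element."""
    return ("</" + tag.lower()) in content.lower()


print(has_premature_end_tag("script", 'var s = "</SCRIPT>";'))  # True
print(has_premature_end_tag("style", "p { color: red }"))       # False
```

A writer could raise an error when this returns true rather than try to repair the content, since the right fix depends on the embedded language.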

dpk commented 3 years ago

If we’re returning bytevectors and not strings, then the output encoding should be in the control of the programmer. (Potentially this makes this library dependent on the encoding library, although for SRFI purposes supporting utf-8 only could be allowed by the spec.)

dpk commented 3 years ago

Of note: the HTML serialization algorithm is written in terms of producing a stream of codepoints, not of bytes

pre-srfi / minimal-html-css-writer

ASCII vs UTF-8 #2