pouetnet / pouet2.0

The next generation of trumpets. Now with 10% more whining sound.
http://www.pouet.net/

fix rendering of UTF-8 encoded files with DOS/Amiga fonts #114

Closed kajott closed 3 years ago

kajott commented 3 years ago

Mojibake appeared when using the ASCII viewer with the DOS or Amiga fonts on UTF-8 encoded files.

This commit fixes that by not converting text from cp437 or iso-8859-1 to utf-8 if it's already valid UTF-8.

For details, see https://www.pouet.net/topic.php?which=10189&page=8#c573844.

kusma commented 3 years ago

I'm skeptical of this coming back to bite us.

Amiga fonts with UTF-8 encoding aren't a thing. Amigas use Amiga encoding, and modern machines don't use Amiga fonts. So this seems to me to fix a made-up problem.

But it's a bit worse: there's no guarantee that Amiga NFO files aren't valid UTF-8, even if they're not UTF-8 encoded. So this might break rendering of old files.

Altogether, this sounds like it's really a Deadline NFO problem, and should be fixed by issuing a sane NFO instead of patching viewers.

kajott commented 3 years ago

there's no guarantee that Amiga NFO files aren't valid UTF-8

That's what I was pondering too, but I looked at the CP437 and ISO-8859-1 charts for a looong time and didn't find anything that might be useful in an ASCII art context that is valid UTF-8 and didn't require any other sequence of characters that is not valid UTF-8 to be feasible. To recap, if any character in the 0x80...0xFF region is repeated two or more times, or surrounded by 7-bit ASCII characters, it's no longer valid UTF-8. I've had a very hard time imagining a real-world NFO that doesn't use either of these situations.

Long story short: The theoretical threat exists, but chances for a mis-detection "in the wild" are extremely slim. You may be able to craft adversarial examples, but it doesn't come naturally.
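The detection logic described above can be sketched as follows (a minimal Python sketch with a hypothetical helper name, not Pouët's actual code):

```python
def decode_nfo(raw: bytes, legacy: str = "cp437") -> str:
    """Decode an NFO: keep UTF-8 if the bytes already form valid UTF-8,
    otherwise fall back to the assumed legacy codepage."""
    try:
        # bytes.decode raises UnicodeDecodeError on any invalid sequence,
        # e.g. a lone 0x80..0xFF byte surrounded by 7-bit ASCII.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)

# A CP437 line-drawing run like 0xC4 repeated is not valid UTF-8
# (0xC4 is a two-byte lead expecting a continuation byte), so it
# falls through to the CP437 branch and renders as "────":
print(decode_nfo(b"\xc4\xc4\xc4\xc4"))
```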

issuing a sane NFO

So we should define what a "sane NFO" actually is.

The year is 2021, and UTF-8 is the de-facto standard for everything that is text, and for good reason. Especially for party results, where people tend to publish entries with all sorts of funky characters in the names. We certainly want to preserve that in results.txt files.

On the other hand, there's the design aspect. ASCII art is a fixed part of the demoscene, and there's no denying that Topaz and its derivatives are excellent (if not the best) fonts to display ASCII art in.

What you're saying now (and what is currently implemented in Pouët) is that UTF-8 and Topaz are mutually exclusive, i.e. that NFOs can have either non-Latin characters or ASCII art, but not both at the same time. Is this what we want?

kusma commented 3 years ago

You very conveniently left out my context that explains that amiga fonts plus UTF-8 isn't a thing.

Make your PC NFO work fine on PCs, and everything will be fine. You've made an NFO that only renders correctly on fictional setups.

kusma commented 3 years ago

there's no guarantee that Amiga NFO files aren't valid UTF-8

That's what I was pondering too, but I looked at the CP437 and ISO-8859-1 charts for a looong time and didn't find anything that might be useful in an ASCII art context that is valid UTF-8 and didn't require any other sequence of characters that is not valid UTF-8 to be feasible.

Sounds like you haven't done a very good job, then. For example, most CP437 code points up to 0x20 will decode as valid UTF-8, but have a completely different, yet useful, meaning.

For instance, a file that uses such a character in CP437, but no other "special" character, will fail with this logic.

That's not really that unlikely to happen. And that's just looking at the one-byte UTF-8 encoding; there are three more to look out for (or five, depending on the UTF-8 version).
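The one-byte point can be checked directly. Every byte below 0x80 is a valid one-byte UTF-8 sequence, yet a DOS-style viewer renders most of the 0x01..0x1F range as graphics (smileys, card suits, arrows). Note that Python's `cp437` codec keeps that range as control characters, so the graphic interpretation below uses a tiny illustrative glyph table, not a full mapping:

```python
raw = b"\x01\x03"
# Valid UTF-8, but a plain decode yields invisible control characters:
assert raw.decode("utf-8") == "\x01\x03"

# Illustrative excerpt of the classic CP437 graphic set for 0x01..0x04:
DOS_GLYPHS = {0x01: "☺", 0x02: "☻", 0x03: "♥", 0x04: "♦"}
print("".join(DOS_GLYPHS.get(b, chr(b)) for b in raw))  # -> ☺♥
```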

kusma commented 3 years ago

Just a quick check for the two-byte encoding: It's entirely possible to use a character like ─ followed by a character like ¬, which again is a valid UTF-8 sequence when written as CP437.

Sure, the longer they get, the less likely they are to happen. But the point is, I think you've been searching to confirm your hypothesis that this can't happen rather than trying to prove that it can.
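The two-byte case is easy to verify; the byte values below are the standard CP437 positions of the two characters:

```python
# CP437: 0xC4 is '─' (box drawing horizontal), 0xAA is '¬' (not sign).
# As raw bytes, 0xC4 is a two-byte UTF-8 lead (110xxxxx) and 0xAA a
# continuation byte (10xxxxxx), so the pair is also valid UTF-8:
pair = b"\xc4\xaa"
print(pair.decode("cp437"))   # '─¬' -- the intended DOS rendering
print(pair.decode("utf-8"))   # 'Ī'  -- U+012A, the accidental reading
```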

kajott commented 3 years ago

It's entirely possible to use a character like ─ followed by a character like ¬

That's what I meant by "adversarial examples". How likely is it that such a combination is the only one in the whole file? Can you think of any CP437 ASCII art that doesn't use the same line-drawing character twice in succession, or at least interleaves it with a 7-bit ASCII character in between?

But I get your point. You want to be 100% certain, and 100% certainty is unachievable. So, sure, let's err on the conservative side, in more than one sense of the word ;) ISO-8859-1 NFOs with replacement characters it will be, then.

kusma commented 3 years ago

How on earth is box drawing an "adversarial example"? It's entirely common in NFO files.

kajott commented 3 years ago

Again, the problem is not the ─¬ sequence, the problem is that you'd need a whole file consisting of only "accidentally valid" UTF-8 sequences. One single vertical bar followed by a space or end of line, and poof, no UTF-8. One horizontal bar of any kind (including dashed bars with spaces in between), and poof, no UTF-8. While there sure are practical combinations of characters that could be misinterpreted as UTF-8, most cannot, and if any of those non-UTF-8 combinations are present in the file, it's not valid UTF-8.
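The whole-file argument can be illustrated with a small validity check (a sketch with a hypothetical helper name, not Pouët code):

```python
def is_utf8(raw: bytes) -> bool:
    """True if the entire byte string is valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# One vertical bar (CP437 0xB3) followed by a space: 0xB3 is a lone
# continuation byte, so the file as a whole stops being valid UTF-8.
assert not is_utf8(b"\xb3 ")
# A repeated horizontal bar (0xC4 0xC4): the second byte is another
# lead byte where a continuation is required -- also invalid.
assert not is_utf8(b"\xc4\xc4")
# Only the isolated accidental pair on its own would still pass:
assert is_utf8(b"\xc4\xaa")
```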

kusma commented 3 years ago

There are a lot of very short NFO files out there with very few special characters. Those don't fit the pattern you describe.

Gargaj commented 3 years ago

Really the solution would be / should be a hand-curated encoding setting for each NFO; but then who would want to go through 80000 prods? :/

kusma commented 3 years ago

Or maybe re-encode the NFOs as UTF-8 on upload? We could convert all existing ones based on platform assumptions, and let the users choose/preview when uploading... Probably a lot of work for what seems like a pointlessly artificial corner case, though.

Gargaj commented 3 years ago

I don't like the idea of destructive conversion especially when there's a considerable chance that the user will pick the wrong one.

sagamusix commented 3 years ago

As a first step, you could run through the entirety of uploaded files and see if any of them would be mis-detected by KeyJ's suggested changes. Maybe add a preg_match to check if any characters in the range 0x80-0xFF are present at all, and if there aren't, then never assume UTF-8. That would satisfy kusma's counter-example of a CP437 NFO only using character range 0x00-0x7F including control characters.
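The suggested preg_match pre-check, sketched in Python rather than PHP (hypothetical function name):

```python
import re

def assume_utf8(raw: bytes) -> bool:
    # Never assume UTF-8 for pure 7-bit files: a CP437 NFO that only
    # uses 0x00..0x7F (including control-character art) stays legacy.
    if not re.search(rb"[\x80-\xff]", raw):
        return False
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert not assume_utf8(b"\x01\x03 7-bit-only art")  # stays legacy
assert assume_utf8("Iloé".encode("utf-8"))          # genuine UTF-8
```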

kusma commented 3 years ago

As a first step, you could run through the entirety of uploaded files and see if any of them would be mis-detected by KeyJ's suggested changes. Maybe add a preg_match to check if any characters in the range 0x80-0xFF are present at all, and if there aren't, then never assume UTF-8. That would satisfy kusma's counter-example of a CP437 NFO only using character range 0x00-0x7F including control characters.

No, I don't think it would. We would still be able to misinterpret files that, for instance, contain a sequence like ─¬ in CP437.

Besides, running through all existing NFO files doesn't prove that it won't lead to trouble if some old, not-yet-uploaded prods contain problems like that. There are a lot of old releases that still haven't made their way to Pouët.

I want to stress the point that this is trying to solve an entirely artificial problem. UTF-8 based systems don't have Amiga fonts. If the UTF-8 encoded results file from Deadline requires Topaz to render correctly, then Deadline has created this problem all by themselves. Risking breaking other NFO files just doesn't seem like the right thing to do to me.

I'm not saying I'm against making the encoding of the file and the fonts-selection orthogonal if that's done in a reasonably simple and robust way. The approach taken here is simple, but it's not robust. Manually marking things seems error-prone to me.

One alternative option could be to add an "amiga but with utf-8" entry here. That one seems a bit hard to name in a descriptive way, so if so I would propose "nasty deadline workaround" ;)

Even better would probably be to allow specifying both font and encoding to the page, and fixup the Amiga default-override to also specify the encoding. Bonus points if you also add a similar DOS override. We'd also need some UI to allow switching the encoding as well.

It's not clear to me how to decide what a sane default encoding is for party results, though. UTF-8 is probably the right choice for platform-agnostic parties in recent years, but I'm not really sure what the full heuristic for something like that should look like.

Gargaj commented 3 years ago

I want to stress the point that this is trying to solve an entirely artificial problem. UTF-8 based systems don't have Amiga fonts. If the UTF-8 encoded results file from Deadline requires Topaz to render correctly, then Deadline has created this problem all by themselves. Risking breaking other NFO files just doesn't seem like the right thing to do to me.

Hard agree on this. Making an NFO look Amiga-ish using UTF-8 is an outlier.

sagamusix commented 3 years ago

Deadline is not the first party using ANSI art in their results while also wanting to represent international entries in the results file. As a result, we have seen abominations such as CP437 ANSI art mixed with Windows-1252 or ISO-8859 prod titles in the same file at other parties. In that case, you can either enjoy the ANSI art or actually read the results. I don't think it has to be mutually exclusive to view those files with a nice retro font for the ANSI art part and fall back to whatever other monospace font can provide the remaining missing characters. If you need examples, just look at the Revision results files. I quickly checked the 2017-2019 results and all of them contain CP437 ANSI art but ISO-encoded results text (e.g. Iloé turns into IloΘ if you view the file as CP437).
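That Revision example is easy to reproduce with a quick check of the two codepages:

```python
# 'é' is byte 0xE9 in ISO-8859-1, and CP437 maps 0xE9 to the Greek 'Θ',
# so an ISO-encoded title viewed through a CP437 font mangles 'Iloé':
iso_bytes = "Iloé".encode("latin-1")   # latin-1 == ISO-8859-1
print(iso_bytes.decode("cp437"))       # -> IloΘ
```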

The only sane solution here that I can think of really is to encode the whole file as UTF-8, which would also keep e.g. Japanese contributions from ending up completely garbled. And if we go that route, why should we deprive ourselves of viewing the blocky ANSI art with a proper font designed to handle this sort of character? Yes, there will be font substitution happening, so there may or may not be alignment problems. But there is also absolutely no guarantee that the 'Courier New', monospace font stack currently used for the normal HTML view will not have font substitutions leading to the same issue.

Gargaj commented 3 years ago

Deadline is not the first party using ANSI art in their results

They're not using ANSI art, though; they're using UTF-8 art.

kusma commented 3 years ago

I agree that encoding a result file should be done using UTF-8. But I disagree that it's sane to expect such a file to be viewed using Topaz.