(This is probably related to iftechfoundation/ifdb#508, but I wasn't 100% sure so I raised a new issue.)
The IFDB advanced search page https://ifdb.org/search has two folds: "Show all series names appearing in game listings" and "Show all genres used in game listings".
On expanding these folds, I see mojibake (wrong characters, indicating UTF-8 interpreted as ISO-8859-1) for some (but not all) items containing non-ASCII characters.
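To illustrate the kind of corruption I mean, here is a minimal Python sketch. It is not IFDB code, and the name "Café Noir" is made up for the example rather than taken from the listings. It shows why only non-ASCII items are affected: ASCII bytes mean the same thing in both charsets.

```python
# Illustrative only: the sample names are invented, not real IFDB data.
name = "Café Noir"

# The server sends UTF-8 bytes...
utf8_bytes = name.encode("utf-8")

# ...but the receiver decodes them as ISO-8859-1, one character per byte.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # CafÃ© Noir

# Pure-ASCII names are identical under both charsets, so they look fine.
plain = "Zork".encode("utf-8").decode("iso-8859-1")

# The damage is reversible as long as nothing else has touched the text.
restored = garbled.encode("iso-8859-1").decode("utf-8")
```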
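Looking at the search page in browser developer tools, I see that the data for these folds comes from separate HTTP queries. Issuing these queries myself, I see that the HTTP headers contain Content-Type: text/xml;charset=ISO-8859-1. This charset looks wrong for the data. I don't know whether anything takes any notice of it, but it should probably be fixed.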
The XML itself contains <?xml version="1.0" encoding="UTF-8"?>. This is an accurate description of the following XML data.
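For what it's worth, the disagreement between the header and the XML declaration can be checked mechanically. The following is a rough Python sketch, not anything from IFDB: the helper names are hypothetical and the response body is a made-up stand-in for the real data.

```python
import codecs
import re
from email.message import Message


def header_charset(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def declared_encoding(xml_bytes):
    """Extract the encoding named in a leading XML declaration, if any."""
    m = re.match(rb'<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']', xml_bytes)
    return m.group(1).decode("ascii") if m else None


def same_codec(a, b):
    """Compare two charset labels after normalising aliases."""
    return codecs.lookup(a).name == codecs.lookup(b).name


# Header as observed in this report; the body is an invented example.
header = "text/xml;charset=ISO-8859-1"
body = b'<?xml version="1.0" encoding="UTF-8"?><items><item>Caf\xc3\xa9</item></items>'

from_header = header_charset(header)
from_xml = declared_encoding(body)
if not same_codec(from_header, from_xml):
    print(f"mismatch: header says {from_header}, XML says {from_xml}")
```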
Some characters (generally, western European ones) are included in the XML literally, encoded in UTF-8. These are the ones that end up showing mojibake / broken queries.
More 'exotic' characters like Cyrillic are included in the XML as entities (e.g. <item>Кащей</item>, <item>Dokidoki★Date</item>). These are the ones that end up unscathed.
I don't know the niceties of the JavaScript's subsequent interpretation of this XML, but this suggests that emitting more characters as entities, such that the XML is ASCII-only, would at the very least work around the problem.
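As a sketch of that workaround, here is what ASCII-only output could look like, written in Python purely for illustration. I haven't looked at how the server actually generates this XML, the function name is hypothetical, and the sample text is invented. The idea is to escape XML markup characters first, then replace every non-ASCII character with a numeric character reference.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def ascii_xml_text(text):
    """Return text that is safe as XML character data and is pure ASCII."""
    # escape() handles &, < and >; xmlcharrefreplace turns each non-ASCII
    # character into a decimal reference such as &#233;.
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


original = "Café & Кащей"
item = f"<item>{ascii_xml_text(original)}</item>"
print(item)  # <item>Caf&#233; &amp; &#1050;&#1072;&#1097;&#1077;&#1081;</item>

# Pure-ASCII bytes decode identically whichever charset the receiver picks,
# so a wrong Content-Type charset can no longer corrupt the names.
payload = item.encode("ascii")
as_latin1 = payload.decode("iso-8859-1")
as_utf8 = payload.decode("utf-8")
parsed = ET.fromstring(as_latin1).text
```

This only works around the symptom; the Content-Type header would still disagree with the XML declaration.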
(Incidentally, there is one genre name that isn't the loveliest UTF-8: the work Coke Is It! has a genre that looks like "Children's" but contains Unicode code point U+0092 (manually constructed search link). This looks like it was intended to be a curly apostrophe, which is byte 0x92 in Windows-1252, but U+0092 is actually a control character. I was hoping this was crufty ancient data, but it was only added in 2018. Perhaps the best thing to do is to correct that game entry and hope no one does it again.)
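If it turns out there are more entries like this, something along the following lines could find and repair them. This is only a Python sketch under the assumption that any stray C1 control character (U+0080 to U+009F) was meant as the Windows-1252 character with the same byte value; the function name is hypothetical.

```python
import unicodedata


def repair_c1(text):
    """Map C1 control characters to their Windows-1252 lookalikes."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few byte values in this range are undefined in
                # Windows-1252; leave those characters untouched.
                pass
        out.append(ch)
    return "".join(out)


bad = "Children\u0092s"
print(unicodedata.category("\u0092"))  # Cc (a control character)
fixed = repair_c1(bad)
print(fixed == "Children\u2019s")  # True
```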