iftechfoundation / ifdb-suggestion-tracker

Bugs and feature requests for a future IFDB update

viewgame links to some series or genre names containing non-ASCII characters don't work #371

Open jtn20 opened 1 year ago

jtn20 commented 1 year ago

There are two listings for "Cannelé & Nomnom" games: beta and episode 1.

They are marked as belonging to a series called "Cannelé & Nomnom".

Clicking on the series link in either of the game pages produces (in Firefox) a "No results were found" page, with the text field displaying "series:CannelÃ© & Nomnom" (that's A-tilde and copyright-sign in place of the expected e-acute).

The search result link in the game page source is /search?searchfor=series:Cannel%C3%A9+%26+Nomnom (resolving to this link as an absolute URL). The %-encoded parts (C3 A9) represent the UTF-8 encoding of U+00E9 LATIN SMALL LETTER E WITH ACUTE. (But when interpreted as ISO-8859-1, they produce the Ã© mojibake seen above.)
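
To make the mismatch concrete, here is a minimal Python sketch that decodes the same query value both ways. It only illustrates the byte-level behaviour described above; it is not IFDB's code, and the variable names are made up for the example.

```python
from urllib.parse import unquote_plus

# The query value exactly as it appears in the game page source.
encoded = "series:Cannel%C3%A9+%26+Nomnom"

# Decoding the %-escaped bytes as UTF-8 gives the intended series name.
as_utf8 = unquote_plus(encoded, encoding="utf-8")

# Decoding the same bytes as ISO-8859-1 turns C3 A9 into two characters
# (A-tilde, copyright sign), which is the mojibake shown in the search box.
as_latin1 = unquote_plus(encoded, encoding="latin-1")

print(as_utf8)
print(as_latin1)
```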

If I hand-hack the URL to encode the ISO-8859-1 (latin1) representation of that character (i.e., series:Cannel%E9+%26+Nomnom) -- link -- that works (returns both games from the series, sensible search box contents).
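
For reference, the hand-hacked value can be reproduced by %-encoding the series name from its Latin-1 bytes rather than its UTF-8 bytes. This is a sketch in Python, not the code IFDB uses to build the link; keeping ":" unescaped via `safe` is my choice so the output matches the URLs quoted here.

```python
from urllib.parse import quote_plus

term = "series:Cannelé & Nomnom"

# Encode from ISO-8859-1 bytes: e-acute is the single byte E9.
latin1_query = quote_plus(term, safe=":", encoding="latin-1")

# Encode from UTF-8 bytes: e-acute becomes the two bytes C3 A9,
# which is what the game page currently emits.
utf8_query = quote_plus(term, safe=":", encoding="utf-8")

print(latin1_query)
print(utf8_query)
```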

I don't know anything about IFDB's internal character representation (and I note open issue #20), but all the pages I've looked at (/viewgame, /editgame, search result) are served with HTTP header Content-Type: text/html; charset=ISO-8859-1, have the same in <meta http-equiv>, and contain literal E9 bytes where "Cannelé" is mentioned. So I guess it's generally Latin-1, and the series name in these games' data is fine, and whatever is transcoding the series name to UTF-8 for the series URL needs squashing.

jtn20 commented 1 year ago

I found some more about this while investigating a probably-distinct but related issue (#372).

First, this affects genres as well. (It would probably also affect languages, systems, etc. if there were any non-ASCII instances of those in practice.)

Second, I see that IFDB is perfectly capable of holding Unicode data, including series names. See for example Кащей прячет смерть, a work (the only one) in the series Кащей. And, unlike my original example, the series link from that game page works, probably because it's using HTML/XML entities, same as in #372.
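
As an illustration of why entities would sidestep the problem: numeric character references are pure ASCII, so they pass through a Latin-1 page and a Latin-1 query unchanged. The Python sketch below shows the general mechanism only; I am assuming, not confirming, that IFDB stores or emits the name in a form like this.

```python
import html

series = "Кащей"

# Characters that Latin-1 cannot represent are replaced by decimal
# character references, leaving an ASCII-only string.
as_entities = series.encode("latin-1", errors="xmlcharrefreplace").decode("latin-1")

# A browser (or html.unescape) turns the references back into the original text.
round_trip = html.unescape(as_entities)

print(as_entities)
print(round_trip)
```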

So it's less obvious that squashing search queries down to Latin1 is the right answer (although it would work). I'm not sure whether using entities in the search to work around it is a good idea either.

Probably in the long run the search query should interpret %-encoded bytes as UTF-8. (Possibly only when enabled by a separate new query string parameter like "&utf8", if we care about keeping old external search links working, since those might rely on the ISO-8859-1 interpretation.)
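
A rough sketch of that opt-in idea, in Python purely for illustration (IFDB's own implementation would differ). The function name `decode_searchfor` is hypothetical, and "utf8" is just the example parameter name suggested above; "searchfor" is the parameter from the existing search URLs.

```python
from urllib.parse import parse_qs

def decode_searchfor(query_string: str) -> str:
    """Return the search term, treating %-bytes as UTF-8 only if 'utf8' is present."""
    # First pass as ISO-8859-1: every byte maps to a character, so this
    # never fails and preserves the behaviour old links rely on.
    params = parse_qs(query_string, keep_blank_values=True, encoding="latin-1")
    if "utf8" in params:
        # The caller opted in, so reinterpret the %-encoded bytes as UTF-8.
        params = parse_qs(query_string, keep_blank_values=True, encoding="utf-8")
    return params.get("searchfor", [""])[0]

legacy = decode_searchfor("searchfor=series:Cannel%E9+%26+Nomnom")
opted_in = decode_searchfor("searchfor=series:Cannel%C3%A9+%26+Nomnom&utf8")
print(legacy)
print(opted_in)
```

Both calls yield the same series name, so old Latin-1 links and new UTF-8 links could coexist under this scheme.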

(I note that I have the IFDB Search Plugin installed in my Firefox, and it already thinks that sending %-encoded UTF-8 is the right thing to do, which doesn't currently work.)

dfabulich commented 1 year ago

It seems like you're getting pretty close to cracking this! I invite you to file a PR on our source repository at https://github.com/iftechfoundation/ifdb if you can figure out a fix.