podverse / podverse-web

Podverse web app written with React and Next.js
https://podverse.fm/about
GNU Affero General Public License v3.0
81 stars 29 forks source link

Incorrect charset on podcast page #1109

Open mitchdowney opened 1 year ago

mitchdowney commented 1 year ago

This issue seems to be reproducible in any browser. It looks like it should be easy to fix, but I don't have any ideas yet on how to fix it.

https://podverse.fm/podcast/imM-3f34Oj


chenasraf commented 1 year ago

Not sure if this will be helpful, but the page is UTF-8 (which is probably correct).

The RSS feed and encoding detectors confirm, with some certainty, that the dynamic content is ISO-8859-1 or windows-1252 (which are nearly equivalent, I think; windows-1252 only differs in the 0x80-0x9F byte range). The language seems to be Catalan.

Simply changing the page's charset directly doesn't work. I'm not familiar with this repo, so how is the backend data being fetched and/or saved? Maybe whatever API generates the content needs to re-encode from the above charsets to UTF-8.
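
To illustrate the mismatch, here is a minimal sketch of what happens when ISO-8859-1 bytes get decoded as UTF-8. It assumes the feed bytes are decoded as UTF-8 somewhere in the pipeline (not confirmed at this point in the thread), and the sample word is just taken from the feed's description.

```javascript
// Bytes as an ISO-8859-1 feed would send them: "ò" is the single byte 0xF2.
const latin1Bytes = Buffer.from('històries', 'latin1')

// Decoding as UTF-8 is wrong here: 0xF2 starts a multi-byte sequence that
// never completes, so the decoder substitutes U+FFFD (the replacement character).
const wrong = new TextDecoder('utf-8').decode(latin1Bytes)

// Decoding with the feed's actual charset recovers the original text.
const right = new TextDecoder('iso-8859-1').decode(latin1Bytes)

console.log(wrong)
console.log(right)
```

Once the bytes have been decoded with the wrong charset, the original character is gone, which would explain why changing the charset on the page afterwards has no effect.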

mitchdowney commented 1 year ago

@chenasraf thanks for taking a look!

We parse RSS feed data and save it in our Postgres database. In case it helps, this is our docker-compose for the database, and this is our koa app.ts file for the Podverse API.

Andrew Woods on Mastodon said the database should also use utf8... I actually don't know how our database handles utf8. I could have sworn I looked into utf8 in the database before and confirmed we have it set, but my Postgres knowledge is shaky at best, and I could very well be missing something.

chenasraf commented 1 year ago

Sounds like you might need to check the charset of the original RSS feed, then re-encode it as UTF-8. It's not always easy to re-encode, but since this is a Node app there is most likely some library for it. Maybe the utf8 lib is enough; I haven't looked into it enough yet.

chenasraf commented 1 year ago

I think I have the solution for you. Here is a minimal reproducible fix. I am assuming you use axios, from what I've seen in some (possibly unrelated) code in the repo, but this can be changed to other request clients, as long as you can get the content as an array buffer.

No external library for the conversion is required, vanilla Node.js should (I believe) be enough.

Here is the gist:

  1. Get the RSS content as array buffer
  2. Get the charset from response headers - default to UTF-8 if not available
  3. Use TextDecoder on the array buffer to decode the text. This returns a regular JavaScript string with the correct characters inside, which can then be stored or served as UTF-8.

Output text from the example below (truncated for length):

```
Crims https://www.ccma.cat/catradio/alacarta/crims/ Carles Porta torna a la narració d'històries fosques i criminals després de l'èxit de "Tor, tretze cases i tres morts". Relats plens de suspens i intriga que mantindran els oients enganxats 61 Thu, 27 Oct 2022 19:35:19 +0200 Tue, 14 Feb 2023 11:52:10 +0200 ca-es Corporació Catalana de Mitjans Audiovisuals, SA Catalunya Ràdio Corporació Catalana de Mitjans Audiovisuals, SA crpodcast@catradio.cat Catalunya Ràdio Crims https://statics.ccma.cat/multimedia/jpg/2/0/1632301204602.jpg https://www.ccma.cat/catradio/alacarta/crims/ Carles Porta torna a la narració d'històries fosques i criminals després de l'èxit de "Tor, tretze cases i tres morts". Relats plens de suspens i intriga que mantindran els oients enganxats no Podcasting True Crime no Les cinc morts de la Granja d'Escarp L'any 1935 cinc membres d'una família moren amb poc temps de diferència al municipi de la Granja d'Escarp, al Segrià. Tots amb símptomes de gastroenteritis aguda. Els veïns temen que sigui el començament d'una epidèmia, però aviat s'adonen que no han de patir per una malaltia. És un assassinat múltiple que va fer que, durant un temps, la gent de fora es referís a la Granja com "el poble del verí". Sat, 01 Apr 2023 21:00:00 +0200 L'any 1935 cinc membres d'una família moren amb poc temps de diferència al municipi de la Granja d'Escarp, al Segrià. Tots amb símptomes de gastroenteritis aguda. Els veïns temen que sigui el començament d'una epidèmia, però aviat s'adonen que no han de patir per una malaltia. És un assassinat múltiple que va fer que, durant un temps, la gent de fora es referís a la Granja com "el poble del verí". no https://audios.ccma.cat/multimedia/mp3/1/7/1680093063771.mp3 00:55:56
```

Here is a working POC of the code, with this specific RSS feed:

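const axios = require('axios')

axios
  .get('http://dinamics.ccma.cat/public/podcast/catradio/xml/9/5/podprograma1859.xml', {
    // Ask for raw bytes so axios does not decode the body itself
    responseType: 'arraybuffer',
  })
  .then((r) => {
    const arrBuff = r.data
    // Read the charset from the Content-Type header; default to UTF-8
    // when the header or its charset parameter is missing
    const contentType = r.headers['content-type'] ?? ''
    const charset = contentType.split('charset=')[1]?.trim() || 'utf-8'
    const decoder = new TextDecoder(charset)
    const text = decoder.decode(arrBuff)
    console.log(text)
  })
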
mitchdowney commented 1 year ago

@chenasraf thanks! I will give TextDecoder a try.

The tricky part for me is that we parse all RSS feeds server side, and store them in our database...so we have to decode the entire RSS feed contents there. I'll see if I can get it working in our parser locally.
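
For the server-side parser, the decoding step could be pulled into a small function that has no network dependency. This is only a sketch: `decodeFeedBody` is a hypothetical name, not something in the Podverse parser, and it assumes the parser can get at the raw response bytes and the Content-Type header.

```javascript
// Hypothetical helper (not part of the Podverse codebase): decode raw feed
// bytes using the charset named in the Content-Type header.
function decodeFeedBody(body, contentType) {
  // Pull the charset out of a header such as "text/xml; charset=ISO-8859-1".
  const match = /charset=["']?([^;"'\s]+)/i.exec(contentType ?? '')
  const charset = match ? match[1] : 'utf-8'
  try {
    return new TextDecoder(charset).decode(body)
  } catch (err) {
    // TextDecoder throws a RangeError for labels it does not recognise;
    // fall back to UTF-8 rather than failing the whole parse.
    return new TextDecoder('utf-8').decode(body)
  }
}

// Usage with bytes encoded the way the Catalan feed sends them.
const sample = Buffer.from('Catalunya Ràdio', 'latin1')
console.log(decodeFeedBody(sample, 'text/xml; charset=ISO-8859-1'))
```

The result is an ordinary string, so whatever the parser already does with the decoded feed text should not need to change.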

mitchdowney commented 1 year ago

@chenasraf oh I forgot to share this. Harvey on Mastodon said he gets this error when opening that podcaster's RSS feed:

Warning (emacs): File contents detected as iso-latin-1. Consider adding an xml declaration with the encoding specified, or saving as utf-8, as mandated by the xml specification.

It sounds like the xml specification doesn't support anything but utf-8? If that's the case, should we decode into charsets other than utf-8?

Also, I see in this line our parser fetches and decodes the RSS content using utf-8. So would this be where I would check which charset is in the response header, and decode it as something other than utf-8? But then if I do that, do our database columns need to be changed to something other than utf-8? I need to do more research...

chenasraf commented 1 year ago

The line you linked to is indeed where the solution would be.

As for the charset in that XML, I honestly do see the charset there, so not sure why that warning is triggered. If you don't wanna parse the XML I think the headers should be enough for this.
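
If the header ever turns out to be missing, one possible fallback is to read the encoding from the XML declaration without parsing the whole document. The sketch below is an assumption on my part rather than something the parser does today; `sniffXmlEncoding` is a made-up name, and it only handles ASCII-compatible encodings (not UTF-16).

```javascript
// Hypothetical fallback: look for encoding="..." in the XML declaration.
function sniffXmlEncoding(body) {
  // The declaration itself is plain ASCII in any ASCII-compatible encoding,
  // so reading the first bytes as latin1 is enough to find it.
  const head = Buffer.from(body).subarray(0, 200).toString('latin1')
  const match = /<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(head)
  return match ? match[1] : null
}

const xml = Buffer.from('<?xml version="1.0" encoding="ISO-8859-1"?><rss></rss>')
console.log(sniffXmlEncoding(xml))
```

The returned label can be passed to TextDecoder the same way as the charset from the header.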

You don't have to change the encoding or collation of the DB, because the fix properly re-encodes the bad text as utf8, so your data should match the table data just fine.
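
A quick sketch of why the database side can stay as it is: once the text has been decoded with the right charset, writing it back out as UTF-8 gives valid UTF-8 bytes. This assumes the database connection sends strings as UTF-8, which I have not checked for Podverse's setup.

```javascript
// "verí" as the ISO-8859-1 feed sends it: 4 bytes, with "í" as 0xED.
const latin1Bytes = Buffer.from('verí', 'latin1')

// Decode with the feed's charset to get an ordinary JavaScript string.
const text = new TextDecoder('iso-8859-1').decode(latin1Bytes)

// Encoding that string as UTF-8 gives 5 bytes: "í" becomes 0xC3 0xAD.
const utf8Bytes = Buffer.from(text, 'utf8')

console.log(latin1Bytes.length, utf8Bytes.length)
console.log(utf8Bytes.toString('hex'))
```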