Open jidanni opened 5 years ago
Apparently, since no Title normally comes with the HTTP headers, which are all expected to be ASCII, nobody remembered that when a Title is grabbed as a bonus for local files, it might not be ASCII.
Let's have another look here after dumping parts of the website https://jidanni.org/ onto local disk:

$ cd jidanni.org/
02:55 jidanni.org$ mech-dump --headers index.html
Content-Length: 3495
Content-Type: text/html
Last-Modified: Thu, 09 Jul 2020 12:59:11 GMT
Client-Date: Sun, 04 Oct 2020 18:56:05 GMT
Title: ç©ä¸¹å°¼ Dan Jacobson
X-Meta-Charset: utf-8
X-Meta-Viewport: width=device-width
02:56 jidanni.org$ mech-dump --headers location/paper_mailbox.html
Content-Language: zh-tw
Content-Length: 2109
Content-Type: text/html
Last-Modified: Mon, 27 Jan 2020 21:41:29 GMT
Client-Date: Sun, 04 Oct 2020 18:56:17 GMT
Title: ç´ä¿¡ç®±èªªæ Paper mailbox instructions
X-Meta-Viewport: width=device-width
Anyway, we see there are tons of clues for mech-dump to pick up on: X-Meta-Charset, etc. But it misses them.

$ mech-dump --version
2.01
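To illustrate what "picking up on the clue" could look like, here is a minimal sketch in Python (not mech-dump's actual Perl code). It assumes the X-Meta-Charset header reflects a `<meta charset=...>` declaration in the page. The helper names `sniff_meta_charset` and `extract_title` and the sample page are made up for illustration, and the Chinese title is reconstructed from the garbled output above on the assumption that it is UTF-8.

```python
import re

def sniff_meta_charset(raw: bytes, default: str = "latin-1") -> str:
    """Return the charset named in a <meta ... charset=...> tag, else default."""
    # Hypothetical helper: scan only the start of the file for the declaration.
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                  raw[:1024], re.I)
    return m.group(1).decode("ascii").lower() if m else default

def extract_title(raw: bytes) -> str:
    """Decode the page with the sniffed charset, then pull out the <title>."""
    text = raw.decode(sniff_meta_charset(raw), errors="replace")
    m = re.search(r"<title[^>]*>(.*?)</title>", text, re.I | re.S)
    return m.group(1).strip() if m else ""

# Made-up stand-in for a local index.html saved as UTF-8 bytes.
page = ('<html><head><meta charset="utf-8">'
        '<title>積丹尼 Dan Jacobson</title></head></html>').encode("utf-8")

print(extract_title(page))
```

The point of the sketch is only the ordering: the charset is determined before the title bytes are turned into text, so the title is never decoded with a wrong default.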
I can't figure out where this Title header is added, but I am pretty sure that at that point the encoding is already broken. The response knows this is UTF-8, and mech-dump generally sets STDOUT to UTF-8 anyway.
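The garbled Title is consistent with UTF-8 bytes being read as Latin-1 somewhere along the way, though that is an inference from the output, not something confirmed in the code. A small Python sketch of that failure mode, assuming the real title is "積丹尼 Dan Jacobson" (reconstructed from the garbled header, so treat it as an assumption):

```python
# Assumed original title, reconstructed from the garbled output in this thread.
original = "積丹尼 Dan Jacobson"

# What is stored on disk: the UTF-8 encoding of the title.
raw = original.encode("utf-8")

# Misreading those bytes as Latin-1 yields one character per byte,
# which is the kind of mojibake seen in the Title header.
garbled = raw.decode("latin-1")

# Because Latin-1 maps every byte to a character, the mistake is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

If the damage really is of this kind, nothing is lost at that step, and the fix would be to decode the title as UTF-8 (or whatever the document declares) before it is stored in the header.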
There is absolutely no way to get mech-dump to use the correct character set for UTF-8 on local files.
$ wget jidanni.org
$ mech-dump --headers index.html | grep Title
Title: [...gobbledygook...] Dan Jacobson