libwww-perl / WWW-Mechanize

Handy web browsing in a Perl object
https://metacpan.org/pod/WWW::Mechanize

mech-dump --headers forces Latin1 on local files #270

Open jidanni opened 5 years ago

jidanni commented 5 years ago

There is absolutely no way to get mech-dump to use the correct character set for UTF-8 on local files.

$ wget jidanni.org
$ mech-dump --headers index.html | grep Title
Title: [...gobbledygook...] Dan Jacobson

jidanni commented 4 years ago

Apparently the Title line does not come with the HTTP headers, which are all expected to be ASCII. So when Title is grabbed as a bonus for local files, nobody remembered that it might not be ASCII.

Let's have another look here after dumping parts of website https://jidanni.org/ onto local disk:

$ cd jidanni.org/
02:55 jidanni.org$ mech-dump --headers index.html
Content-Length: 3495
Content-Type: text/html
Last-Modified: Thu, 09 Jul 2020 12:59:11 GMT
Client-Date: Sun, 04 Oct 2020 18:56:05 GMT
Title: ç©ä¸¹å°¼ Dan Jacobson
X-Meta-Charset: utf-8
X-Meta-Viewport: width=device-width
02:56 jidanni.org$ mech-dump --headers location/paper_mailbox.html
Content-Language: zh-tw
Content-Length: 2109
Content-Type: text/html
Last-Modified: Mon, 27 Jan 2020 21:41:29 GMT
Client-Date: Sun, 04 Oct 2020 18:56:17 GMT
Title: ç´ä¿¡ç®±èªªæ Paper mailbox instructions
X-Meta-Viewport: width=device-width

Anyway, we see there are tons of clues for mech-dump to pick up on: X-Meta-Charset, etc. But it misses them.

$ mech-dump --version
2.01
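
To make the "clues" point concrete, here is a minimal sketch of how a charset hint could be picked out of the header lines shown above and used to decode the title. It is written in Python purely for illustration and is not mech-dump's code. The helper name `pick_charset`, the Latin-1 fallback, and the sample title text (inferred by reversing the garbled Title line) are all assumptions.

```python
import re

def pick_charset(headers, default="latin-1"):
    """Hypothetical helper: return the first charset hint in a header dict."""
    # Clue 1: the charset taken from the page's <meta charset=...> tag.
    meta = headers.get("X-Meta-Charset")
    if meta:
        return meta.strip()
    # Clue 2: a charset parameter on Content-Type, if there is one.
    match = re.search(r'charset="?([\w-]+)', headers.get("Content-Type", ""), re.I)
    return match.group(1) if match else default

# Header values copied from the index.html output above.
headers = {
    "Content-Type": "text/html",
    "X-Meta-Charset": "utf-8",
    "X-Meta-Viewport": "width=device-width",
}

# Assumed title text, stored in the file as UTF-8 bytes.
title_bytes = "積丹尼 Dan Jacobson".encode("utf-8")

# Decode with the hinted charset instead of a fixed default.
title = title_bytes.decode(pick_charset(headers))
print(title)
```

Note that the second file above shows no X-Meta-Charset line, so a lookup like this would fall back to the default there unless other hints were consulted.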

simbabque commented 2 years ago

I can't figure out where this title header is added, but I am pretty sure at that point the encoding is broken. The response knows this is utf-8 and generally mech-dump turns STDOUT into utf8 anyway.
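
The garbled Title is consistent with UTF-8 bytes being read as Latin-1 and then written to a UTF-8 STDOUT. The following Python sketch reproduces that sequence for illustration only. It does not show where the problem happens inside mech-dump, and the original title text is an assumption inferred by reversing the garbled output.

```python
import unicodedata

# Assumed original <title> text (inferred, not taken from the page source).
title = "積丹尼 Dan Jacobson"

# Step 1: the local file stores the title as UTF-8 bytes.
raw = title.encode("utf-8")

# Step 2 (suspected fault): the bytes are interpreted as Latin-1,
# so every byte becomes one separate character.
misread = raw.decode("latin-1")

# Step 3: STDOUT is UTF-8, so each misread character is encoded again.
printed = misread.encode("utf-8")

# C1 control characters are usually invisible in a terminal or web page;
# dropping them gives the text as it appears in the report.
visible = "".join(ch for ch in misread if unicodedata.category(ch) != "Cc")
print(visible)  # same characters as the Title line for index.html above

# The damage is reversible as long as no bytes were lost on the way.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == title)  # True
```

If this is what happens, decoding the title bytes as UTF-8 once, before they reach the UTF-8 output layer, would be enough to avoid the double encoding.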