UTF-8 being broken? - Githubissues

youradds commented 3 years ago

Hi,

I seem to be having some issues with UTF-8 and HTML::Restrict. So this is a test case:

    my $page = `curl -H "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" -L -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; PageThing http://pagething.com) Gecko/2008052906 Firefox/3.0" --compressed --silent --max-time 10 --location --connect-timeout 10 'xe-vn.com'`;
    use File::Slurp::Unicode;
    use HTML::Restrict;
    my %rules = (
        a       => [qw( href title )],
        b       => [],
        caption => [],
        center  => [],
        em      => [],
        i       => [],
        #img     => [qw( alt border height width src style )],
        li      => [],
        ol      => [],
        p       => [],
        span    => [],
        strong  => [],
        sub     => [],
        sup     => [],
        table   => [qw()],
        tbody   => [],
        td      => [],
        tr      => [],
        u       => [],
        ul      => [],
        title   => [],
        br      => [],
        head    => [],
        div     => [],
        meta    => [qw(name content property)],
        html    => [qw(lang)],
        iframe  => [qw(src)]
    );

    my $hr = HTML::Restrict->new( rules => \%rules, uri_schemes => [ undef, 'http', 'https', 'tel', 'mailto' ] );

    $page = $hr->process($page);
    my $path = "./test.txt";
    write_file($path, $page);

If I run that, the output is corrupted:

Chá»£ xe - Váºn chuyá»n - Just another WordPress site

Yet if I comment out $page = $hr->process($page);, it works (but obviously hasn't removed the html code I don't want)

Chợ xe - Vận chuyển - Just another WordPress site

Is there any way around this?

Thanks

oalders commented 3 years ago

Hi @youradds. I don't see your write_file() function in there. Is the output already corrupted if you just print it to the terminal?

youradds commented 3 years ago

Hi @oalders write_file() is coming from File::Slurp::Unicode.

I managed to find a bit of a dirty work around with;

                    `curl -o "$CFG->{admin_root_path}/tmp/$_->{domain}.txt" -L -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; PageThing http://pagething.com) Gecko/2008052906 Firefox/3.0" --compressed --silent --max-time 10 --location --connect-timeout 10 '$_->{domain}'`;

                    if (-e "$CFG->{admin_root_path}/tmp/$_->{domain}.txt") {
                        $page = read_file("$CFG->{admin_root_path}/tmp/$_->{domain}.txt");
                        unlink("$CFG->{admin_root_path}/tmp/$_->{domain}.txt");
                    }

So basically save it directly from curl (so the encoding is saved), and then read the file. That seems to then render ok after. My guess is that $page needed some kind of encoding to work properly when slurped in from the backtick command. The issue is that we don't know the encoding of the pages so its tricky to decode/encode that variable

Cheers

Andy

oalders commented 3 years ago

Hi Andy,

I'm glad to see you found a solution. I'll close this issue for now. Let me know if you need to re-open it.

Best,

Olaf

oalders / html-restrict

UTF-8 being broken? #42