libwww-perl / HTTP-Message

The HTTP-Message distribution contains classes useful for representing the messages passed in HTTP style communication.
https://metacpan.org/pod/HTTP::Message
Other
28 stars 60 forks source link

lwp-download fails with HTTP::Message 6.06 #3

Closed njtaylor closed 1 year ago

njtaylor commented 11 years ago

Found with WWW::YouTube::Download, youtube-download only reads first chunk of 4096 bytes. Using HTTP::Message 6.04 the same download would work. Then tried with lwp-download the same thing happened.

With HTTP::Message 6.06

$ lwp-download 'http://www.youtube.com/watch?v=WLu1fi5FeOY&feature=g-all-xit' kk 5.34 KB received
Illegal field name 'X-Meta-Twitter:card' at /usr/local/libdata/perl5/site_perl/amd64-openbsd/HTML/HeadParser.pm line 207 Transfer aborted. Delete kk? [n] Use of uninitialized value $length in numeric gt (>) at /usr/local/bin/lwp-download line 255, line 1.

$ sudo pkg_add -r p5-HTTP-Message p5-HTTP-Message-6.06->6.04: ok Read shared items: ok

$ rm kk $ lwp-download 'http://www.youtube.com/watch?v=WLu1fi5FeOY&feature=g-all-xit' kk 132 KB received
$

$ curl -i -o kkk 'http://www.youtube.com/watch?v=WLu1fi5FeOY&feature=g-all-xit'
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 132k 0 132k 0 0 178k 0 --:--:-- --:--:-- --:--:-- 193k $ grep -i twitter:card kkk

$

This is using libwww-perl 6.04 $ pkg_info | grep libwww p5-LWP-Protocol-https-6.03 adds support for https to p5-libwww p5-libwww-6.04 library for WWW access in Perl $

njtaylor commented 11 years ago

Patched the HTTP-Message 6.06 removing this change

https://github.com/libwww-perl/http-message/commit/5aaa8144afba437653c3856e1877af9aa7ff900a

Disallow empty field names or field names containing ':'

We still have the documented hack of prefixing field names with ':' to suppress any case normalizations that would occur.

Working again twitter:card only appears in the contents, should never have ever got into headers, true fault is elsewhere.

njtaylor commented 11 years ago

Found problem HTTP::Headers is used by HTML::HeadParser so X-Meta-twitter:card is being placed in the list of headers,

Fixes the problem by excluding fields beginning X-Meta - the following patch has been created for OpenBSD ports package.

$OpenBSD$
--- lib/HTTP/Headers.pm.orig    Sat Oct 20 10:11:21 2012
+++ lib/HTTP/Headers.pm Tue Nov  6 16:21:11 2012
@@ -150,7 +150,7 @@ sub _header
     my($self, $field, $val, $op) = @_;

     Carp::croak("Illegal field name '$field'")
-        if rindex($field, ':') > 1 || !length($field);
+        if ( $field !~ /^X-Meta/ && rindex($field, ':') > 1) || !length($field);

     unless ($field =~ /^:/) {
        $field =~ tr/_/-/ if $TRANSLATE_UNDERSCORE;
ghost commented 11 years ago

BBC News articles have meta name="twitter:card" value="summary" in their articles now. When parsing out DCTERMS.modified meta headers from BBC news reports using HTML::HeadParser this bug (Illegal field name 'X-Meta-Twitter:card') is observed.

Fix above (to Headers.pm) confirmed to solve issue.

ikapelyukhin commented 11 years ago

Stumbled upon this problem too. What's the point of croaking on invalid field names anyway? The web server can produce arbitrary headers, valid or not.

ghost commented 11 years ago

This is a serious problem, that twitter junk is on half the pages I scrape. Why is this still broken after 6 months, especially since somebody provided a patch?

xatier commented 11 years ago

X- prefix means the custom HTTP headers, we need to ignore it, don't we?

ikapelyukhin commented 11 years ago

Well, the script shouldn't die on invalid HTTP header -- that's for sure. As HTML parser shouldn't die on invalid HTML markup.

fkztw commented 11 years ago

Maybe use carp instead of croak?

https://github.com/libwww-perl/http-message/issues/3#issuecomment-10118644 This solution indeed works. I don't get it why authors still let it broken after someone gave a patch 6 months ago...

luc2 commented 11 years ago

without this issue report, i was fucked. thanks to everyone.

shalk commented 11 years ago

I meet the problem when i do this : my $browser = LWP::UserAgent->new(); my $response = $browser->get("http://oreilly.com"); print $response->as_string;

thank you for everyone ,espcially njtaylor's code

ckilling2013 commented 10 years ago

This bug is also seen using the Chef API.. A required header 'x-ops-sign:version' => '1.0' prevents usage. A slight adjustment of the regex resolved my issue.. Fix below

    if ( $field !~ /^X-/ && rindex($field, ':')) > 1 || !length($field);
neilb commented 1 year ago

I think this can be closed – HTTP header field names shouldn't ever include a colon, and all the examples above (which were 9 or so years ago) work fine now. Seems like people have stopped using broken field names?

oalders commented 1 year ago

Thanks, @neilb!