dpkt http wrong parse_headers logic

GoogleCodeExporter commented 9 years ago

I have encountered a problem that other people have also been complaining 
about on the net.

The error is described here:
http://stackoverflow.com/questions/6337878/parsing-pcap-files-with-dpkt-python

It concerns the dpkt.http.parse_headers function:
"""
While it seems to effectively parse most of the packets, I'm receiving a 
NeedData("premature end of headers") exception on some. They appear to be valid 
packets within WireShark, so I'm a bit confused as to why the exceptions are 
being thrown.
"""

I investigated the problem and found that the exception is raised whenever the 
HTTP header data does not end with "\r\n\r\n".
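
A minimal sketch of the behaviour described above. It assumes a simplified stand-in for the stock parser: the `NeedData` class and the `strict_parse_headers` function are illustrative names written for this example, not dpkt's actual code. When the input runs out before a blank line is seen, the strict parser raises; when the block ends with "\r\n\r\n", it parses normally.

```python
import io


class NeedData(Exception):
    """Stand-in for dpkt.NeedData (assumption: defined only for this sketch)."""


def strict_parse_headers(f):
    """Simplified strict parser: requires a blank line after the headers."""
    d = {}
    while True:
        line = f.readline()
        if not line:
            # Input ran out before the terminating blank line was seen.
            raise NeedData('premature end of headers')
        line = line.strip()
        if not line:
            break  # blank line: normal end of the header block
        name, _, value = line.partition(':')
        d[name.strip().lower()] = value.strip()
    return d


complete = io.StringIO('Host: example.com\r\nAccept: */*\r\n\r\n')
truncated = io.StringIO('Host: example.com\r\nAccept: */*\r\n')

print(strict_parse_headers(complete))
try:
    strict_parse_headers(truncated)
except NeedData as exc:
    print('NeedData:', exc)
```

Both inputs carry the same two headers; only the missing final "\r\n" makes the second one fail.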

I have examined more than 10 million HTTP requests, and there was not a single 
packet for which this exception was relevant.

I changed parse_headers in my installation to look like this:

def parse_headers(f):
    """Return dict of HTTP headers parsed from a file object."""
    d = {}
    while 1:
        line = f.readline()
        if not line:
            break
        # I commented out the raise below because the parser is too strict:
        # when the HTTP data does not end with '\r\n\r\n', parsing fails.
        #    raise dpkt.NeedData('premature end of headers')

        # The next two lines stop the parsing when the line is blank
        # (i.e. it equals '\r\n').
        line = line.strip()
        if not line:
            break
        l = line.split(None, 1)
        if not l[0].endswith(':'):
            raise dpkt.UnpackError('invalid header: %r' % line)
        k = l[0][:-1].lower()
        v = len(l) != 1 and l[1] or ''
        if k in d:
            if not type(d[k]) is list:
                d[k] = [d[k]]
            d[k].append(v)
        else:
            d[k] = v
    return d

and it's working great.
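
To illustrate the effect of the change, here is a self-contained Python 3 sketch of the patched logic. It is a sketch under assumptions, not dpkt's code: the name `lenient_parse_headers` is hypothetical, and `ValueError` stands in for `dpkt.UnpackError` so the example runs without dpkt installed.

```python
import io


def lenient_parse_headers(f):
    """Sketch of the patched logic: end of input is treated like a blank line."""
    d = {}
    while True:
        line = f.readline()
        if not line:
            break  # end of input: stop instead of raising NeedData
        line = line.strip()
        if not line:
            break  # blank line: normal end of the header block
        parts = line.split(None, 1)
        if not parts[0].endswith(':'):
            # ValueError stands in for dpkt.UnpackError in this sketch.
            raise ValueError('invalid header: %r' % line)
        k = parts[0][:-1].lower()
        v = parts[1] if len(parts) > 1 else ''
        if k in d:
            # Repeated header names are collected into a list.
            if not isinstance(d[k], list):
                d[k] = [d[k]]
            d[k].append(v)
        else:
            d[k] = v
    return d


# Header block cut off before the terminating blank line.
truncated = io.StringIO('Host: example.com\r\nCookie: a=1\r\nCookie: b=2\r\n')
headers = lenient_parse_headers(truncated)
print(headers)
```

One trade-off worth keeping in mind: with this change the parser can no longer tell a truncated header block from a complete one, so callers that reassemble streams and rely on NeedData to ask for more bytes would lose that signal.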

Many thanks,
Almog

Original issue reported on code.google.com by almog1...@gmail.com on 21 May 2012 at 9:52

GoogleCodeExporter commented 9 years ago

There's a separate but related bug here: l = line.split(None, 1)

The HTTP RFC indicates that the delimiter between a header name and its value 
is ":".  Both current dpkt and this patch assume there is whitespace following 
the colon.

i.e., 
Header:Value   
vs. 
Header: Value

Both are valid HTTP headers.  That block should more appropriately read:

        line = line.strip()
        if not line:
            break
        l = line.split(":", 1)
        if len(l) != 2:
            raise dpkt.UnpackError('invalid header: %r' % line)
        k = l[0].lower()
        # len(l) == 2 is guaranteed here; drop the optional whitespace after ':'
        v = l[1].lstrip()

I've tested this across a set of 170k HTTP Requests, no issues.
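
The difference between the two splitting strategies can be seen with plain string operations. The snippet below is a small sketch; `split_header` is a hypothetical helper written for this illustration and is not part of dpkt.

```python
def split_header(line):
    """Split 'Name: value' or 'Name:value' on the first colon only."""
    name, sep, value = line.partition(':')
    if not sep:
        raise ValueError('invalid header: %r' % line)
    # Whitespace after the colon is optional, so strip it from the value.
    return name.strip().lower(), value.strip()


for line in ('Header: Value', 'Header:Value', 'Host: example.com:8080'):
    by_space = line.split(None, 1)   # what the whitespace-based parser sees
    by_colon = line.split(':', 1)    # what the colon-based parser sees
    print(repr(line), by_space, by_colon, split_header(line))
```

With whitespace splitting, 'Header:Value' stays a single token that does not end with ':', so it is rejected as invalid. Splitting on the first colon accepts both forms, but leaves the leading space in ' Value' to be stripped, and it keeps later colons (such as a port number) inside the value.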

Original comment by jeffrey.guy@gmail.com on 1 Aug 2012 at 9:56

GoogleCodeExporter commented 9 years ago

Hi, I get the following error:

    raise dpkt.NeedData('short body (missing %d bytes)' % (n - len(body)))
dpkt.dpkt.NeedData: short body (missing 282 bytes)

It comes from these lines in http.py (parse_body):

        if len(body) != n:
            raise dpkt.NeedData('short body (missing %d bytes)' % (n - len(body)))

Any solutions?
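
The check quoted above compares the number of body bytes actually read against the expected length n. The following is a minimal sketch of how that situation can arise, under two assumptions that are not confirmed by this report: that n comes from the Content-Length header, and that the buffer being parsed holds only the first TCP segment of a longer HTTP message. `NeedData` and `read_body` are stand-ins written for this example, not dpkt's code.

```python
import io


class NeedData(Exception):
    """Stand-in for dpkt.NeedData (assumption: defined only for this sketch)."""


def read_body(f, headers):
    """Read exactly Content-Length bytes, or raise if fewer are available."""
    n = int(headers.get('content-length', 0))
    body = f.read(n)
    if len(body) != n:
        raise NeedData('short body (missing %d bytes)' % (n - len(body)))
    return body


headers = {'content-length': '300'}
first_segment = b'x' * 18          # only part of the body is in this packet

try:
    read_body(io.BytesIO(first_segment), headers)
except NeedData as exc:
    print(exc)                     # short body (missing 282 bytes)

# Once the remaining bytes are appended, the same call succeeds.
reassembled = first_segment + b'y' * 282
print(len(read_body(io.BytesIO(reassembled), headers)))
```

If that assumption holds, the usual remedy is to reassemble the TCP stream before handing the data to the HTTP parser, or to catch NeedData and wait for more segments.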

Original comment by daw...@un0wn.org on 6 Sep 2012 at 8:03
