python-hyper / h11

A pure-Python, bring-your-own-I/O implementation of HTTP/1.1
https://h11.readthedocs.io/
MIT License

Added ability to use LF, not only CRLF delimiter for response Headers and Body #115

Closed cdeler closed 3 years ago

cdeler commented 3 years ago

Hello,

I want to submit a PR which closes https://github.com/python-hyper/h11/issues/7

Why the changes are required

According to this comment in the issue, there are problems with some old servers which do not fully conform to the HTTP/1.1 RFC. The original issue in httpx (https://github.com/encode/httpx/issues/1378) describes a situation where some embedded-system developers have to deal with a non-standard server

What has been done?

I tried to reimplement the function which extracts the response headers, using a regex

What hasn't been done yet

~Performance testing, fuzzing~

(update) How does maybe_extract_lines work now?

  1. first it extracts the part of the self._data buffer up to the "\n\r?\n" blank line, which gives data
  2. then it determines the line delimiter in data using the "\r?\n" regex, which gives delimiter
  3. then it splits the lines using bytearray's .split(delimiter) (it's much faster than regex.split(data))

With fix

6099.7 requests/sec
6172.2 requests/sec
6094.7 requests/sec
6202.1 requests/sec
6303.3 requests/sec
6358.0 requests/sec
6372.2 requests/sec

Without fix (522b004)

6065.5 requests/sec
6211.9 requests/sec
6211.1 requests/sec
6131.1 requests/sec
6333.1 requests/sec
6103.8 requests/sec
6260.0 requests/sec
cdeler commented 3 years ago

Benchmark results

with fix

(h11) h11/bench % PYTHONPATH=.. python benchmarks/benchmarks.py
5877.3 requests/sec
5874.9 requests/sec
5529.5 requests/sec
5765.3 requests/sec
5903.8 requests/sec
5624.4 requests/sec
5861.4 requests/sec

without fix (latest master 522b00470b5776dda2dc7e98ff778eb1e418e6ad)

(h11) h11/bench % PYTHONPATH=.. python benchmarks/benchmarks.py
6161.7 requests/sec
6201.7 requests/sec
6268.1 requests/sec
6069.3 requests/sec
6105.4 requests/sec
6256.2 requests/sec
6294.8 requests/sec
bdraco commented 3 years ago

If performance is an issue, maybe look for "\r" or "\n" first.

If you see first "\r" use "\r\n" as the delimiter, otherwise use "\n".
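A minimal sketch of that heuristic (the helper name is illustrative, not part of this PR): inspect the first line ending instead of running a regex over the whole buffer.

```python
def detect_delimiter(data: bytes) -> bytes:
    # If the first line ends in CRLF, treat CRLF as the delimiter;
    # otherwise fall back to a bare LF.
    idx = data.find(b"\n")
    if idx > 0 and data[idx - 1 : idx] == b"\r":
        return b"\r\n"
    return b"\n"
```

This only looks at one byte before the first b"\n", so it costs a single find() instead of a regex scan.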

cdeler commented 3 years ago

@bdraco, yes, b"test-test-test".split(b"-") is much faster than regex.split(...)
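The claim can be checked with a quick micro-benchmark (numbers vary by machine; this only illustrates the comparison, it is not the PR's benchmark):

```python
import re
import timeit

data = b"test-test-test" * 1000
dash = re.compile(b"-")

# Both produce the same result...
assert data.split(b"-") == dash.split(data)

# ...but plain bytes.split avoids the regex engine entirely.
t_split = timeit.timeit(lambda: data.split(b"-"), number=1000)
t_regex = timeit.timeit(lambda: dash.split(data), number=1000)
print(f"bytes.split: {t_split:.3f}s  re.split: {t_regex:.3f}s")
```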

cdeler commented 3 years ago

After some performance investigation I changed the algorithm. The changeset now looks as follows.

How does maybe_extract_lines work now?

  1. first it extracts the part of the self._data buffer up to the "\n\r?\n" blank line, which gives data
  2. then it determines the line delimiter in data using the "\r?\n" regex, which gives delimiter
  3. then it splits the lines using bytearray's .split(delimiter) (it's much faster than regex.split(data))
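The three steps above can be sketched as a standalone function operating on a plain bytes buffer instead of h11's ReceiveBuffer (all names here are illustrative, not the PR's actual code):

```python
import re

blank_line_regex = re.compile(rb"\n\r?\n")
delimiter_regex = re.compile(rb"\r?\n")

def extract_lines(buffer: bytes):
    # 1. take everything up to and including the blank line "\n\r?\n"
    match = blank_line_regex.search(buffer)
    if match is None:
        return None, buffer  # headers are not complete yet
    data, rest = buffer[: match.end()], buffer[match.end() :]
    # 2. detect which delimiter the peer actually used
    data = data.rstrip(b"\r\n")
    found = delimiter_regex.search(data)
    delimiter = found.group(0) if found else b"\r\n"
    # 3. split with bytes.split(), which is faster than regex.split()
    return data.split(delimiter), rest
```

Note this sketch assumes the headers use one delimiter consistently; handling a mix of \r\n and \n within one header block is discussed further down the thread.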

With fix

6099.7 requests/sec
6172.2 requests/sec
6094.7 requests/sec
6202.1 requests/sec
6303.3 requests/sec
6358.0 requests/sec
6372.2 requests/sec

Without fix (522b00470b5776dda2dc7e98ff778eb1e418e6ad)

6065.5 requests/sec
6211.9 requests/sec
6211.1 requests/sec
6131.1 requests/sec
6333.1 requests/sec
6103.8 requests/sec
6260.0 requests/sec

fuzzing results

Running `PYTHONPATH=.. py-afl-fuzz -o results -i afl-server-examples/ -- python ./afl-server.py` for 1 cycle, I didn't get any crashes/hangs

mkanet commented 3 years ago

@bdraco do you mind verifying the REST sensor Issue I'm having (Issue #42608) is caused by the issue described here? This all started when Home Assistant version .117.x was released. I posted server response examples here: https://github.com/home-assistant/core/issues/42608#issuecomment-725312591. The server response comes from the latest official Glances API.

bdraco commented 3 years ago

I'm not running glances so it's not something I'm in a position to verify. It would be best to check out this PR and update the system that is experiencing the issue.

mkanet commented 3 years ago

@bdraco thank you for responding. I'm not sure how to check out this PR. Any ideas which Home Assistant beta version will include the changes from this PR?

bdraco commented 3 years ago

Let's keep the discussion in the Home Assistant issue, as it's off topic for this PR.

cdeler commented 3 years ago

Hello @pgjones I made the changes you mentioned.

Then I ran again the benchmark against the changes:

With changes

6290.6 requests/sec
6264.5 requests/sec
6302.6 requests/sec
6328.4 requests/sec
6344.2 requests/sec
6340.0 requests/sec
6347.9 requests/sec

Without changes

6479.7 requests/sec
6480.7 requests/sec
6502.3 requests/sec
6317.8 requests/sec
6458.3 requests/sec
6238.8 requests/sec
6278.7 requests/sec
njsmith commented 3 years ago

Looking over the data gathered in the issue history, I think the semantics we want are:

tomchristie commented 3 years ago

We might reduce the scope of this PR very slightly by changing the order in which we tackle things?...

It's possible that I'm being overly cautious there, but it doesn't read to me as an easily reviewable PR just yet, and dropping out that little bit of complexity first seems sensible, right?

cdeler commented 3 years ago

@njsmith

Thank you for your response, I'm tuning the PR according to the suggestion

The only thing which I don't understand is:

for headers, tolerate a mix of \r\n and plain \n -- this definitely happens in practice

does it mean that we should accept headers like this:

b"HTTP/1.1 200 OK\r\n"
b"Accept-Ranges: bytes\n"
b"Content-type: text/plain\n"
b"Connection: close\r\n"

For now I'm detecting the real_lines_delimiter using regex b"\r?\n" (b"\r\n" in the case above) and using data.split(real_lines_delimiter), which is faster than regex.split(some_bytes)

njsmith commented 3 years ago

does it mean that we should accept headers like this:

b"HTTP/1.1 200 OK\r\n"
b"Accept-Ranges: bytes\n"
b"Content-type: text/plain\n"
b"Connection: close\r\n"

Yes, unfortunately. One of the real examples mentioned in the other thread was a server that only used \n for one header (presumably because most of the response was generated by a standard library that knows how HTTP works, and then that one header was added in by some hacky one-off code written by someone who wasn't familiar with the RFCs).

For now I'm detecting the real_lines_delimiter using regex b"\r?\n" (b"\r\n" in the case above) and using data.split(real_lines_delimiter), which is faster than regex.split(some_bytes)

Here's an idea: what if we always split on \n, so we get lines like:

[
  b"HTTP/1.1 200 OK\r",
  b"Accept-Ranges: bytes",
  b"Content-type: text/plain",
  b"Connection: close\r",
]

and then we handle the trailing \r as part of the header name/value parsing? That already involves doing a regex match on each header line and discarding trailing whitespace...
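A quick sketch of that approach on the mixed-delimiter example (illustrative only):

```python
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Accept-Ranges: bytes\n"
    b"Content-type: text/plain\n"
    b"Connection: close\r\n"
)
# Always split on b"\n"; any stray b"\r" is stripped per line afterwards,
# so CRLF-terminated and bare-LF-terminated lines come out identical.
lines = [line.rstrip(b"\r") for line in raw.split(b"\n") if line]
```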

cdeler commented 3 years ago

Hello @njsmith ,

Sorry for the delay with the answer; it looks like there's a time-zone gap between us.

Not wanting to decrease master-branch performance, I tested 3 different maybe_extract_lines() implementations: one with a regex split, and two with the approach suggested by @njsmith (splitting lines on b"\n"). All the solutions successfully pass these tests.

Example with regex.split(...)

Example code

The core part is `lines = line_delimiter_regex.split(data.rstrip(b"\r\n"))`

```python
def maybe_extract_lines(self):
    if self._data[self._start : self._start + 2] == b"\r\n":
        self._start += 2
        return []
    elif self._data[self._start : self._start + 1] == b"\n":
        self._start += 1
        return []
    else:
        data = self.maybe_extract_until_next(blank_line_delimiter_regex, 3)
        if data is None:
            return None
        lines = line_delimiter_regex.split(data.rstrip(b"\r\n"))
        return lines
```

Bench result is:

6223.9 requests/sec
6247.8 requests/sec
6233.1 requests/sec
6250.9 requests/sec
6231.0 requests/sec
6246.4 requests/sec
6224.8 requests/sec

Example with list comp [it.rstrip(b"\r") ...]

Example code

The core part is `lines = [it.rstrip(b"\r") for it in data.strip(b"\r\n").split(b"\n")]`

```python
def maybe_extract_lines(self):
    if self._data[self._start : self._start + 2] == b"\r\n":
        self._start += 2
        return []
    elif self._data[self._start : self._start + 1] == b"\n":
        self._start += 1
        return []
    else:
        data = self.maybe_extract_until_next(blank_line_delimiter_regex, 3)
        if data is None:
            return None
        lines = [it.rstrip(b"\r") for it in data.strip(b"\r\n").split(b"\n")]
        return lines
```

Bench result is:

6202.6 requests/sec
6430.5 requests/sec
6444.7 requests/sec
6443.1 requests/sec
6437.4 requests/sec
6443.1 requests/sec
6415.2 requests/sec

Example with map(...)

Example code

The core part is `lines = list(map(rstrip_line, data.strip(b"\r\n").split(b"\n")))`

```python
def rstrip_line(line):
    return line.rstrip(b"\r")

...

def maybe_extract_lines(self):
    if self._data[self._start : self._start + 2] == b"\r\n":
        self._start += 2
        return []
    elif self._data[self._start : self._start + 1] == b"\n":
        self._start += 1
        return []
    else:
        data = self.maybe_extract_until_next(blank_line_delimiter_regex, 3)
        if data is None:
            return None
        lines = list(map(rstrip_line, data.strip(b"\r\n").split(b"\n")))
        return lines
```

Bench result is:

6484.1 requests/sec
6497.6 requests/sec
6321.0 requests/sec
6475.9 requests/sec
6485.2 requests/sec
6526.2 requests/sec
6502.7 requests/sec

Current master branch bench (522b00470b5776dda2dc7e98ff778eb1e418e6ad)

6412.4 requests/sec
6349.3 requests/sec
6545.3 requests/sec
6530.8 requests/sec
6546.8 requests/sec
6508.1 requests/sec
6421.0 requests/sec

Sum up

The example with map(...) is the best solution I found.

cdeler commented 3 years ago

@pgjones thank you for your approval!

I'm sorry, I hadn't seen that before I pushed the new changes (a94d7280d4d929d5075e902e7deae926d82559fc), speeding up the bench (the map example from https://github.com/python-hyper/h11/pull/115#issuecomment-731624909)

tomchristie commented 3 years ago

Sketching out @njsmith's "what if we always split on \n ... and then we handle the trailing \r as part of the header name/value parsing?", I think that'd look like this:

import re

BLANK_LINE_REGEX = re.compile(b"\n\r?\n", re.MULTILINE)

class ReceiveBuffer(object):
    def __init__(self):
        self._data = bytearray()
        self._next_line_search = 0
        self._multiple_lines_search = 0

    def __iadd__(self, byteslike):
        self._data += byteslike
        return self

    def maybe_extract_at_most(self, count):
        """
        Extract a fixed number of bytes from the buffer.
        """
        out = self._data[:count]
        if not out:
            return None

        self._data[:count] = b''
        self._next_line_search = 0
        self._multiple_lines_search = 0
        return out

    def maybe_extract_next_line(self):
        """
        Extract the first line, if it is completed in the buffer.
        """
        # Only search in buffer space that we've not already looked at.
        partial_buffer = self._data[self._next_line_search:]
        partial_idx = partial_buffer.find(b"\n")
        if partial_idx == -1:
            self._next_line_search = len(self._data)
            return None

        # Truncate the buffer and return it.
        idx = self._next_line_search + partial_idx
        out = self._data[:idx].rstrip(b"\r")
        self._data[:idx + 1] = b""
        self._next_line_search = 0
        self._multiple_lines_search = 0
        return out

    def maybe_extract_multiple_lines(self):
        """
        Extract everything up to the first blank line, and return a list of lines.
        """
        # Handle the case where we have an immediate empty line.
        if self._data[:1] == b"\n":
            self._data[:1] = b""
            self._next_line_search = 0
            self._multiple_lines_search = 0
            return []

        if self._data[:2] == b"\r\n":
            self._data[:2] = b""
            self._next_line_search = 0
            self._multiple_lines_search = 0
            return []

        # Only search in buffer space that we've not already looked at.
        partial_buffer = self._data[self._multiple_lines_search:]
        match = BLANK_LINE_REGEX.search(partial_buffer)
        if match is None:
            self._multiple_lines_search = len(self._data) - 2
            return None

        # Truncate the buffer and return it.
        idx = self._multiple_lines_search + match.end()
        out = self._data[:idx]
        lines = [line.rstrip(b"\r") for line in out.split(b"\n")]
        self._data[:idx] = b""
        self._next_line_search = 0
        self._multiple_lines_search = 0
        assert lines[-2] == lines[-1] == b""
        return lines[:-2]

We could potentially also drop the rstrip(b"\r") handling in favour of performing that within the header parsing.

I think this maybe ends up reading more simply than methods which take a general-purpose regex?

I've not performance-tested it or tried it out against the test cases - just a first sketch at this point.

cdeler commented 3 years ago

@tomchristie you mean to make maybe_extract_next_line and maybe_extract_multiple_lines independent?

Well, it's definitely more readable than now, e.g. it does not require 2 mutually dependent arguments (needle_regex and max regex length).

The only thing worrying me: I'm not sure if we can truncate the buffer manually until https://github.com/python-hyper/h11/pull/116 has been merged (I'm talking about self._data[:idx] = b"")

tomchristie commented 3 years ago

I'm not sure if we can truncate the buffer manually until #116 has been merged

Possibly. Given that it's only a performance implication on Python 2, it might not necessarily need to be strictly blocked on that. We'd probably want them to occur in the same release, though.
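For reference, the manual truncation under discussion is bytearray slice deletion; a tiny illustration (sample data, not h11 code):

```python
buf = bytearray(b"HTTP/1.1 200 OK\r\nremaining data")
idx = buf.find(b"\n") + 1   # end of the first line
line = bytes(buf[:idx])     # copy out the consumed line
buf[:idx] = b""             # delete the consumed prefix in place
```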

cdeler commented 3 years ago

@tomchristie Having taken your code, I changed it slightly:

1.

- self._multiple_lines_search = len(self._data) - 2
+ self._multiple_lines_search = max(0, len(self._data) - 2)

(without that, self._multiple_lines_search might be negative)

2.

-        idx = self._next_line_search + partial_idx
-        out = self._data[:idx].rstrip(b"\r")
-        self._data[:idx + 1] = b""
+        idx = self._next_line_search + partial_idx + 1
+        out = self._data[:idx]
+        self._data[:idx] = b""

(chunk reader expects that out ends with b"\r\n")

I estimated the performance on python3.8.5. Comparing it with the latest master I got the following results:

master (522b00470b5776dda2dc7e98ff778eb1e418e6ad)

6284.0 requests/sec
6335.5 requests/sec
6443.5 requests/sec
6456.7 requests/sec
6430.6 requests/sec
6409.8 requests/sec
6475.6 requests/sec

Your code (with the patch applied)

6410.8 requests/sec
6406.4 requests/sec
6440.8 requests/sec
6477.5 requests/sec
6206.4 requests/sec
6397.7 requests/sec
6407.0 requests/sec

From my point of view it's almost equal (for both branches I've chosen the best bench.py run results)

@tomchristie if you don't mind I can push these changes

tomchristie commented 3 years ago

if you don't mind I can push these changes

@cdeler Sure thing yup.

I also looked at #120 in isolation, but this PR might end up making it redundant.

tomchristie commented 3 years ago

chunk reader expects that out ends with b"\r\n"

That assumption might not be true after the change, so there might be something to review there.

Update: I'm referring here to the line in _readers.py which is currently self._bytes_to_discard = 2. That's presumably no longer necessarily true after this update.

cdeler commented 3 years ago

@tomchristie

Update: I'm referring here to the line in _readers.py which is currently self._bytes_to_discard = 2. That's presumably no longer necessarily true after this update.

As @njsmith wrote there

for chunked-encoding, we should only support \r\n exactly, no need to support broken peers that use \n

I think in this case maybe_extract_next_line should search for b"\r\n". Then the function might look like this:

    def maybe_extract_next_line(self):
        """
        Extract the first line, if it is completed in the buffer.
        """
        # Only search in buffer space that we've not already looked at.
        search_start_index = max(0, self._next_line_search - 1)
        partial_buffer = self._data[search_start_index :]
        partial_idx = partial_buffer.find(b"\r\n")
        if partial_idx == -1:
            self._next_line_search = len(self._data)
            return None

        # Truncate the buffer and return it.
        # 2 == len(b"\r\n")
        idx = search_start_index + partial_idx + 2
        out = self._data[:idx]
        self._data[:idx] = b""
        self._next_line_search = 0
        self._multiple_lines_search = 0
        return out
cdeler commented 3 years ago

@tomchristie @njsmith

Sorry to disturb you, but is there any estimated date for this PR and the next release? Maybe I can do something to make the reviewing easier?

I'm asking since some users of one downstream project (home-assistant) are suffering from problems where some sensors or devices are not accessible due to issues with line delimiters (e.g. https://github.com/home-assistant/core/issues/42415)

tomchristie commented 3 years ago

First up...

Based on this: https://github.com/home-assistant/core/issues/42415#issuecomment-731127651

Have we confirmed that the issue they're seeing is resolved by handling non-RFC-compliant line endings?

If so I can certainly make some more review time available, and help get this towards a release.

cdeler commented 3 years ago

@tomchristie The credentials provided by @Kane610 produce a 401, so I still haven't verified the latest fix with an original device, but I can confirm that I checked the original PR against one affected device, and the changeset fixed the problem.

I'll try to ping @Kane610 or someone in https://github.com/home-assistant/core/issues/42415 to obtain credentials and check the latest changes against a real device.

cdeler commented 3 years ago

Hello @tomchristie

Given the credentials provided by @wonderiuy, I checked the fix from this PR.

I can confirm that this fix solves an error accessing some Axis resources.

How did I check?

With the requests lib (which they used before):

In [1]: import requests
In [2]: from requests.auth import HTTPDigestAuth
In [3]: url = "http://[REDACTED]/axis-cgi/pwdgrp.cgi?action=get"
In [4]: requests.get(url, auth=HTTPDigestAuth("[REDACTED]", "[REDACTED]"))
Out[4]: <Response [200]>

httpx without fix:

In [1]: import httpx
In [2]: url = "http://[REDACTED]/axis-cgi/pwdgrp.cgi?action=get"
In [3]: auth = httpx.DigestAuth("[REDACTED]", "[REDACTED]")
In [4]: client = httpx.Client(auth=auth)
In [5]: client.get(url)
RemoteProtocolError: peer unexpectedly closed connection

httpx with fix:

In [1]: import httpx
In [2]: url = "http://[REDACTED]/axis-cgi/pwdgrp.cgi?action=get"
In [3]: auth = httpx.DigestAuth("[REDACTED]", "[REDACTED]")
In [4]: client = httpx.Client(auth=auth)
In [5]: client.get(url)
Out[5]: <Response [200 OK]>
Kane610 commented 3 years ago

How would I go about testing @cdeler's fix locally? I'd be happy to verify as well.

wonderiuy commented 3 years ago

You can use the address and credentials I've given to you; the firmware on that device is 5.20.5

Kane610 commented 3 years ago

@wonderiuy I meant getting his changes to my local filesystem :)

Kane610 commented 3 years ago

It was really simple to clone this branch with GitHub Desktop and try it out. It works well on both 5.51 and 5.75, as well as on newer firmwares. Good job @cdeler!

donnib commented 3 years ago

I can also confirm this is working locally ;) Looking forward to the PR being approved.

hoorna commented 3 years ago

Last week Axis published a new firmware version (5.51.7.2) for my camera (M1034-W). Below you will find the release notes.

I don't know if it has something to do with the issue (LF vs CRLF) in this thread, but the first correction (C01) in the new firmware version mentions "Corrected a newline character".

I thought it would be wise to make a note about it in this thread.


Corrections in 5.51.7.2 since 5.51.7.1

=======================================

5.51.7.2:C01 Corrected a newline character in pwdgrp.cgi, introduced in 5.51.6, that could cause problems when parsing the response.

5.51.7.2:C02 Corrected an issue that prevented Action Rule Events from sending images via email.

5.51.7.2:C03 Corrected an issue that caused monolith to timeout and respawn during too many connect/disconnect RTSP streaming requests.

5.51.7.2:C04 Added support to enable/disable X-Frame-Options headers in the plainconfig. By default, X-Frame-Options is enabled and its value is set to "sameorigin".

Kane610 commented 3 years ago

Last week Axis published a new firmware version (5.51.7.2) for my camera (M1034-W). Below you will find the release notes.

I don't know if it has something to do with the issue (LF vs CRLF) in this thread, but the first correction (C01) in the new firmware version mentions "Corrected a newline character".

Thanks! I don't know, but regardless there are other firmwares that don't get this update so it is still needed.

wonderiuy commented 3 years ago

I've checked with the new firmware 5.51.7.2 but the problem is still there, so the fix from @cdeler and @Kane610 is more than welcome :)

astrandb commented 3 years ago

Can also confirm that this PR solves the problems with older versions of Axis cameras in Homeassistant.

Kane610 commented 3 years ago

Hey guys! Any progress on getting this merged?

njsmith commented 3 years ago

Oh also, forgot to say: could you also add a news entry for the next release notes? See newsfragments/README.rst for the details on how to do that.

cdeler commented 3 years ago

Hello @njsmith ,

I resolved all remarks apart from this one.

I can't figure out how to cleanly and carefully fix the broken test case (we probably can't just remove it).

njsmith commented 3 years ago

@cdeler I tweaked your newsfragment to use the correct quoting: in ReST, code literals require double backticks. Super annoying if you're used to markdown, but what can ya do.

@pgjones Note that this PR doesn't quite drop support for py2, but it does change the buffer handling to be O(n**2) on py2, and I'm wondering if we should flag that in the release notes or anything. Or are you planning to drop py2 for real in the next release anyway?

pgjones commented 3 years ago

Let's merge this, and drop Py2 for the next release.

wonderiuy commented 3 years ago

This is like a Christmas gift, a big thank you to everyone involved