cdeler closed this 3 years ago
(h11) h11/bench % PYTHONPATH=.. python benchmarks/benchmarks.py
5877.3 requests/sec
5874.9 requests/sec
5529.5 requests/sec
5765.3 requests/sec
5903.8 requests/sec
5624.4 requests/sec
5861.4 requests/sec
(h11) h11/bench % PYTHONPATH=.. python benchmarks/benchmarks.py
6161.7 requests/sec
6201.7 requests/sec
6268.1 requests/sec
6069.3 requests/sec
6105.4 requests/sec
6256.2 requests/sec
6294.8 requests/sec
If performance is an issue, maybe look for "\r" or "\n" first.
If you see "\r" first, use "\r\n" as the delimiter; otherwise use "\n".
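A minimal sketch of that delimiter-detection idea (my own illustration; the helper name is hypothetical, not from the PR):

```python
def detect_delimiter(data: bytes) -> bytes:
    # Hypothetical helper: look at which line-break character appears first.
    cr = data.find(b"\r")
    lf = data.find(b"\n")
    # If the first "\r" comes before the first "\n", assume the peer uses
    # "\r\n"; otherwise fall back to bare "\n".
    if cr != -1 and (lf == -1 or cr < lf):
        return b"\r\n"
    return b"\n"

assert detect_delimiter(b"HTTP/1.1 200 OK\r\nHost: x\r\n") == b"\r\n"
assert detect_delimiter(b"HTTP/1.1 200 OK\nHost: x\n") == b"\n"
```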
@bdraco, yes, `b"test-test-test".split(b"-")` is much faster than `regex.split(...)`
After some performance investigations I changed the algorithm. For now the changeset looks like this.

How `maybe_extract_lines` works for now:
1. It searches the `self._data` buffer until the `"\n\r?\n"` regex matches, which gives `data`
2. It detects the line delimiter in `data` using the `"\r?\n"` regex, which gives `delimiter`
3. It splits with `data.split(delimiter)` (it is much faster than `regex.split(data)`)

6099.7 requests/sec
6172.2 requests/sec
6094.7 requests/sec
6202.1 requests/sec
6303.3 requests/sec
6358.0 requests/sec
6372.2 requests/sec
6065.5 requests/sec
6211.9 requests/sec
6211.1 requests/sec
6131.1 requests/sec
6333.1 requests/sec
6103.8 requests/sec
6260.0 requests/sec
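The `maybe_extract_lines` algorithm described in this thread can be sketched roughly like this (a simplified illustration of the approach, not the actual h11 code; all names besides `maybe_extract_lines` are mine):

```python
import re

# Regexes matching the described approach: find the blank line that ends
# the headers, then detect which single delimiter the peer uses.
BLANK_LINE = re.compile(rb"\n\r?\n")
ANY_EOL = re.compile(rb"\r?\n")

def maybe_extract_lines(buffer: bytearray):
    """Split the header block off `buffer` and return its lines, or None."""
    match = BLANK_LINE.search(buffer)
    if match is None:
        return None  # headers are not complete yet
    data = bytes(buffer[: match.start()]).rstrip(b"\r")
    del buffer[: match.end()]
    eol = ANY_EOL.search(data)
    if eol is None:
        return [data]  # a single header line, no delimiter inside
    # bytes.split is much faster than regex.split over the same data.
    return data.split(eol.group(0))

buf = bytearray(b"HTTP/1.1 200 OK\r\nConnection: close\r\n\r\nbody")
print(maybe_extract_lines(buf))  # [b'HTTP/1.1 200 OK', b'Connection: close']
```

Note the weak spot: a peer mixing `\r\n` and bare `\n` within one message defeats the single-delimiter detection, which is what motivates the split-on-`b"\n"` approach discussed later in the thread.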
Running
PYTHONPATH=.. py-afl-fuzz -o results -i afl-server-examples/ -- python ./afl-server.py
for 1 cycle, I didn't get any crashes/hangs
@bdraco do you mind verifying that the REST sensor issue I'm having (issue #42608) is caused by the issue described here? This all started when Home Assistant version 0.117.x was released. I posted server response examples here: https://github.com/home-assistant/core/issues/42608#issuecomment-725312591. The server response comes from the latest official Glances API.
I'm not running glances so it's not something I'm in a position to verify. It would be best to check out this PR and update the system that is experiencing the issue.
@bdraco thank you for responding. I'm not sure how to check out this PR. Any ideas which Home Assistant beta version will include the changes from this PR?
Let's keep the discussion in the Home Assistant issue as it's off topic for this PR
Hello @pgjones I made the changes you mentioned.
Then I ran again the benchmark against the changes:
6290.6 requests/sec
6264.5 requests/sec
6302.6 requests/sec
6328.4 requests/sec
6344.2 requests/sec
6340.0 requests/sec
6347.9 requests/sec
6479.7 requests/sec
6480.7 requests/sec
6502.3 requests/sec
6317.8 requests/sec
6458.3 requests/sec
6238.8 requests/sec
6278.7 requests/sec
Looking over the data gathered in the issue history, I think the semantics we want are:
- for headers, tolerate a mix of `\r\n` and plain `\n` -- this definitely happens in practice
- for finding the blank line that ends the headers, search for `\n\r?\n`, because it automatically tolerates this + is simpler to match than `\r?\n\r?\n`
- for chunked encoding, we should only support `\r\n` exactly, no need to support broken peers that use `\n`
We might reduce the scope of this PR very slightly by changing the order in which we tackle things?... e.g. first dropping `.compress()` and `._start` in favour of plain `bytearray` truncation... It's possible that I'm being overly cautious there, but it doesn't read to me as an easily reviewable PR just yet, and dropping out that little bit of complexity first seems sensible, right?
@njsmith
Thank you for your response, I'm tuning the PR according to the suggestions.
The only thing which I don't understand is:
for headers, tolerate a mix of \r\n and plain \n -- this definitely happens in practice
does it mean that we should accept headers like this:
```python
b"HTTP/1.1 200 OK\r\n"
b"Accept-Ranges: bytes\n"
b"Content-type: text/plain\n"
b"Connection: close\r\n"
```
For now I'm detecting the `real_lines_delimiter` using the regex `b"\r?\n"` (`b"\r\n"` in the case above) and splitting with `data.split(real_lines_delimiter)`, which is faster than `regex.split(some_bytes)`
does it mean that we should accept headers like that
b"HTTP/1.1 200 OK\r\n" b"Accept-Ranges: bytes\n" b"Content-type: text/plain\n" b"Connection: close\r\n"
Yes, unfortunately. One of the real examples mentioned in the other thread was a server that only used \n
for one header (presumably because most of the response was generated by a standard library that knows how HTTP works, and then that one header was added in by some hacky one-off code written by someone who wasn't familiar with the RFCs).
For now I'm detecting the `real_lines_delimiter` using the regex `b"\r?\n"` (`b"\r\n"` in the case above) and using `data.split(real_lines_delimiter)`, which is faster than `regex.split(some_bytes)`
Here's an idea: what if we always split on `\n`, so we get lines like:
```python
[
    b"HTTP/1.1 200 OK\r",
    b"Accept-Ranges: bytes",
    b"Content-type: text/plain",
    b"Connection: close\r",
]
```
and then we handle the trailing `\r` as part of the header name/value parsing? That already involves doing a regex match on each header line and discarding trailing whitespace...
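A rough illustration of that split-on-`\n` idea (my own sketch, not the PR's code), using the mixed-delimiter headers from this thread:

```python
# Always split on b"\n", then tolerate an optional stray trailing b"\r"
# per line afterwards -- this handles peers that mix "\r\n" and "\n".
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Accept-Ranges: bytes\n"
    b"Content-type: text/plain\n"
    b"Connection: close\r\n"
)

lines = [line.rstrip(b"\r") for line in raw.split(b"\n") if line]
assert lines == [
    b"HTTP/1.1 200 OK",
    b"Accept-Ranges: bytes",
    b"Content-type: text/plain",
    b"Connection: close",
]
```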
Hello @njsmith ,
Sorry for the delay with an answer; it looks like there's a time-zone gap between us.
Since I didn't want to decrease `master` branch performance, I tested 3 different `maybe_extract_lines()` implementations: one with a regex split, and two with the approach suggested by @njsmith (splitting lines on `b"\n"`). All the solutions successfully pass the tests.
`regex.split(...)` bench result is:
6223.9 requests/sec
6247.8 requests/sec
6233.1 requests/sec
6250.9 requests/sec
6231.0 requests/sec
6246.4 requests/sec
6224.8 requests/sec
`[it.rstrip(b"\r") ...]` bench result is:
6202.6 requests/sec
6430.5 requests/sec
6444.7 requests/sec
6443.1 requests/sec
6437.4 requests/sec
6443.1 requests/sec
6415.2 requests/sec
`map(...)` bench result is:
6484.1 requests/sec
6497.6 requests/sec
6321.0 requests/sec
6475.9 requests/sec
6485.2 requests/sec
6526.2 requests/sec
6502.7 requests/sec
`master` branch bench (522b00470b5776dda2dc7e98ff778eb1e418e6ad):
6412.4 requests/sec
6349.3 requests/sec
6545.3 requests/sec
6530.8 requests/sec
6546.8 requests/sec
6508.1 requests/sec
6421.0 requests/sec
The example with `map(...)` is the best solution I found.
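For reference, the three benchmarked variants might look roughly like this (my reconstruction from the thread; the exact `map(...)` form used in the PR is an assumption):

```python
import re

CRLF_OR_LF = re.compile(rb"\r?\n")
data = b"HTTP/1.1 200 OK\r\nAccept-Ranges: bytes\nConnection: close\r\n"

# 1. regex split
lines_regex = CRLF_OR_LF.split(data)[:-1]

# 2. list comprehension with rstrip
lines_listcomp = [line.rstrip(b"\r") for line in data.split(b"\n")][:-1]

# 3. map() -- the fastest variant in the benchmarks above
lines_map = list(map(lambda line: line.rstrip(b"\r"), data.split(b"\n")))[:-1]

# All three produce the same lines, with any trailing "\r" removed.
assert lines_regex == lines_listcomp == lines_map
```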
@pgjones thank you for your approval!
I'm sorry, I hadn't seen that before I pushed the new changes (a94d7280d4d929d5075e902e7deae926d82559fc), speeding up the bench (the `map` example from https://github.com/python-hyper/h11/pull/115#issuecomment-731624909)
Sketching out @njsmith's "what if we always split on \n ... and then we handle the trailing \r as part of the header name/value parsing?", I think that'd look like this:
```python
import re

BLANK_LINE_REGEX = re.compile(b"\n\r?\n", re.MULTILINE)


class ReceiveBuffer(object):
    def __init__(self):
        self._data = bytearray()
        self._next_line_search = 0
        self._multiple_lines_search = 0

    def __iadd__(self, byteslike):
        self._data += byteslike
        return self

    def maybe_extract_at_most(self, count):
        """
        Extract a fixed number of bytes from the buffer.
        """
        out = self._data[:count]
        if not out:
            return None

        self._data[:count] = b""
        self._next_line_search = 0
        self._multiple_lines_search = 0
        return out

    def maybe_extract_next_line(self):
        """
        Extract the first line, if it is completed in the buffer.
        """
        # Only search in buffer space that we've not already looked at.
        partial_buffer = self._data[self._next_line_search :]
        partial_idx = partial_buffer.find(b"\n")
        if partial_idx == -1:
            self._next_line_search = len(self._data)
            return None

        # Truncate the buffer and return it.
        idx = self._next_line_search + partial_idx
        out = self._data[:idx].rstrip(b"\r")
        self._data[: idx + 1] = b""
        self._next_line_search = 0
        self._multiple_lines_search = 0
        return out

    def maybe_extract_multiple_lines(self):
        """
        Extract everything up to the first blank line, and return a list of lines.
        """
        # Handle the case where we have an immediate empty line.
        if self._data[:1] == b"\n":
            self._data[:1] = b""
            self._next_line_search = 0
            self._multiple_lines_search = 0
            return []

        if self._data[:2] == b"\r\n":
            self._data[:2] = b""
            self._next_line_search = 0
            self._multiple_lines_search = 0
            return []

        # Only search in buffer space that we've not already looked at.
        partial_buffer = self._data[self._multiple_lines_search :]
        match = BLANK_LINE_REGEX.search(partial_buffer)
        if match is None:
            self._multiple_lines_search = len(self._data) - 2
            return None

        # Truncate the buffer and return it.
        idx = self._multiple_lines_search + match.span()[-1]
        out = self._data[:idx]
        lines = [line.rstrip(b"\r") for line in out.split(b"\n")]

        self._data[:idx] = b""
        self._next_line_search = 0
        self._multiple_lines_search = 0

        assert lines[-2] == lines[-1] == b""
        return lines[:-2]
```
We could potentially also drop the `rstrip(b"\r")` handling in favour of performing that within the header parsing.
I think this maybe ends up reading more simply than methods which take a general-purpose regex?
I've not performance-tested it or tried it out against the test cases - just a first sketch at this point.
@tomchristie you mean to make `maybe_extract_next_line` and `maybe_extract_multiple_lines` independent?
Well, it's definitely more readable than now; e.g. it doesn't require 2 mutually-dependent arguments (`needle_regex` and the max regex length).
The only thing worrying me: I'm not sure if we can truncate the buffer manually until https://github.com/python-hyper/h11/pull/116 has been merged (I'm talking about `self._data[:idx] = b""`)
I'm not sure if we can truncate the buffer manually until #116 has been merged
Possibly. Given that it's only a performance implication on Python 2, it might not necessarily need to be strictly blocked on that. We'd probably want them to occur in the same release tho.
@tomchristie Having taken your code, I slightly changed it:

1.

```diff
- self._multiple_lines_search = len(self._data) - 2
+ self._multiple_lines_search = max(0, len(self._data) - 2)
```

(without that, `self._multiple_lines_search` might be negative)

2.

```diff
- idx = self._next_line_search + partial_idx
- out = self._data[:idx].rstrip(b"\r")
- self._data[:idx + 1] = b""
+ idx = self._next_line_search + partial_idx + 1
+ out = self._data[:idx]
+ self._data[:idx] = b""
```

(the chunk reader expects that `out` ends with `b"\r\n"`)
I estimated the performance on Python 3.8.5. Comparing it with the latest `master`, I got the following results:
`master` (522b00470b5776dda2dc7e98ff778eb1e418e6ad):
6284.0 requests/sec
6335.5 requests/sec
6443.5 requests/sec
6456.7 requests/sec
6430.6 requests/sec
6409.8 requests/sec
6475.6 requests/sec
6410.8 requests/sec
6406.4 requests/sec
6440.8 requests/sec
6477.5 requests/sec
6206.4 requests/sec
6397.7 requests/sec
6407.0 requests/sec
From my point of view it's almost equal (for both branches I've chosen the best `bench.py` run results)
@tomchristie if you don't mind I can push these changes
if you don't mind I can push these changes
@cdeler Sure thing yup.
I also looked at #120 in isolation, but this PR might end up making it redundant.
chunk reader expects that out ends with b"\r\n"
That assumption might not be true after the change, so might be something to review there.
Update: I'm referring here to the line in `_readers.py` which is currently `self._bytes_to_discard = 2`. That's presumably no longer necessarily true after this update.
@tomchristie
Update: I'm referring here to the line in _readers.py, which is currently self._bytes_to_discard = 2. That's presumably no longer necessarily true after this update.
As @njsmith wrote there
for chunked-encoding, we should only support \r\n exactly, no need to support broken peers that use \n
I think in this case `maybe_extract_next_line` should search for `b"\r\n"`. Then the function might look like:
```python
    def maybe_extract_next_line(self):
        """
        Extract the first line, if it is completed in the buffer.
        """
        # Only search in buffer space that we've not already looked at.
        search_start_index = max(0, self._next_line_search - 1)
        partial_buffer = self._data[search_start_index:]
        partial_idx = partial_buffer.find(b"\r\n")
        if partial_idx == -1:
            self._next_line_search = len(self._data)
            return None

        # Truncate the buffer and return it.
        # 2 == len(b"\r\n")
        idx = search_start_index + partial_idx + 2
        out = self._data[:idx]
        self._data[:idx] = b""
        self._next_line_search = 0
        self._multiple_lines_search = 0
        return out
```
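To illustrate why the search backs up one byte (the `max(0, self._next_line_search - 1)`): a `b"\r"` at the end of one receive must still be able to pair with a `b"\n"` that arrives later. Here's a standalone sketch of that logic with hypothetical names (my illustration, not the PR's code):

```python
def find_crlf_line(data: bytearray, next_line_search: int):
    """Return (line_including_crlf, new_search_offset), or (None, offset)."""
    # Back up one byte so a b"\r" seen at the end of the previous search
    # can still pair with a newly arrived b"\n".
    start = max(0, next_line_search - 1)
    idx = data.find(b"\r\n", start)
    if idx == -1:
        return None, len(data)
    end = idx + 2  # 2 == len(b"\r\n")
    line = bytes(data[:end])
    del data[:end]
    return line, 0

buf = bytearray(b"5\r")               # chunk-size line split across receives
line, offset = find_crlf_line(buf, 0)
assert line is None                   # incomplete: no b"\r\n" yet
buf += b"\nhello"
line, offset = find_crlf_line(buf, offset)
assert line == b"5\r\n"               # the pair is found across the boundary
assert bytes(buf) == b"hello"
```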
@tomchristie @njsmith
Sorry for disturbing you, but is there any estimated date for this PR and the next release? Maybe I can do something to make the reviewing easier?
I'm asking since some users of a downstream project (`home-assistant`) are suffering from problems where some sensors or devices are not accessible due to issues with line delimiters (e.g. https://github.com/home-assistant/core/issues/42415)
First up...
Based on this: https://github.com/home-assistant/core/issues/42415#issuecomment-731127651
Have we confirmed that the issue they're seeing is resolved by handling non-RFC-compliant line endings?
If so I can certainly make some more review time available, and help get this towards a release.
@tomchristie The credentials provided by @Kane610 produce a 401, so I still haven't verified the latest fix with an original device, but I can confirm that I checked the original PR against one affected device, and the changeset fixed the problem.
I'll try to ping @Kane610 or someone in the https://github.com/home-assistant/core/issues/42415 to obtain credentials and check the latest changes against the real device
Hello @tomchristie
Given the credentials provided by @wonderiuy, I checked the fix from this PR.
I can confirm that this fix solves an error with access to some `axis` resources.

Using the `requests` lib (they used it before):
```python
In [1]: import requests
In [2]: from requests.auth import HTTPDigestAuth
In [3]: url = "http://[REDACTED]/axis-cgi/pwdgrp.cgi?action=get"
In [4]: requests.get(url, auth=HTTPDigestAuth("[REDACTED]", "[REDACTED]"))
Out[4]: <Response [200]>
```

`httpx` without the fix:
```python
In [1]: import httpx
In [2]: url = "http://[REDACTED]/axis-cgi/pwdgrp.cgi?action=get"
In [3]: auth = httpx.DigestAuth("[REDACTED]", "[REDACTED]")
In [4]: client = httpx.Client(auth=auth)
In [5]: client.get(url)
RemoteProtocolError: peer unexpectedly closed connection
```

`httpx` with the fix:
```python
In [1]: import httpx
In [2]: url = "http://[REDACTED]/axis-cgi/pwdgrp.cgi?action=get"
In [3]: auth = httpx.DigestAuth("[REDACTED]", "[REDACTED]")
In [4]: client = httpx.Client(auth=auth)
In [5]: client.get(url)
Out[5]: <Response [200 OK]>
```
How would I go about testing @cdeler's fix locally? I'd be happy to verify as well.
You can use the address and credentials I've given to you; the firmware on that device is 5.20.5
@wonderiuy I meant getting his changes to my local filesystem :)
It was really simple with GitHub desktop to clone this branch and try it out. Works well on both 5.51 and 5.75 as well as on newer firmwares. Good job @cdeler !
I can also confirm this is working locally ;) Looking forward to the PR being approved.
Last week Axis published a new firmware version (5.51.7.2) for my camera (M1034-W). Below you will find the release notes.
Don't know if it has something to do with the issue (LF vs CRLF) in this thread, but the first correction (C01) in the new firmware version mentions a corrected newline character.
I thought it would be wise to make a note about it in this thread.
Corrections in 5.51.7.2 since 5.51.7.1
=======================================
5.51.7.2:C01 Corrected a newline character in pwdgrp.cgi, introduced in 5.51.6, that could cause problems when parsing the response.
5.51.7.2:C02 Corrected an issue that prevented Action Rule Events from sending images via email.
5.51.7.2:C03 Corrected an issue that caused monolith to timeout and respawn during too many connect/disconnect RTSP streaming requests.
5.51.7.2:C04 Added support to enable/disable X-Frame-Options headers in the plainconfig. By default, X-Frame-Options is enabled and its value is set to "sameorigin".
Last week Axis published a new firmware version (5.51.7.2) for my camera (M1034-W). Below you will find the release notes.
Don't know if it has something to do with the issue (LF vs CRLF) in this thread, but the first correction (C01) in the new firmware version mentions a corrected newline character.
Thanks! I don't know, but regardless there are other firmwares that don't get this update so it is still needed.
I've checked with the new firmware 5.51.7.2 but the problem is still there, so the fix from @cdeler and @Kane610 is more than welcome :)
Can also confirm that this PR solves the problems with older versions of Axis cameras in Homeassistant.
Hey guys! Any progress on getting this merged?
Oh also, forgot to say: could you also add a news entry for the next release notes? See `newsfragments/README.rst` for the details on how to do that.
Hello @njsmith ,
I resolved all remarks apart from this.
I cannot understand how to clearly and carefully fix the broken test case (probably we cannot just remove it).
@cdeler I tweaked your newsfragment to use the correct quoting: in ReST, code literals require double backticks. Super annoying if you're used to markdown, but what can ya do.
@pgjones Note that this PR doesn't quite drop support for py2, but it does change the buffer handling to be O(n**2) on py2, and I'm wondering if we should flag that in the release notes or anything. Or are you planning to drop py2 for real in the next release anyway?
Let's merge this, and drop Py2 for the next release.
This is like a Christmas gift; a big thank you to everyone involved
Hello,
I want to submit a PR which closes https://github.com/python-hyper/h11/issues/7
Why the changes are required
According to this comment in the issue, there are problems with some old servers which don't totally fit the HTTP/1.1 RFC. The original issue in `httpx` (https://github.com/encode/httpx/issues/1378) describes a situation where some embedded-system developers have to deal with a non-standard server.
What has been done?
I tried to reimplement the function which extracts response headers, using regex.
What hasn't been done yet
~Performance testing, fuzzing~
(updates) How `maybe_extract_lines` works for now:
1. It searches the `self._data` buffer until the `"\n\r?\n"` regex matches, which gives `data`
2. It detects the line delimiter in `data` using the `"\r?\n"` regex, which gives `delimiter`
3. It splits with `data.split(delimiter)` (it is much faster than `regex.split(data)`)

With fix

Without fix (522b004)