psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

iter_lines is still broken? #5540

Open vitidev opened 4 years ago

vitidev commented 4 years ago

I don't understand.

I can't iterate over "\r\n" lines because "\r" and "\n" can be read into different chunks.

But I can't even solve this problem by explicitly setting a single-byte delimiter "\n":

    for line in (_.strip() for _ in resp.iter_lines(delimiter=b"\n")):
        print(line)

because if a chunk ends with "\r\n", then lines = chunk.split(delimiter) inside iter_lines appends an extra empty line (whereas chunk.splitlines() does not).

The same problem occurs with plain "\n": if a chunk ends with the delimiter, we get an extra empty line. This forces you to set a large chunk size if the line length is unknown.
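The difference is easy to see with plain bytes, no HTTP involved:

```python
# Plain-Python demonstration of why an explicit delimiter produces
# extra empty lines while splitlines() does not.
chunk = b"abc\r\ndef\r\n"

# split() keeps whatever follows the last delimiter, which is an empty
# string when the chunk ends with the delimiter, and it leaves the
# "\r" attached to each line.
print(chunk.split(b"\n"))   # [b'abc\r', b'def\r', b'']

# splitlines() understands "\r\n" and never emits a trailing empty line.
print(chunk.splitlines())   # [b'abc', b'def']
```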

So if the split(...) adds an extra line, why not remove it?

    if delimiter:
        lines = chunk.split(delimiter)
        if lines and not lines[-1]:
            lines.pop(-1)
    else:
        lines = chunk.splitlines()
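To see why this matters, the delimiter path can be simulated in isolation (a sketch; `fake_iter_lines` is a hypothetical stand-in mirroring the split()/pending logic quoted elsewhere in this thread, not the real method):

```python
def fake_iter_lines(chunks, delimiter=b"\n"):
    # Simplified stand-in for Response.iter_lines() with an explicit
    # delimiter, mirroring the split()/pending logic under discussion.
    pending = None
    for chunk in chunks:
        if pending is not None:
            chunk = pending + chunk
        lines = chunk.split(delimiter)
        # Only a *partial* trailing line is carried over; when the
        # chunk ends with the delimiter, split() leaves an empty
        # string that gets yielded as a spurious blank line.
        if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
            pending = lines.pop()
        else:
            pending = None
        yield from lines
    if pending is not None:
        yield pending

# Two chunks, each ending exactly on the delimiter:
print(list(fake_iter_lines([b"abc\ndef\n", b"ghi\n"])))
# [b'abc', b'def', b'', b'ghi', b'']
```

Every chunk boundary that coincides with a delimiter injects one spurious empty line into the output.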
sigmavirus24 commented 4 years ago

I can't iterate "\r\n" because "\r" and "\n" can be read into different chunks.

I can't even solve this problem

I still don't understand your problem, and I think you have unrealistic expectations of what iter_lines does. Your issue isn't cogent enough for me to make sense of it; it sounds like you're having trouble using the API, which makes this more appropriate for StackOverflow. From what little I can understand of your issue, you haven't identified a clear bug in the documented API.

vitidev commented 4 years ago

I expect that

    with open('src.txt', 'r') as f:
        for line in f:

and, with the same file,

    resp = requests.get('http://127.0.0.1:5000/src.txt', stream=True)
    for line in resp.iter_lines(delimiter=b"\n"):

should be equivalent.

The file src.txt contains "\r\n" line endings, and I don't expect those to work correctly, because the current implementation is broken: #4629, #3894, #4271, #2431.

So I use an explicit single-byte delimiter "\n" and expect everything to be fine.

But in this case the internals switch to chunk.split, which leads to empty lines appearing that are not in the original file.

I see this is an old, known issue. Here is what I see in the comments in requests/models.py (v3.0):

  # Calling `.split(delimiter)` will always end with whatever text
  # remains beyond the delimiter, or '' if the delimiter is the end
  # of the text.  On the other hand, `.splitlines()` doesn't include
  # a '' if the text ends in a line delimiter.
  #
  # For example:
  #
  #     'abc\ndef\n'.split('\n')  ~> ['abc', 'def', '']
  #     'abc\ndef\n'.splitlines() ~> ['abc', 'def']
  #
  # So if we have a specified delimiter, we always pop the final
  # item and prepend it to the next chunk.
  #
  # If we're using `splitlines()`, we only do this if the chunk
  # ended midway through a line.

So this is both a description of the problem and a question: why not fix this aspect in the 2.x versions? I don't see any breaking changes here. The function will stop producing empty lines that it shouldn't be producing in the first place, and I will be able to use the built-in iter_lines functionality instead of a custom implementation.

sigmavirus24 commented 4 years ago

why not fix this aspect in the 2.x versions?

3.0 is never happening, and this won't be backported to 2.x.

I don't see any breaking changes here.

I'm not surprised. Unfortunately, a lot of code is incredibly closely attached to the current behaviour of this library, and even if we can say "well, they shouldn't be relying on that", changing it would still break those folks and would be backwards incompatible as a result. The unfortunate reality is that you and I can't predict every way that someone uses this library. Just because you can't see a reason for using it a certain way doesn't mean someone isn't, and strong backwards compatibility is important at this stage.

If 3.0 were ever to happen, that would be the time to change this behaviour, but 3.0 isn't happening.

vitidev commented 4 years ago

The unfortunate reality is that you and I can't predict every way that someone uses this library.

I cannot agree.

Changes differ in kind, and these specific changes do not break anything.

The method generates spurious empty lines, and only in the special case where a delimiter is specified. The user has only two options:

And if the method stops generating spurious empty lines, the filtering becomes pointless, but nothing breaks.

It is hard to imagine someone basing their logic on a side property (actually a bug) of a method that randomly generates empty lines. Basing logic on a side effect that can (and should) be corrected at any time would be extremely unwise.

And for those who do not use an explicit delimiter, nothing changes at all.

saulpw commented 1 year ago

@calpeterson ran into this same issue, and filed saulpw/visidata#1704.

I agree with @vitidev that iter_lines() should behave identically to iterating over the exact same file contents on local disk using a standard file pointer. Otherwise it is a subtly wrong function, as it can generate spurious blank lines if the delimiter happens to straddle a chunk boundary.

Tracking down the various PRs, it seems like the fix in #3984 was eventually merged into a proposed 3.0 branch, but as @sigmavirus24 says above, "3.0 isn't happening". So if I understand correctly, iter_lines() has a fundamental bug, no fix has been allowed into the current major version, and there will be no next major version.

Please allow a fix for this issue to be merged into the mainline requests library. I assure you, more people are depending on the expected behavior than are depending on the broken behavior, and it is only a matter of time until they hit this bug.
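The straddling case is easy to reproduce without a network: simulate the delimiter=None path of the current implementation on two chunks whose boundary falls between "\r" and "\n" (a sketch; `current_iter_lines` is a hypothetical stand-in mirroring the splitlines()/pending logic in the diff below, not the real method):

```python
def current_iter_lines(chunks):
    # Stand-in for the delimiter=None path of Response.iter_lines(),
    # mirroring the splitlines()/pending logic in current requests.
    pending = None
    for chunk in chunks:
        if pending is not None:
            chunk = pending + chunk
        lines = chunk.splitlines()
        if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
            pending = lines.pop()
        else:
            pending = None
        yield from lines
    if pending is not None:
        yield pending

# The body is b"abc\r\ndef\n", but the chunk boundary splits the "\r\n":
print(list(current_iter_lines([b"abc\r", b"\ndef\n"])))
# [b'abc', b'', b'def']  -- a spurious blank line appears
```

Iterating the same bytes from a local file yields only two lines; the blank line exists solely because "\r\n" straddled the chunk boundary.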

anjakefala commented 1 year ago

I am curious about the claim that "3.0 is never happening". It contradicts this page, where requests seems open to major releases, just infrequent ones: https://requests.readthedocs.io/en/latest/community/release-process/#major-releases

bemoody commented 11 months ago

I'd be happy to try to develop a fix for this issue.

So long as the issue isn't fixed, it's a trap for the unwary, and the documentation should point out that it's broken. Currently the documentation actively encourages people to use iter_lines together with stream=True.

bemoody commented 11 months ago

The following might work but is untested.

diff --git a/src/requests/models.py b/src/requests/models.py
index 44556394..6683ad92 100644
--- a/src/requests/models.py
+++ b/src/requests/models.py
@@ -861,6 +861,13 @@ class Response:

         pending = None

+        if decode_unicode:
+            lf = '\n'
+            cr = '\r'
+        else:
+            lf = b'\n'
+            cr = b'\r'
+
         for chunk in self.iter_content(
             chunk_size=chunk_size, decode_unicode=decode_unicode
         ):
@@ -869,18 +876,48 @@ class Response:

             if delimiter:
                 lines = chunk.split(delimiter)
-            else:
-                lines = chunk.splitlines()
-
-            if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
+                # If the chunk ends with the delimiter, then lines[-1]
+                # is empty.  Otherwise lines[-1] is not empty.  Either
+                # way, lines[-1] shouldn't be treated as a separate
+                # line, but should be prepended to the following
+                # chunk.
                 pending = lines.pop()
             else:
-                pending = None
+                lines = chunk.splitlines()
+                # If the chunk ends with 'XYZ\n' or 'XYZ\r\n', then
+                # lines[-1] is 'XYZ'.  Either way, the following byte
+                # is the start of a new line.
+                if chunk.endswith(lf):
+                    pending = None
+                # If the chunk ends with 'XYZ\r', then lines[-1] is
+                # 'XYZ'.  We don't know yet if the next byte will be
+                # '\n' (and should be treated as part of this line's
+                # ending) or something else (and is the start of a new
+                # line.)  So leave this line, with its '\r', pending.
+                elif chunk.endswith(cr):
+                    pending = lines.pop() + cr
+                # If the chunk doesn't end with '\r' or '\n', then
+                # lines[-1] isn't a complete line and should be
+                # prepended to the following chunk.
+                else:
+                    pending = lines.pop()

             yield from lines

         if pending is not None:
-            yield pending
+            if not delimiter:
+                # If the final chunk ends with CR, then CR will have
+                # been included in 'pending' above, and should be
+                # discarded.  Note that when delimiter = None, the
+                # caller wants splitlines() behavior, which means not
+                # generating a final empty string if the response ends
+                # with a line terminator.
+                yield pending.rstrip(cr)
+            else:
+                # When delimiter != None, the caller wants split()
+                # behavior, which means generating a final empty
+                # string if the response ends with the delimiter.
+                yield pending

     @property
     def content(self):
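Extracted into a standalone generator (a sketch for illustration; the name `patched_iter_lines` is hypothetical, and the decode_unicode/str handling is omitted), the patched logic behaves like splitlines() across chunk boundaries:

```python
def patched_iter_lines(chunks, delimiter=None):
    # Standalone sketch of the patched splitting logic above,
    # operating on an iterable of byte chunks.
    lf, cr = b"\n", b"\r"
    pending = None
    for chunk in chunks:
        if pending is not None:
            chunk = pending + chunk
        if delimiter:
            lines = chunk.split(delimiter)
            # lines[-1] is either a partial line or b""; either way it
            # belongs with the next chunk, not this one.
            pending = lines.pop()
        else:
            lines = chunk.splitlines()
            if chunk.endswith(lf):
                pending = None
            elif chunk.endswith(cr):
                # The next byte may be "\n"; keep the "\r" pending.
                pending = lines.pop() + cr
            else:
                pending = lines.pop()
        yield from lines
    if pending is not None:
        # splitlines() behaviour: drop a trailing "\r";
        # split() behaviour: emit the final (possibly empty) piece.
        yield pending.rstrip(cr) if not delimiter else pending

# "\r\n" straddling a chunk boundary no longer yields a blank line:
print(list(patched_iter_lines([b"abc\r", b"\ndef\n"])))
# [b'abc', b'def']
```

With an explicit delimiter, split() semantics are preserved: a body ending in the delimiter still yields a final empty string, matching `bytes.split`.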
bemoody commented 11 months ago

I'll say one more thing on the question of backward compatibility.

I found 82 instances of .iter_lines( in Debian: https://codesearch.debian.net/search?literal=1&q=.iter_lines(+filetype%3Apython+-package%3Arequests

Based on a very quick assessment:

So that's roughly 50% of the actual users of this method that are currently broken, and 0% that would be rendered broken by fixing this bug.