piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

http module - buffer does not work? #712

Open grubberr opened 2 years ago

grubberr commented 2 years ago

Hello,

I think the buffering in smart_open's http module could be improved. Please take a look at this code sample:

import smart_open
import pandas as pd
import http.client as http_client

http_client.HTTPConnection.debuglevel = 1

fp = smart_open.open("https://github.com/airbytehq/airbyte/files/9280856/test.xlsx", mode="rb")
df = pd.read_excel(fp)
print(df)

$ ./test.py | grep airbytehq/airbyte/files/9280856/test.xlsx
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8685-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8665-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8045-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8685-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8665-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8045-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=6478-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=1724-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=6694-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=3832-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=2536-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=3099-\r\n\r\n'

pandas.read_excel reads the file in a random-access way: it makes a lot of seek and read calls. I expected that if the first HTTP request reads the entire file contents, subsequent read calls would be served from some internal buffer. However, I still see the library making further HTTP requests under the hood, with Range offsets inside data that the first request already read.

Can we improve this? Can we skip the additional HTTP requests if we already have all the needed data from the first request?

Versions

print(platform.platform())
Linux-5.14.0-1047-oem-x86_64-with-glibc2.31
print("Python", sys.version)
Python 3.9.11 (main, Aug  9 2022, 09:22:28) 
[GCC 9.4.0]
print("smart_open", smart_open.__version__)
smart_open 6.0.0

mpenkov commented 2 years ago

smart_open's main use case is streaming. If your application does a lot of seeking, then it may be better for you to handle buffering separately (e.g. using tempfile).
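
The tempfile approach mentioned above could look roughly like the sketch below. It is only an illustration: copy_to_tempfile is a hypothetical helper name, not part of smart_open, and it assumes the whole file fits comfortably on local disk. The remote stream is read once, sequentially, and every later seek and read goes to the local copy.

```python
import shutil
import tempfile


def copy_to_tempfile(stream):
    """Copy a binary stream into a local, seekable temporary file.

    Hypothetical helper for illustration; not part of smart_open.
    The caller is responsible for closing the returned file.
    """
    local = tempfile.TemporaryFile()   # opened in binary "w+b" mode by default
    shutil.copyfileobj(stream, local)  # single sequential pass over the source
    local.seek(0)                      # rewind so the consumer starts at byte 0
    return local
```

With the code from the original report, this would mean opening the URL with smart_open.open(url, mode="rb"), passing that object to copy_to_tempfile, and handing the returned file to pd.read_excel instead of the remote stream.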

Ideally, yes, smart_open would be smart enough to buffer the contents of the stream itself, but how do you determine the ideal size of the buffer? Automatically? Using some sort of parameter? It's a fair bit of work.

grubberr commented 2 years ago

In my opinion, it could be any buffer size with some LRU mechanism. The main idea is to avoid re-reading data from upstream, as far as possible, if it was already read recently.

Yes, I agree, it could be a pretty complex task that complicates the library too much and could introduce new errors.
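
For readers wondering what the LRU idea above might look like, here is a minimal sketch of a read-only wrapper that caches fixed-size blocks. Nothing in it is part of smart_open: the class name BlockCachedReader and the parameters size, block_size and max_blocks are all assumptions made for illustration, and the total size must be supplied by the caller. It also ignores concerns a real implementation would need to handle, such as thread safety and error recovery.

```python
import io
from collections import OrderedDict


class BlockCachedReader(io.IOBase):
    """Read-only, seekable wrapper that keeps recently used blocks in an LRU.

    Hypothetical sketch; `raw` is any seekable binary stream and `size`
    is its total length in bytes, supplied by the caller.
    """

    def __init__(self, raw, size, block_size=64 * 1024, max_blocks=16):
        self._raw = raw
        self._size = size
        self._block_size = block_size
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()  # block index -> bytes, oldest first
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        # Seeking only moves the local position; no upstream call is made.
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        self._pos = max(0, min(pos, self._size))
        return self._pos

    def _block(self, index):
        if index in self._blocks:
            self._blocks.move_to_end(index)  # mark as most recently used
            return self._blocks[index]
        start = index * self._block_size
        want = min(self._block_size, self._size - start)
        self._raw.seek(start)
        parts = []
        while want > 0:  # tolerate short reads from the upstream
            part = self._raw.read(want)
            if not part:
                break
            parts.append(part)
            want -= len(part)
        data = b"".join(parts)
        self._blocks[index] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)  # evict least recently used
        return data

    def read(self, size=-1):
        if size is None or size < 0:
            size = self._size - self._pos
        end = min(self._pos + size, self._size)
        chunks = []
        while self._pos < end:
            index, offset = divmod(self._pos, self._block_size)
            piece = self._block(index)[offset:offset + (end - self._pos)]
            if not piece:  # upstream ended earlier than `size` promised
                break
            chunks.append(piece)
            self._pos += len(piece)
        return b"".join(chunks)
```

The sketch also shows the open question raised earlier in the thread: block_size and max_blocks still have to be chosen by someone, and the right values depend on the access pattern.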