googleapis / python-storage


Upload blob from HTTPResponse #28

Open · jared-martin opened this issue 5 years ago

jared-martin commented 5 years ago

I'm trying to use Blob.upload_from_file to upload an http.client.HTTPResponse object without saving it to disk first. It seems like this, or a version of this that wraps the HTTPResponse in an io object, should be possible.

However, because the response may be larger than _MAX_MULTIPART_SIZE, Blob.upload_from_file creates a resumable upload, which depends on tell to make sure the stream is at the beginning. Here is the code that reproduces this issue:


import urllib.request

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.csv', chunk_size=1 << 20)

a_few_megs_of_data = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&batter_stands=&game_date_gt=2018-09-06&game_date_lt=2018-09-09&group_by=name&hfAB=&hfBBL=&hfBBT=&hfC=&hfFlag=&hfGT=R%7CPO%7CS%7C&hfInn=&hfNewZones=&hfOuts=&hfPR=&hfPT=&hfRO=&hfSA=&hfSea=2018%7C&hfSit=&hfZ=&home_road=&metric_1=&min_abs=0&min_pitches=0&min_results=0&opponent=&pitcher_throws=&player_event_sort=h_launch_speed&player_type=batter&position=&sort_col=pitches&sort_order=desc&stadium=&team=&type=details'
response = urllib.request.urlopen(a_few_megs_of_data)

blob.upload_from_file(response)

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1081, in upload_from_file
    client, file_obj, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 991, in _do_upload
    client, stream, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 934, in _do_resumable_upload
    predefined_acl=predefined_acl,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 883, in _initiate_resumable_upload
    stream_final=False,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/requests/upload.py", line 323, in initiate
    total_bytes=total_bytes, stream_final=stream_final)
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/_upload.py", line 409, in _prepare_initiate_request
    if stream.tell() != 0:
io.UnsupportedOperation: seek

Is it possible to read an HTTP response in chunks and write it to the blob without using the filesystem as an intermediary, or is this bad practice? If it is possible and not discouraged, what is the recommended way to do this?
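For reference, the failing call can be reproduced on the bare response object. http.client.HTTPResponse is an unseekable io.BufferedIOBase, and the default IOBase.tell() is implemented in terms of seek(0, 1), which is why the traceback above ends in io.UnsupportedOperation: seek. A minimal demonstration, using example.com as a stand-in URL:

import io
import urllib.request

response = urllib.request.urlopen('https://example.com')
print(response.seekable())  # False: HTTPResponse cannot seek

try:
    response.tell()  # IOBase.tell() delegates to seek(0, 1)
except io.UnsupportedOperation as exc:
    print(exc)  # prints "seek", matching the traceback above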

jared-martin commented 5 years ago

Update: wrapping the HTTP response in a class that trivially implements tell makes this work as expected.

import urllib.request

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.csv', chunk_size=1 << 20)

a_few_megs_of_data = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&batter_stands=&game_date_gt=2018-09-06&game_date_lt=2018-09-09&group_by=name&hfAB=&hfBBL=&hfBBT=&hfC=&hfFlag=&hfGT=R%7CPO%7CS%7C&hfInn=&hfNewZones=&hfOuts=&hfPR=&hfPT=&hfRO=&hfSA=&hfSea=2018%7C&hfSit=&hfZ=&home_road=&metric_1=&min_abs=0&min_pitches=0&min_results=0&opponent=&pitcher_throws=&player_event_sort=h_launch_speed&player_type=batter&position=&sort_col=pitches&sort_order=desc&stadium=&team=&type=details'
response = urllib.request.urlopen(a_few_megs_of_data)

class HTTPResponseWithTell(object):
    """Wraps an unseekable HTTP response and tracks the read position."""

    def __init__(self, http_response):
        self.http_response = http_response
        self.number_of_bytes_read = 0

    def tell(self):
        # The upload machinery only calls tell() to verify the stream
        # is at position 0, so counting bytes read is sufficient.
        return self.number_of_bytes_read

    def read(self, *args, **kwargs):
        buffer = self.http_response.read(*args, **kwargs)
        self.number_of_bytes_read += len(buffer)
        return buffer

response_with_tell = HTTPResponseWithTell(response)
blob.upload_from_file(response_with_tell)

This reads the response 1 MB at a time and uploads it to cloud storage without ever storing the whole thing in memory.

However, after reading through the code and understanding ResumableUpload a little bit better, the point seems to be that unseekable streams are not resumable, since seek is required to resume an upload from where it left off in the event of failure. There doesn't seem to be a supported option for uploading data in chunks that is not strictly "resumable".
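Since this issue was filed, google-cloud-storage has grown a file-like writer, Blob.open, which buffers one chunk at a time and only requires read() on the source, so an unseekable response can be streamed without touching the filesystem. A sketch, assuming a library release that includes the writer and using a stand-in URL:

import shutil
import urllib.request

from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').blob('my-file.csv')

response = urllib.request.urlopen('https://example.com/large.csv')  # stand-in URL

# blob.open('wb') returns a writable file-like object that accumulates
# chunk_size bytes and uploads each chunk; the source stream is only
# ever read, never seeked.
with blob.open('wb', chunk_size=1 << 20) as destination:
    shutil.copyfileobj(response, destination)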

crwilcox commented 4 years ago

Thanks for providing this feedback. It seems we would need to alter the inner workings so they no longer depend on being able to rewind the stream. To my knowledge this is supported in our Node client, so it isn't an unreasonable ask for Python.

Thanks for the feedback!

jiajie-chen-havas commented 4 years ago

This would definitely be super helpful for our team as well! Some of the data we're trying to upload is produced by generators/iterables, which we wrap in a custom read-only subclass of io.RawIOBase.
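A minimal sketch of that kind of wrapper, pairing an io.RawIOBase adapter over a generator with the byte-counting tell() trick from the workaround above; IterStream and chunks are hypothetical names:

import io

from google.cloud import storage

class IterStream(io.RawIOBase):
    """Adapts an iterable of bytes chunks to a read-only raw stream."""

    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._leftover = b''
        self._bytes_read = 0

    def readable(self):
        return True

    def tell(self):
        # Same trick as HTTPResponseWithTell above: report bytes read
        # instead of delegating to the unsupported seek().
        return self._bytes_read

    def readinto(self, b):
        # Assumes the iterable yields non-empty bytes objects; an empty
        # chunk would be mistaken for end-of-stream.
        try:
            chunk = self._leftover or next(self._iter)
        except StopIteration:
            return 0  # end of stream
        n = min(len(b), len(chunk))
        b[:n] = chunk[:n]
        self._leftover = chunk[n:]
        self._bytes_read += n
        return n

def chunks():  # hypothetical generator of data to upload
    for _ in range(1000):
        yield b'x' * 8192

client = storage.Client()
blob = client.bucket('my-bucket').blob('generated.bin', chunk_size=1 << 20)

# BufferedReader keeps reading the raw stream until it can return a
# full block, which matters because the chunked uploader treats a
# short read as the final chunk.
stream = io.BufferedReader(IterStream(chunks()), buffer_size=1 << 20)
blob.upload_from_file(stream)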