jared-martin opened this issue 5 years ago
Update: wrapping the HTTP response in a class that trivially implements tell
makes this work as expected.
import urllib.request

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.csv', chunk_size=1 << 20)  # upload in 1 MiB chunks

a_few_megs_of_data = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&batter_stands=&game_date_gt=2018-09-06&game_date_lt=2018-09-09&group_by=name&hfAB=&hfBBL=&hfBBT=&hfC=&hfFlag=&hfGT=R%7CPO%7CS%7C&hfInn=&hfNewZones=&hfOuts=&hfPR=&hfPT=&hfRO=&hfSA=&hfSea=2018%7C&hfSit=&hfZ=&home_road=&metric_1=&min_abs=0&min_pitches=0&min_results=0&opponent=&pitcher_throws=&player_event_sort=h_launch_speed&player_type=batter&position=&sort_col=pitches&sort_order=desc&stadium=&team=&type=details'
response = urllib.request.urlopen(a_few_megs_of_data)

class HTTPResponseWithTell(object):
    """Wrap an HTTP response so that tell() reports how many bytes
    have been read, which is all the resumable upload checks here."""

    def __init__(self, http_response):
        self.http_response = http_response
        self.number_of_bytes_read = 0

    def tell(self):
        return self.number_of_bytes_read

    def read(self, *args, **kwargs):
        buffer = self.http_response.read(*args, **kwargs)
        self.number_of_bytes_read += len(buffer)
        return buffer

response_with_tell = HTTPResponseWithTell(response)
blob.upload_from_file(response_with_tell)
This reads the response 1 MiB at a time (chunk_size=1 << 20) and uploads it to Cloud Storage without ever holding the whole response in memory.

However, after reading through the code and understanding ResumableUpload a little better, the point seems to be that unseekable streams are not resumable: seek is required to rewind to where an upload left off in the event of failure. There doesn't seem to be a supported option for uploading data in chunks that isn't strictly "resumable".
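A minimal sketch of why resuming depends on seek (illustrative only, not the library's actual code; the function and its arguments are hypothetical):

import io

# Hypothetical sketch: after a failed chunk, a resumable uploader has to
# rewind the stream to the last byte the server acknowledged before it
# can retransmit.
def rewind_for_retry(stream, last_acknowledged_byte):
    if not stream.seekable():
        # http.client.HTTPResponse lands here: it cannot go backwards,
        # so the upload cannot be resumed after a failure.
        raise io.UnsupportedOperation('stream is not seekable; cannot resume')
    stream.seek(last_acknowledged_byte)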
Thanks for providing this feedback! It seems we would need to alter the inner workings to not depend on being able to reverse through the stream. To my knowledge this is supported in our Node client, so it isn't an unreasonable ask for Python.
This would definitely be super helpful for our team as well! Some of the data we're trying to upload is generated from generators/iterables, which we wrap in a custom read-only subclass of io.RawIOBase, along the lines of the sketch below.
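A minimal sketch of that kind of wrapper, assuming the iterable yields bytes chunks (the name IterStream and the details are illustrative, not the commenter's actual class):

import io

class IterStream(io.RawIOBase):
    """Read-only file-like view over an iterable of bytes chunks."""

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self._leftover = b''

    def readable(self):
        return True

    def readinto(self, b):
        # Serve bytes from the current chunk, pulling the next chunk
        # from the iterator once the current one is exhausted.
        while not self._leftover:
            try:
                self._leftover = next(self._iterator)
            except StopIteration:
                return 0  # EOF
        n = min(len(b), len(self._leftover))
        b[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n

# Usage: wrap a generator and read it like a file.
stream = io.BufferedReader(IterStream(iter([b'spam', b'eggs'])))
assert stream.read() == b'spameggs'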
I'm trying to use Blob.upload_from_file to upload an http.client.HTTPResponse object without saving it to disk first. It seems like this, or a version of this that wraps the HTTPResponse in an io object, should be possible. However, because the response may be larger than _MAX_MULTIPART_SIZE, Blob.upload_from_file creates a resumable upload, which depends on tell to make sure the stream is at the beginning. Here is the code that reproduces this issue:

Traceback:
Is it possible to read an HTTP response in chunks and write it to the blob without using the filesystem as an intermediary, or is this bad practice? If it is possible and not discouraged, what is the recommended way to do this?
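For reference, newer versions of google-cloud-storage (1.32+) expose Blob.open, which returns a writable file object that uploads as it is written; a sketch assuming that version is available, illustrative rather than an officially recommended pattern:

import shutil
import urllib.request

from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').blob('my-file.csv')

url = 'https://example.com/a-few-megs.csv'  # stands in for the URL above

with urllib.request.urlopen(url) as response:
    # blob.open('wb') yields a file-like writer that streams chunks to
    # Cloud Storage, so nothing is staged on the local filesystem.
    # chunk_size must be a multiple of 256 KiB; 1 << 20 is 1 MiB.
    with blob.open('wb', chunk_size=1 << 20) as writer:
        shutil.copyfileobj(response, writer, length=1 << 20)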