piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

Support botocore.response.StreamingBody from boto3 as input to smart_open.smart_open() #148

Closed. alexindaco closed this issue 5 years ago.

alexindaco commented 6 years ago

I noticed that for boto.s3.key.Key you use io.BufferedReader.

boto3's botocore.response.StreamingBody has a _raw_stream attribute, which it looks like you can pass straight to io.BufferedReader, and it works:

>>> import io
>>> import boto3
>>> s3 = boto3.resource('s3',
...          aws_access_key_id=ACCESS_KEY,
...          aws_secret_access_key=ACCESS_SECRET)
>>> a = s3.ObjectSummary(BUCKET_NAME, KEY_NAME).get()["Body"]
>>> a
<botocore.response.StreamingBody object at 0x108898690>
>>> stream = io.BufferedReader(a._raw_stream)
>>> stream.readline()
'first line\n'
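
To illustrate the idea, here is a minimal sketch of that wrapping packaged as a helper. The function name to_buffered_reader and the bucket/key values are hypothetical placeholders, it assumes boto3 credentials are already configured, and note that _raw_stream is a private botocore attribute that could change between releases:

import io
import boto3

def to_buffered_reader(bucket, key):
    # Hypothetical helper: fetch the object and wrap its private
    # _raw_stream (a urllib3 HTTPResponse, an io.IOBase subclass)
    # in io.BufferedReader so it reads like an ordinary file object.
    body = boto3.resource('s3').Object(bucket, key).get()['Body']
    return io.BufferedReader(body._raw_stream)

# Usage: iterate over the object line by line (yields bytes).
reader = to_buffered_reader('my-bucket', 'my-key.txt')
for line in reader:
    print(line)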
mpenkov commented 5 years ago

This is currently possible on the master branch:

>>> from smart_open import open
>>> import boto3
>>> s3 = boto3.resource('s3')
>>> summary = s3.ObjectSummary('commoncrawl', 'robots.txt')
>>> stream = summary.get()['Body']._raw_stream
>>> open(stream).read()
'User-Agent: *\nDisallow: /'
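
For comparison, a sketch of the more usual pattern, which lets smart_open manage the S3 transport itself by passing an s3:// URL and avoids touching the private _raw_stream attribute. This assumes default boto3 credentials are configured and that the commoncrawl bucket stays publicly readable; the output shown is the same content as in the example above:

>>> from smart_open import open
>>> with open('s3://commoncrawl/robots.txt', 'r') as fin:
...     print(fin.read())
User-Agent: *
Disallow: /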

I don't think any further work is necessary here. @ceruly Can you please let me know if I've misunderstood something?