piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

GCS decompressive transcoding not supported when reading #422

Open gdmachado opened 4 years ago

gdmachado commented 4 years ago

Problem description

Using smart_open (unreleased version with GCS support) to download files from GCS with transparent decompressive transcoding enabled may lead to incomplete files being downloaded depending on the compressed file size.

With Google Cloud Storage there is the option to store gzip-compressed files & use decompressive transcoding to transparently decompress them when downloading. Decompression is then handled by Google's servers. In this case, the filename wouldn't have any compression extension (e.g. file.csv); however, when inspecting its metadata, it would contain something like this:

{
    "Content-Type": "text/csv; charset=utf-8",
    "Content-Encoding": "gzip"
}
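
Such metadata can be produced with gsutil cp -Z (as in the reproduction below), or by uploading pre-compressed bytes yourself and setting Content-Encoding. A minimal sketch with the google-cloud-storage client (bucket and object names are placeholders):

import gzip
from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').blob('file.csv')
# Mark the stored bytes as gzip-compressed so GCS transcodes them on download.
blob.content_encoding = 'gzip'
blob.upload_from_string(gzip.compress(b'col1,col2\n1,2\n'),
                        content_type='text/csv; charset=utf-8')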

This would be fine if it weren't for the fact that in such cases, Blob()._size will return the compressed size. Since smart_open uses this to understand when to stop reading, it results in incomplete files.
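
To see the mismatch directly, here is a sketch against the google-cloud-storage client, using the bucket and object from the reproduction below:

from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').get_blob('rand.txt')

# For a transcoded object, blob.size reports the stored (compressed) size,
# while a default download returns the decompressed payload.
print(blob.size)                       # compressed size, e.g. 300842
print(len(blob.download_as_string()))  # decompressed size, e.g. 400000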

Steps/code to reproduce the problem

write a ~400 KB file (larger than smart_open's default buffer size)

$ cat /dev/urandom | gtr -dc A-Za-z0-9 | head -c 400000 > rand.txt

$ ls -l  
total 1024
-rw-r--r--  1 gustavomachado  staff  524288 Feb 17 12:34 rand.txt

upload file to GCS

$ gsutil cp -Z ./rand.txt gs://my-bucket/
Copying file://./rand.txt [Content-Type=text/plain]...
- [1 files][293.8 KiB/293.8 KiB]                                                
Operation completed over 1 objects/293.8 KiB. 

The resulting (compressed) file is 293.8 KiB.

check file metadata

$ gsutil stat gs://my-bucket/rand.txt                       
gs://my-bucket/rand.txt:
    Creation time:          Mon, 17 Feb 2020 13:45:36 GMT
    Update time:            Mon, 17 Feb 2020 13:45:36 GMT
    Storage class:          MULTI_REGIONAL
    Cache-Control:          no-transform
    Content-Encoding:       gzip
    Content-Language:       en
    Content-Length:         300842
    Content-Type:           text/plain
    Hash (crc32c):          Ko+ooA==
    Hash (md5):             8C6OlwZIR+fgRMy2xmQqLw==
    ETag:                   CNWW+Kjc2OcCEAE=
    Generation:             1581947136379733
    Metageneration:         1

download file using smart_open (gcloud credentials already set)

>>> from smart_open import open
>>> with open('gs://my-bucket/rand.txt', 'r') as fin:
...     with open('downloaded.txt', 'w') as fout:
...         for line in fin:
...             fout.write(line)
... 
348550

check resulting file size

$ ls -l
total 1472
-rw-r--r--  1 gustavomachado  staff  348550 Feb 17 14:48 downloaded.txt
-rw-r--r--  1 gustavomachado  staff  400000 Feb 17 14:45 rand.txt

The original file is 400 KB, but the downloaded file is 348 KB. I'm not sure why it's still larger than the 300842 bytes reported by Google, though.

Versions

Please provide the output of:

>>> import platform, sys, smart_open
>>> print(platform.platform())
Darwin-18.7.0-x86_64-i386-64bit
>>> print("Python", sys.version)
Python 3.7.2 (default, Dec  9 2019, 14:10:57) 
[Clang 10.0.1 (clang-1001.0.46.4)]
>>> print("smart_open", smart_open.__version__)
smart_open 1.9.0

smart_open has been pinned to 72818ca, installed with

$ pip install git+git://github.com/RaRe-Technologies/smart_open.git@72818ca6d3a0a99e1717ab31db72bf109ac5ce65 

Possible solutions

Setting buffer_size to a value larger than the compressed file size will of course download it in its entirety, but for large files that would mean loading the entire file into memory.
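
For illustration, the workaround looks like this; buffer_size just needs to exceed the compressed object size (300842 bytes in the reproduction above):

from smart_open import open

# Workaround only: a GCS read buffer larger than the compressed object means
# the whole blob is fetched at once, at the cost of holding it in memory.
with open('gs://my-bucket/rand.txt', 'r',
          transport_params=dict(buffer_size=1024 * 1024)) as fin:
    data = fin.read()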

A reasonable option would be to check Blob().content_encoding, and if it is equal to 'gzip', call Blob().download_as_string with raw_download=True, and then handle decompression internally with the already-existing decompression mechanisms.
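
A rough sketch of that idea, using only public google-cloud-storage calls (Blob.content_encoding and Blob.download_as_string(raw_download=True)); it ignores buffering/streaming for clarity:

import gzip
from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').get_blob('rand.txt')

if blob.content_encoding == 'gzip':
    # raw_download=True fetches the stored (compressed) bytes, whose length
    # actually matches blob.size; we then decompress them ourselves.
    data = gzip.decompress(blob.download_as_string(raw_download=True))
else:
    data = blob.download_as_string()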

If the maintainers agree this would be a viable solution, I'll be happy to provide a PR implementing it.

petedannemann commented 4 years ago

Thanks for reporting this and providing clear descriptions and solutions! Your possible solution seems like a good start, but how will we be able to handle compression formats other than gzip? Could we just create a transport_param to toggle raw_download?

I think we will need to come up with a way to tag compressed files with their appropriate content_type metadata on upload as well.

petedannemann commented 4 years ago

I misunderstood; other compression formats would not have this problem, as they are not transparently decompressed by Google. I think we can get away with just creating a raw_download option for smart_open.gcs.open, and then the user can decompress the data returned by smart_open.gcs.SeekableBufferedInputBase.read.
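
For illustration, the caller side of that proposal might look like the sketch below; note that the raw_download transport parameter is hypothetical and does not exist in smart_open:

import gzip
from smart_open import open

# 'raw_download' is the proposed (hypothetical) transport parameter.
with open('gs://my-bucket/rand.txt', 'rb',
          transport_params=dict(raw_download=True)) as fin:
    data = gzip.decompress(fin.read())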

mpenkov commented 4 years ago

It'd be good if we could avoid adding more options. Can you think of a logical way to handle this?

Why don't we just perform all the compression/decompression on our side?

petedannemann commented 4 years ago

OK, @gdmachado's suggestion is probably what we want to do then. If Blob.content_encoding == 'gzip' and file_extension != '.gz', then we save state so smart_open knows the file was transcoded to gzip, and we download the raw compressed data, which will have a size true to Blob.size. I think smart_open.gcs.SeekableBufferedInputBase.read will have to use this state to know to decompress the data before it is returned to the user.
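
A very rough sketch of that flow (not smart_open's actual code; names are illustrative), inflating the raw compressed chunks incrementally before they reach the caller:

import zlib

def _is_transcoded(blob, key):
    # State captured at open() time: stored gzip-compressed, but no .gz extension.
    return blob.content_encoding == 'gzip' and not key.endswith('.gz')

class _GzipChunkInflater:
    """Wraps an iterator of raw (compressed) chunks and yields decompressed bytes."""

    def __init__(self, raw_chunks):
        self._chunks = raw_chunks
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header/trailer.
        self._inflater = zlib.decompressobj(zlib.MAX_WBITS | 16)

    def __iter__(self):
        for chunk in self._chunks:
            data = self._inflater.decompress(chunk)
            if data:
                yield data
        tail = self._inflater.flush()
        if tail:
            yield tail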

hoverinc-frankmata commented 4 years ago

I can confirm this problem, though I noticed it in a different manner. I tested with a file similar to the one above, but used transport_params=dict(buffer_size=1024) to force it to stream in parts. I have 4 cases to share with you: GCS files named rand.txt and rand.txt.gz, each read with ignore_ext=True and ignore_ext=False. Data was uploaded to GCS with gsutil cp -Z (see the script below).

I modified the smart_open code to add raw_download=True to the download_as_string() call, and here are the results.

Hope this helps y'all to find a good solution. Included below is the quick and dirty script I used to demo this.

set -e
GCP_BUCKET=???
echo "START" > rand.txt
cat /dev/urandom | gtr -dc A-Za-z0-9 | head -c 40000 >> rand.txt
echo "END" >> rand.txt
ls -al rand.txt
gsutil cp -Z rand.txt ${GCP_BUCKET}/rand.txt
gsutil cp -Z rand.txt ${GCP_BUCKET}/rand.txt.gz
gsutil ls -l ${GCP_BUCKET}/rand.txt ${GCP_BUCKET}/rand.txt.gz

for f in ${GCP_BUCKET}/rand.txt ${GCP_BUCKET}/rand.txt.gz ; do
  for ignore_ext in True False; do
    python -c "
from smart_open import open
import gzip
print('\n\n--------${f} ignore_ext=${ignore_ext}----------')
with open('${f}', 'rb', ignore_ext=${ignore_ext}, transport_params=dict(buffer_size=1024)) as f:
  data = f.read()
  print(len(data))
  print(data[:5])
  print(data[-4:])
  raw = gzip.decompress(data)
  print(len(raw))
  print(raw[:5])
  print(raw[-4:])
print('Done')
    " || echo "Function failed"
  done
done