piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License
3.22k stars 383 forks source link

Not able to upload a compressed S3 file using close()/without context manager #837

Closed jbarragan-bridge closed 1 month ago

jbarragan-bridge commented 1 month ago

Problem description

We are using smart_open to compress and upload files to s3 without a context manager: e.g.

hook = S3Hook()
s3_params = {
    'client': hook.get_session().client('s3'),
}
f = smart_open.open('s3://bucket/file.gz', 'w', transport_params=s3_params)
f.write('Some content')
f.close()

Steps/code to reproduce the problem

Adding the following tests to test_s3.py (class MultipartWriterTest) shows our problem:

    def test_write_gz_using_context_manager(self):
        """Does s3 multipart upload create a compressed file using context manager?"""
        contents = b'get ready for a surprise'
        with smart_open.open(
                f's3://{BUCKET_NAME}/{WRITE_KEY_NAME}.gz',
                mode="wb",
                transport_params={
                    "multipart_upload": True,
                    "min_part_size": 10,
                }
        ) as fout:
            fout.write(contents)

        with smart_open.open(f's3://{BUCKET_NAME}/{WRITE_KEY_NAME}.gz', 'rb') as fin:
            actual = fin.read()

        assert actual == contents

    def test_write_gz_not_using_context_manager(self):
        """Does s3 multipart upload create a compressed file not using context manager but close()?"""
        contents = b'get ready for a surprise'
        fout = smart_open.open(
            f's3://{BUCKET_NAME}/{WRITE_KEY_NAME}.gz',
            mode="wb",
            transport_params={
                "multipart_upload": True,
                "min_part_size": 10,
            }
        )
        fout.write(contents)
        fout.close()

        with smart_open.open(f's3://{BUCKET_NAME}/{WRITE_KEY_NAME}.gz', 'rb') as fin:
            actual = fin.read()

        assert actual == contents

Versions

This looks to be affecting from v7.0.0 to current version

Please provide the output of:

Linux-6.5.7-arch1-1-x86_64-with-glibc2.38
Python 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801]
smart_open 7.0.4

Probably related to this PR https://github.com/piskvorky/smart_open/pull/786 here:

https://github.com/piskvorky/smart_open/pull/786/files#diff-ba9d790d5607c344a97e86f15b3706586deb24d52d52cdf70b9776e3f3412b69L103

Recommendation

Probably closing both, outer and inner in FileLikeProxy will help:


class FileLikeProxy(wrapt.ObjectProxy):
    __inner = ...  # initialized before wrapt disallows __setattr__ on certain objects
...
    def __next__(self):
        return self.__wrapped__.__next__()

    def close(self):
        try:
            self.__wrapped__.close()
        finally:
            self.__inner.close()

I can open a PR if that helps.

Thanks for reading!!

Checklist

Before you create the issue, please make sure you have:

ddelange commented 1 month ago

Sounds like indeed this case was not covered. Can you open a PR?

jbarragan-bridge commented 1 month ago

Thanks @ddelange , I created PR https://github.com/piskvorky/smart_open/pull/838 to address the issue

I am NOT sure if that is a good solution; please let me know if I should consider another approach. Thanks for all the help!!

ddelange commented 1 month ago

@jbarragan-bridge @mpenkov this issue can be closed with 7.0.5 released

jbarragan-bridge commented 1 month ago

Yes, thank you!!