piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

Potential memory leak in MultipartWriter._upload_next_part #511

Open SchopenhauerZhang opened 4 years ago

SchopenhauerZhang commented 4 years ago

Memory leak in MultipartWriter._upload_next_part

The MultipartWriter._upload_next_part function uploads the BytesIO buffer from memory to S3 (via the multipart upload API), but it does not release that BytesIO buffer. When I use smart_open.open() to upload a gz file to S3, memory keeps increasing. I checked the code and narrowed it down to MultipartWriter._upload_next_part():

        self._buf.seek(0)
        part = self._mp.Part(part_num)
        upload = _retry_if_failed(functools.partial(part.upload, Body=self._buf))
        self._parts.append({'ETag': upload['ETag'], 'PartNumber': part_num})
        logger.debug("upload of part #%i finished" % part_num)

        self._total_parts += 1
        self._buf = io.BytesIO()

After the BytesIO buffer is uploaded, its memory is not released.

Proposed bug-fix code:

        self._buf.seek(0)
        part = self._mp.Part(part_num)
        upload = _retry_if_failed(functools.partial(part.upload, Body=self._buf))
        self._parts.append({'ETag': upload['ETag'], 'PartNumber': part_num})
        logger.debug("upload of part #%i finished" % part_num)

        self._buf.truncate()  # release the buffer's memory to avoid the leak
        self._total_parts += 1
        self._buf = io.BytesIO()

Versions

Python 3.7, smart_open 2.0.0

Checklist

https://docs.python.org/zh-cn/3/library/io.html

mpenkov commented 4 years ago

Thank you for reporting this.

self._buf.truncate()  # (1)
self._buf = io.BytesIO()  # (2)

Interesting, wouldn't the garbage collector take care of this? We're throwing away the old value of self._buf entirely in (2). Perhaps something else is holding on to a reference of the old self._buf somewhere...

I will investigate. Thanks again.
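
For anyone who wants to poke at this themselves, here is a rough diagnostic sketch (not part of smart_open, and not the actual investigation): tracemalloc reports which source lines still own live allocations after a streamed upload. The bucket and key below are placeholders.

    import tracemalloc
    import smart_open

    tracemalloc.start()

    # Stream ~100 MB to S3 in 1 MB writes; bucket and key are placeholders.
    with smart_open.open('s3://my-bucket/out.bin', 'wb') as fout:
        for _ in range(100):
            fout.write(b'x' * (1024 * 1024))

    # The top entries show which source lines own the allocations still alive here.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:10]:
        print(stat)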

SchopenhauerZhang commented 4 years ago

wouldn't the garbage collector take care of this?

Hmm, I use smart_open in a high-concurrency, high-throughput distributed system, so waiting for Python's garbage collector to deal with it is not a good idea. I would rather reclaim the memory explicitly. I will try that and post the result here later. Thanks!

piskvorky commented 4 years ago

wait the garbage collector(python's) to deal it is not good idea.

If you use CPython (the standard Python) there is no "wait": memory is reclaimed as soon as it is no longer needed, i.e. when its reference count drops to zero.

There are some exceptions involving reference cycles, but those are not leaks. I'd expect some other code (boto?) is holding on to the reference longer than it should.
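
To illustrate that point, here is a standalone sketch (not smart_open code): under CPython, rebinding the buffer reclaims the old one immediately, unless some other reference keeps it alive.

    import io
    import weakref

    class PartBuffer(io.BytesIO):
        pass  # subclass only so a weak reference can be attached

    buf = PartBuffer(b'x' * (50 * 1024 * 1024))  # simulate a 50 MB part
    ref = weakref.ref(buf)
    buf = io.BytesIO()                           # rebind, as _upload_next_part does
    print(ref())                                 # None: the old buffer was freed at once

    buf = PartBuffer(b'x' * (50 * 1024 * 1024))
    ref = weakref.ref(buf)
    keeper = buf                                 # e.g. a reference held by another library
    buf = io.BytesIO()
    print(ref())                                 # still alive: something else holds it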

cnmoise commented 3 years ago

Hello! Any updates on this? I'm trying to stream large files using this library but it seems that the whole file is being kept in memory before it can be written to S3, which negates my reason for wanting to use this library in the first place.

I did a memory benchmark for a 140.6 MB file using memory_profiler, and it seems that the whole file was held in memory for the duration. (Screenshot: memory_profiler output for smart_open.)

I tried to implement the fix mentioned by OP, but that doesn't seem to have prevented the memory leak.

By comparison, if I use the vanilla Python open I use much less memory. (Screenshot: memory_profiler output for built-in open.)
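
For reference, a benchmark along these lines can be reproduced with memory_profiler roughly as follows; this is a sketch with a placeholder file and bucket, not the exact script behind the numbers above.

    from memory_profiler import profile

    import smart_open

    @profile
    def upload(local_path, s3_uri):
        # Copy the local file to S3 in 1 MB chunks rather than reading it whole.
        with open(local_path, 'rb') as src, smart_open.open(s3_uri, 'wb') as dst:
            for chunk in iter(lambda: src.read(1024 * 1024), b''):
                dst.write(chunk)

    if __name__ == '__main__':
        upload('large_file.gz', 's3://my-bucket/large_file.gz')

Running the script prints per-line memory usage for the decorated function, which makes it easy to see whether memory grows with the file size or stays near the part size.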

mpenkov commented 3 years ago

it seems that the whole file is being kept in memory before it can be written to S3

I'd be surprised if that's really the case. I think it's more likely the entire part (a single part of the multipart upload) is being kept in memory. The default part size is 50MB. You can tweak it by passing min_part_size as a transport_param (please see the README).
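
For example (bucket, key, and the chosen size below are placeholders), something like this buffers roughly 5 MiB per part instead of the default:

    import smart_open

    # 5 MiB is the smallest part size S3 accepts for a multipart upload.
    params = {'min_part_size': 5 * 1024 * 1024}

    with smart_open.open('s3://my-bucket/big-file.gz', 'wb', transport_params=params) as fout:
        fout.write(b'some compressed bytes')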

Also, check out https://github.com/RaRe-Technologies/smart_open/pull/547 - it has the option of buffering to disk instead of in memory. It's not ready to merge yet, unfortunately.

mpenkov commented 3 years ago

@cnmoise Can you reproduce the problem with the newest version of smart_open? We've reworked the buffering in the S3 submodule significantly, and it's likely the problem is no longer there.