Open · SchopenhauerZhang opened this issue 4 years ago
Thank you for reporting this.
self._buf.truncate() # (1) drop any buffered data after the current position
self._buf = io.BytesIO() # (2) replace the buffer with a fresh, empty BytesIO
Interesting, wouldn't the garbage collector take care of this? We're throwing away the old value of self._buf entirely in (2). Perhaps something else is holding on to a reference to the old self._buf somewhere...
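For illustration, here is a minimal sketch (the class and variable names are made up) of how a reference held elsewhere would keep the old buffer alive even after the reassignment in (2):

import io
import weakref

class Writer:  # stand-in for MultipartWriter, purely illustrative
    def __init__(self):
        self._buf = io.BytesIO(b"x" * (10 * 1024 * 1024))  # 10 MB in-memory buffer

w = Writer()
pending = [w._buf]         # e.g. another component still referencing the old buffer
old = weakref.ref(w._buf)  # observe the old buffer without keeping it alive

w._buf = io.BytesIO()      # the reassignment in (2)
print(old() is None)       # False: the 10 MB buffer is still alive via `pending`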
I will investigate. Thanks again.
wouldn't the garbage collector take care of this?
Hmm, I use smart_open in a high-concurrency, high-throughput distributed system, so waiting for Python's garbage collector to reclaim the memory is not a good idea. I'd rather release the memory explicitly. I will try reclaiming the memory myself and post the results here. Thanks!
waiting for Python's garbage collector to reclaim the memory is not a good idea.
If you use CPython (the standard Python), there is no "wait". Memory is reclaimed as soon as it's no longer needed, i.e. as soon as its reference count drops to zero.
There are some exceptions with reference cycles, but those are not leaks. I'd expect some other code (boto?) is holding on to the reference longer than it should.
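A quick way to see this behaviour (CPython-specific, minimal sketch, run as a script):

import io
import weakref

buf = io.BytesIO(b"x" * (50 * 1024 * 1024))  # 50 MB held in memory
watcher = weakref.ref(buf)                   # observe the buffer without keeping it alive

buf = io.BytesIO()                           # rebind the only reference to a new, empty buffer
print(watcher() is None)                     # True: CPython freed the 50 MB immediately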
Hello! Any updates on this? I'm trying to stream large files using this library but it seems that the whole file is being kept in memory before it can be written to S3, which negates my reason for wanting to use this library in the first place.
I did a memory benchmark for a 140.6 MB file using memory_profiler, and it seems that the whole file was held in memory for the duration of the upload.
I tried to implement the fix mentioned by the OP, but that doesn't seem to have prevented the memory leak.
By comparison, if I use the vanilla Python open, I use much less memory.
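For reference, a minimal sketch of such a comparison with memory_profiler (file names, bucket, and key are placeholders, not my exact setup):

# benchmark.py -- run with:  python -m memory_profiler benchmark.py
from memory_profiler import profile

import smart_open


@profile
def upload_with_smart_open(src_path, s3_url):
    # Stream the file in 1 MB chunks instead of reading it all at once.
    with open(src_path, "rb") as src, smart_open.open(s3_url, "wb") as dst:
        for chunk in iter(lambda: src.read(1024 * 1024), b""):
            dst.write(chunk)


@profile
def copy_with_builtin_open(src_path, dst_path):
    # Same loop, but writing to local disk with the built-in open().
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(1024 * 1024), b""):
            dst.write(chunk)


if __name__ == "__main__":
    upload_with_smart_open("large_file.gz", "s3://my-bucket/large_file.gz")
    copy_with_builtin_open("large_file.gz", "large_file_copy.gz")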
it seems that the whole file is being kept in memory before it can be written to S3
I'd be surprised if that's really the case. I think it's more likely the entire part (a single part of the multipart upload) is being kept in memory. The default part size is 50MB. You can tweak it by passing min_part_size as a transport_param (please see the README).
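For example (bucket and key below are placeholders), something like this should keep each buffered part at roughly 5 MB, the smallest part size S3 accepts for multipart uploads:

import smart_open

# Keep each buffered part at about 5 MB instead of the 50 MB default.
with smart_open.open(
        "s3://my-bucket/big-file.gz",  # placeholder bucket/key
        "wb",
        transport_params={"min_part_size": 5 * 1024 * 1024},
) as fout:
    for _ in range(20):
        fout.write(b"x" * (1024 * 1024))  # 20 MB of dummy data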
Also, check out https://github.com/RaRe-Technologies/smart_open/pull/547 - it has the option of buffering to disk instead of in memory. It's not ready to merge yet, unfortunately.
@cnmoise Can you reproduce the problem with the newest version of smart_open? We've reworked the buffering in the S3 submodule significantly, and it's likely the problem is no longer there.
Memory leak in MultipartWriter._upload_next_part
The MultipartWriter._upload_next_part function uploads the BytesIO buffer from memory to S3 (as part of the multipart upload started with CreateMultipartUpload), but does not clear the BytesIO buffer afterwards. When I use smart_open.open() to upload a gz file to S3, memory keeps increasing. I checked the code and narrowed it down to MultipartWriter._upload_next_part():
After uploading the BytesIO buffer, the memory is not released.
Bug fix code
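A minimal sketch of the proposed change (the surrounding method body is abbreviated and illustrative, not the library's exact code; only the last two lines are the actual proposal):

import io

class MultipartWriter:                      # abbreviated for illustration only
    def _upload_next_part(self):
        part_num = self._total_parts + 1    # next part number of the multipart upload
        self._buf.seek(0)

        # ... upload the contents of self._buf to S3 as part `part_num` ...

        self._total_parts += 1

        # Proposed fix: release the uploaded bytes instead of keeping them in memory.
        self._buf.truncate()                # (1) drop buffered data after the current position
        self._buf = io.BytesIO()            # (2) start over with a fresh, empty buffer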
Versions
Python 3.7
smart_open 2.0.0
Checklist
https://docs.python.org/zh-cn/3/library/io.html (Python io module documentation)