formula_images memory manager issue

omerb01 commented 4 years ago

https://github.com/metaspace2020/pywren-annotation-pipeline/blob/095b4dcce9141b0f530e94fd163fe3bf1447ea52/annotation_pipeline/image.py#L39

I haven't succeeded to figure out why yet, but it seems that something wrong with the memory manager when it clears formula_images dict. I know that python's garbage collector works from time to time and frees unreachable pointers, so I even tried to execute it explicitly by:

import gc
gc.collect()

and it still shows the same output.

Example activation log of annotate() ("memory1" is action's memory before data is cleared and "memory2" is action's memory after data is cleared - used by pywren_ibm_cloud.utils.get_current_memory_usage()):

    "2020-01-09T15:27:50.528049Z    stdout: ---------------------- FUNCTION LOG ----------------------",
    "2020-01-09T15:27:50.640300Z    stdout: memory: 138.9MiB",
    "2020-01-09T15:27:50.640326Z    stdout: Reading centroids segment metabolomics/tmp/centroids_segments/2341.msgpack",
    "2020-01-09T15:27:50.640972Z    stderr: /action/pywren_ibm_cloud/runtime/function_handler/jobrunner.py:253: FutureWarning: The read_msgpack is deprecated and will be removed in a future version.",
    "2020-01-09T15:27:50.640982Z    stderr: It is recommended to use pyarrow for on-the-wire transmission of pandas objects.",
    "2020-01-09T15:27:50.640987Z    stderr: result = function(**data)",
    "2020-01-09T15:27:50.760181Z    stdout: memory: 141.2MiB",
    "2020-01-09T15:27:50.766257Z    stdout: Reading dataset segments 54-55",
    "2020-01-09T15:27:54.926716Z    stdout: memory: 408.2MiB",
    "2020-01-09T15:27:54.926784Z    stdout: max_formula_images_mb: 540",
    "2020-01-09T15:28:52.023948Z    stdout: memory1: 962.3MiB",
    "2020-01-09T15:28:52.023987Z    stdout: Saving 265 images",
    "2020-01-09T15:28:52.023992Z    stdout: formula_images_size :  568481928",
    "2020-01-09T15:29:08.895328Z    stdout: [DEBUG] pywren_ibm_cloud.storage.backends.ibm_cos.ibm_cos: PUT Object pywren.jobs/2ea8ad/14/M000/00000/data_0.pickle - Size: 542.3MiB - OK",
    "2020-01-09T15:29:09.172729Z    stdout: memory2: 1.5GiB",
    "2020-01-09T15:29:47.074029Z    stdout: memory1: 1.5GiB",
    "2020-01-09T15:29:47.074158Z    stdout: Saving 262 images",
    "2020-01-09T15:29:47.074237Z    stdout: formula_images_size :  566705464",
    "2020-01-09T15:30:04.096653Z    stdout: [DEBUG] pywren_ibm_cloud.storage.backends.ibm_cos.ibm_cos: PUT Object pywren.jobs/2ea8ad/14/M000/00000/data_1.pickle - Size: 540.6MiB - OK",
    "2020-01-09T15:30:04.375794Z    stdout: memory2: 1.5GiB",
    "2020-01-09T15:30:23.727646Z    stdout: memory1: 1.5GiB",
    "2020-01-09T15:30:23.727686Z    stdout: Saving 142 images",
    "2020-01-09T15:30:23.727757Z    stdout: formula_images_size :  274837080",
    "2020-01-09T15:30:32.186205Z    stdout: [DEBUG] pywren_ibm_cloud.storage.backends.ibm_cos.ibm_cos: PUT Object pywren.jobs/2ea8ad/14/M000/00000/data_2.pickle - Size: 262.2MiB - OK",
    "2020-01-09T15:30:32.207781Z    stdout: Centroids segment metabolomics/tmp/centroids_segments/2341.msgpack finished",
    "2020-01-09T15:30:32.224223Z    stdout: ----------------------------------------------------------",

in the example, all "memory2" records should be around 1GB instead of 1.5GB

LachlanStuart commented 4 years ago

The extra 0.5GiB used between memory1 and memory2 is probably caused by the temporary buffer used for pickling inside self._internal_storage.put_object. It's unfortunate that this temporary buffer is needed, but IIRC ibm_boto3 doesn't provide an API for streaming writes.

Python often holds on to memory, instead of releasing it back to the OS, so that it can quickly allocate memory again without having to call the OS to give it more memory. I think that's the most likely explanation for why those logs show memory increasing to 1.5GiB and then staying there.

omerb01 commented 4 years ago

@LachlanStuart these spare 0.5GiB could be invested in formula images and reduce COS requests number. do you have a suggestion to use it better? how to force python to release this memory?

LachlanStuart commented 4 years ago

@omerb01 Python releasing the memory won't help. After the formula images are cleared and the temporary buffer for pickling is gone, Python just holds the empty memory so that it doesn't need to waste time returning it to the OS and then re-requesting it from the OS later.

The issue is that during saving (inside internal_storage.put_object) , it currently needs memory for both the in-memory images, and the temporary buffer for pickling. This probably isn't a firm requirement though - pickle can output in a streaming way using pickle.dump instead of pickle.dumps, which requires very little extra memory. The issue is that AFAICT ibm_boto3 doesn't have an easy API to receive the data in a streaming way and upload it to COS.

The ibm_s3transfer package, used internally by ibm_boto3, seems to have a UploadNonSeekableInputManager class for this purpose, which can be accessed by calling cos_client.upload_fileobj with a readable, non-seekable file-like object as an input. I have no idea how to actually use it here though. pickle.dump wants some sort of writable file-like object and ibm_s3transfer.manager.TransferManager.upload wants some sort of readable file-like object. I can't see anything in the Python standard library for bridging a writer and a reader...

omerb01 commented 4 years ago

@LachlanStuart I see... at the begging of the project, we needed to use COS with file-like-objects too, and I found an open-source GitHub repository which may help, it's called smart_open https://github.com/RaRe-Technologies/smart_open This package makes some sort of a wrapper for s3 clients to use it by file-like-objects with special open() method as much like built-in open() for local files. are you familiar with this package? I'm not sure if it will help in this case.

LachlanStuart commented 4 years ago

@omerb01 I haven't seen that before, but it looks like it would work. It seems to work by making a writable object that starts a multi-part upload and pushes a new part every time enough data has been written to it.

omerb01 commented 4 years ago

@LachlanStuart seems that it didn't solve the issue, I used smart_open inside save_images() of the manager's class:

from smart_open import open
ibm_cos_session = ibm_boto3.session.Session(aws_access_key_id='****',
                                            aws_secret_access_key='****')
transport_params = {
    'session': ibm_cos_session,
    'resource_kwargs': {'endpoint_url': 'https://s3.****.cloud-object-storage.appdomain.cloud'}
}
with open(f's3://{bucket}/{key}', 'wb', transport_params=transport_params) as data_stream:
    pickle.dump(self.formula_images, data_stream)

example of an activation log: (memory1 was measured before saving, memory2 was measured after clearing self.formula_images)

    "2020-01-15T14:22:19.856601Z    stdout: ---------------------- FUNCTION LOG ----------------------",
    "2020-01-15T14:22:19.856606Z    stdout: Reading centroids segment metabolomics/tmp/centroids_segments/2341.msgpack",
    "2020-01-15T14:22:19.856954Z    stderr: /action/pywren_ibm_cloud/runtime/function_handler/jobrunner.py:253: FutureWarning: The read_msgpack is deprecated and will be removed in a future version.",
    "2020-01-15T14:22:19.856960Z    stderr: It is recommended to use pyarrow for on-the-wire transmission of pandas objects.",
    "2020-01-15T14:22:19.856964Z    stderr: result = function(**data)",
    "2020-01-15T14:22:19.870137Z    stdout: Reading dataset segments 54-55",
    "2020-01-15T14:22:24.386184Z    stdout: max_formula_images_mb: 412",
    "2020-01-15T14:22:58.793623Z    stdout: memory1:  764.7MiB",
    "2020-01-15T14:22:58.793657Z    stdout: Saving 196 images",
    "2020-01-15T14:23:10.943005Z    stdout: memory2:  1.2GiB",
    "2020-01-15T14:23:32.471742Z    stdout: memory1:  1.2GiB",
    "2020-01-15T14:23:32.471892Z    stdout: Saving 173 images",
    "2020-01-15T14:23:45.249771Z    stdout: memory2:  1.2GiB",
    "2020-01-15T14:24:04.823081Z    stdout: memory1:  1.2GiB",
    "2020-01-15T14:24:04.823113Z    stdout: Saving 171 images",
    "2020-01-15T14:24:18.239402Z    stdout: memory2:  1.2GiB",
    "2020-01-15T14:24:23.985985Z    stdout: memory1:  1.2GiB",
    "2020-01-15T14:24:23.986018Z    stdout: Saving 44 images",
    "2020-01-15T14:24:26.765940Z    stdout: Centroids segment metabolomics/tmp/centroids_segments/2341.msgpack finished",
    "2020-01-15T14:24:26.777186Z    stdout: ----------------------------------------------------------",

LachlanStuart commented 4 years ago

@omerb01 I dug a bit deeper and found out a few frustrating things about Python's pickler:

Python 3.6 and 3.7's C implementation of pickle.dump never flushes its temporary output buffer - instead it just makes a single call to write with the entire file. This means it always allocates a large temporary buffer (approximately the same size as the input data) to hold the output.
- The pure-Python implementation (pickle._dump) doesn't have this problem - it does MANY small writes instead, which is a bit slower but is memory efficient.
- Python 3.8's C implementation also doesn't have this problem - it tries to flush its temporary buffer every 64kiB, so it doesn't get too big.
Numpy arrays (used internally by coo_matrix) create a temporary copy of their data when they're pickled, effectively doubling their memory usage while that array exists. Pickle implementations "memoize" this temporary copy, meaning that it can't be GC'd until the pickle process is finished.
- All implementations of the pickler have a "fast mode", which turns off memoization, meaning that the temporary copies can be GC'd immediately after being written. However, fast mode doesn't support recursive objects, and is less efficient if objects are repeated multiple times in the pickle. Also, it's officially deprecated :(

Here's my test code in case you want to try it:

import pickle, resource, numpy as np
from scipy.sparse import coo_matrix
class BlackHole:
    def __init__(self):
        self.cnt = 0
        self.biggest = 0
        self.total = 0
    def __del__(self):
        print(f'writes: {self.cnt} biggest: {self.biggest} total: {self.total}')
    def write(self, bytes):
        self.cnt += 1
        self.biggest = max(self.biggest, len(bytes))
        self.total += len(bytes)

# big_dict = dict((i, coo_matrix(np.arange(10000))) for i in range(10000)) # coo_matrixes
big_dict = dict((i, np.arange(10000)) for i in range(10000)) # numpy arrays
# big_dict = dict((i, list(range(10000))) for i in range(10000)) # pure Python objects

print(f'Max memory usage before: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss} kiB')
# Uncomment one of the below implementations

# normal pickle (C implementation)
# pickle.dump(big_dict, BlackHole())

# Python implementation
# pickle._dump(big_dict, BlackHole())

# C implementation ("fast mode")
# p = pickle.Pickler(BlackHole())
# p.fast = True
# p.dump(big_dict)
# del p # needed to trigger BlackHole.__del__

# Python implementation ("fast mode")
# p = pickle._Pickler(BlackHole())
# p.fast = True
# p.dump(big_dict)
# del p # needed to trigger BlackHole.__del__

print(f'Max memory usage after: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss} kiB')

Note that the "Max memory usage" metric can't be reset, so Python should be restarted every time after this test.

Currently the code needs memory for 3x the size of formula_images: 1x for the original formula_images instance, 1x for the output buffer due to the "never flushing the output buffer" bug, and 1x due to the "numpy copies" bug

To fix the "never flushing the output buffer" problem, there are these options:

Do nothing - keep using small formula image chunks
Switch to the Python implementation of the pickler, which takes ~4x as long to run for a dict of coo_matrixs
Upgrade pywren-annotation-pipeline to Python 3.8

To fix the "numpy copies" problem, the options are:

Do nothing - keep using small formula image chunks
Use "fast mode", even though it's officially deprecated
Subclass the Pickler and implement custom logic to be smarter when pickling numpy arrays

For these steps forward, I think we also need to decide which solutions should be implemented in PyWren's put_object, and which should be implemented specific to this code. I'm sure other PyWren users would benefit from being able to pickle with less memory usage, but there's a question of how hacky we're willing to go...

LachlanStuart commented 4 years ago

The "numpy copies" problem has actually already been reported to Python, but it unfortunately seems to be stuck in PR: https://github.com/python/cpython/pull/13036

JosepSampe commented 4 years ago

@LachlanStuart based on your script above, I created a script that prints the actual memory peak when doing an operation, for example, pickling objects: https://gist.github.com/JosepSampe/25d2f1bdf8250ec56f4e739d8c2b4e6e

Based on the results, seems that using fast mode in Python3.7 & Python3.8 (either C and Python) does not have extra memory consumption:

Python3.6

Allocating source data...
=> peak memory usage: 1.660 GB

Dumping (C)...
writes: 1 biggest: 1602049329 total: 1602049329
done in 1.669s
=> peak memory usage: 4.836 GB

Dumping (Python)...
writes: 880076 biggest: 79997 total: 1602049327
done in 3.208s
=> peak memory usage: 3.300 GB

Dumping (C Fast mode)...
writes: 1 biggest: 1604829770 total: 1604829770
done in 1.287s
=> peak memory usage: 3.274 GB

Dumping (Python fast mode)...
writes: 1150023 biggest: 79997 total: 1604829768
done in 3.830s
=> peak memory usage: 1.670 GB

Python3.7

Allocating source data...
=> peak memory usage: 1.661 GB

Dumping (C)...
writes: 20001 biggest: 80381 total: 1602049329
done in 1.114s
=> peak memory usage: 3.281 GB

Dumping (Python)...
writes: 890076 biggest: 79992 total: 1602049327
done in 3.065s
=> peak memory usage: 3.300 GB

Dumping (C Fast mode)...
writes: 20001 biggest: 80493 total: 1604829770
done in 0.680s
=> peak memory usage: 1.662 GB

Dumping (Python fast mode)...
writes: 1160023 biggest: 79992 total: 1604829768
done in 3.656s
=> peak memory usage: 1.662 GB

Python3.8

Allocating source data...
=> peak memory usage: 1.665 GB

Dumping (C)...
writes: 30001 biggest: 80264 total: 1601390005
done in 1.120s
=> peak memory usage: 3.283 GB

Dumping (Python)...
writes: 60003 biggest: 80253 total: 1601390003
done in 2.897s
=> peak memory usage: 3.296 GB

Dumping (C Fast mode)...
writes: 30001 biggest: 80356 total: 1604849779
done in 0.716s
=> peak memory usage: 1.687 GB

Dumping (Python fast mode)...
writes: 60003 biggest: 80347 total: 1604849777
done in 3.908s
=> peak memory usage: 1.687 GB

@LachlanStuart This is not the same you stated in the previous comment, so can you confirm this? In my case pickle.dump() performs only one write in Python3.6 and much more writes in Python3.7 & Python3.8

LachlanStuart commented 4 years ago

@JosepSampe I don't have the Python3.7 environment that I used for that experiment anymore. It's possible I wasn't on the latest sub-version (3.7.6) because I just grabbed an existing environment that was already set up.

I've re-tested on 3.7.6 and I got the same results as you.

gilv commented 4 years ago

@omerb01 what is the status of it?

omerb01 commented 4 years ago

@gilv solving this issue requires to move into Python 3.7

metaspace2020 / Lithops-METASPACE

formula_images memory manager issue #55