@LachlanStuart I have a couple of updates regarding this PR:

Firstly, I set the default segment size to 100MB instead of 5MB to reduce the number of COS object reads in `annotation()` - when we have 7K actions that each read ~10 objects, it adds up to ~70K COS requests, which is too much to handle in parallel, so we experience READ TIMEOUT errors (this change solves this).
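For a rough sense of the arithmetic (illustrative numbers only - the 7K actions and ~10 reads per action come from the observation above, the rest is assumed):

```python
from math import ceil

n_actions = 7_000                 # parallel annotate() actions (from the observation above)
data_read_per_action_mb = 50      # assumed: roughly 10 segments x 5MB each

for segment_size_mb in (5, 100):
    reads_per_action = ceil(data_read_per_action_mb / segment_size_mb)
    total_requests = n_actions * reads_per_action
    print(f"{segment_size_mb}MB segments -> {reads_per_action} reads/action, "
          f"~{total_requests:,} COS requests in total")
```

With 5MB segments this gives ~70K requests in total; with 100MB segments it drops to ~7K.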
Secondly, another thing I noticed is that when we pickle `formula_images` at the end of each action, its size gets significantly bigger. For example, I tested a `formula_images` dict containing output images with a total size of 1.5MB, and it turned into 2.2GB after pickling. My solution was to use the `npz` format instead - the same 1.5MB dict turns into 1.1GB after `npz` compression.
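A minimal way to reproduce this kind of comparison (synthetic data, so the absolute sizes won't match the numbers above):

```python
import io
import pickle

import numpy as np
from scipy.sparse import coo_matrix

# Small synthetic stand-in for formula_images: a dict of sparse images.
rng = np.random.default_rng(0)
formula_images = {
    i: coo_matrix(np.where(rng.random((200, 200)) > 0.95, rng.random((200, 200)), 0))
    for i in range(50)
}

pickled = pickle.dumps(formula_images, pickle.HIGHEST_PROTOCOL)

npz_buf = io.BytesIO()
# savez_compressed can't do anything special with a dict of coo_matrix -
# it stores it as a pickled object array and applies zlib/Deflate on top.
np.savez_compressed(npz_buf, formula_images=np.array(formula_images, dtype=object))

print(f"pickle: {len(pickled):,} bytes, npz: {len(npz_buf.getvalue()):,} bytes")
```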
With these changes, on the `huge2` dataset, some actions don't finish their job within 10 minutes (the max runtime for serverless functions). After debugging, I found that some actions didn't finish generating all of their output images, and others generated their images successfully but took too long to compress the output into `npz` format.
My questions for the next steps are:

- `npz` compression is much slower than pickle but much more space-efficient - is there any other package you're familiar with that I could test that does the job faster?
- Why does a 1.5MB dict turn into 2.2GB or 1.1GB? Is it related to the `coo_matrix` representation?

@omerb01 It's not possible to accurately predict the number of output images per segment. `process_centr_segment` already tries to use evenly-sized input segments to keep the processing time consistent between segments. I don't think we can improve things on the segmentation side...
Which dataset are you using, and is 2.2GB the total size, or the size of one segment's images? I can't remember any specific stats, but only a very small dataset would have 1.5MB of uncompressed images at this stage. I would expect multiple gigabytes, especially for the huge datasets. Normally we store the images as 16-bit PNGs.
`numpy.savez_compressed` actually just pickles & compresses `formula_images`, because it isn't able to do anything special with `dict`s or `coo_matrix`es. The difference you see is entirely due to the "Deflate" (also called "zlib") compression that `numpy.savez_compressed` uses. Deflate/zlib are actually quite slow - I suggest switching to the Zstandard library, because it compresses data better and is 3-10x faster. Here's how to use it:
```python
import pickle

from zstandard import ZstdCompressor, ZstdDecompressor

def dump_zstd(f, obj):
    # level=1 favours speed; threads=-1 uses all available cores
    with ZstdCompressor(level=1, threads=-1).stream_writer(f) as compressor:
        pickle.dump(obj, compressor, pickle.HIGHEST_PROTOCOL)

def load_zstd(f):
    with ZstdDecompressor().stream_reader(f) as decompressor:
        return pickle.load(decompressor)
```
You can get the Python bindings with `pip install zstandard`, but it uses a C library, so it may need to be added to the Docker image.
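For example, a minimal round trip using the two helpers above (writing to a local temp file here - any binary file-like object works the same way):

```python
obj = {"formula_i": 123, "image_data": list(range(1000))}

with open("/tmp/images.pickle.zst", "wb") as f:
    dump_zstd(f, obj)           # pickles while streaming compressed bytes into f

with open("/tmp/images.pickle.zst", "rb") as f:
    assert load_zstd(f) == obj  # decompresses and unpickles
```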
As another option, it would also be possible to move the saving of images into a separate action that runs after FDR calculation. This would give us the opportunity to repartition the annotations into evenly-sized segments, and it would also mean fewer total images to save (annotations with FDR > 0.5 can be discarded), but it would require adding an extra stage to the pipeline, and fully re-reading the dataset segments, which may take longer than saving/loading the images.
@LachlanStuart
> Which dataset are you using, and is 2.2GB the total size, or the size of one segment's images?
I am working on the `huge2` dataset, and this 1.5MB `formula_images` dict is the size for one segment's images. I used `sys.getsizeof(formula_images)` to measure its memory - maybe it only measures the pointers' memory rather than the actual data. I was struggling to find another way to do this; are you familiar with any other way to measure its memory? If we know the actual data size, we can probably save the images in batches and reduce memory usage.
> I suggest switching to the Zstandard library, because it compresses data better and is 3-10x faster.
I tested this package a bit, and it serialises the `formula_images` dict (the one that pickled into 2.2GB) into 1.06GB, much faster than `npz` (a couple of seconds).
> As another option, it would also be possible to move the saving of images into a separate action that runs after FDR calculation.
I think that for now we don't necessarily need that part, because we can save the images in batches.
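A rough sketch of what saving in batches could look like (hypothetical `compute_image`/`save_batch` callbacks - not the actual PR code):

```python
def coo_nbytes(img):
    # Approximate memory of a coo_matrix: its data, row and col arrays.
    return img.data.nbytes + img.row.nbytes + img.col.nbytes

def generate_and_save_images(formula_is, compute_image, save_batch,
                             max_batch_bytes=512 * 1024 ** 2):
    batch, batch_bytes, batch_i = {}, 0, 0
    for formula_i in formula_is:
        img = compute_image(formula_i)
        batch[formula_i] = img
        batch_bytes += coo_nbytes(img)
        if batch_bytes >= max_batch_bytes:
            save_batch(batch_i, batch)               # flush the current batch to COS
            batch, batch_bytes, batch_i = {}, 0, batch_i + 1
    if batch:
        save_batch(batch_i, batch)                   # flush whatever is left
```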
@LachlanStuart we need to find a solution for the `process_centr_segment()` actions which don't finish their job within 10 minutes - do you have any suggestions? I see this behaviour on some actions of the `huge2` dataset.
> I used `sys.getsizeof(formula_images)` to measure its memory - maybe it only measures the pointers' memory rather than the actual data
That's right - the 1.5MB is just the size of the `dict` and the pointers it contains. It doesn't include the actual size of the keys or values. You could do something like this to get a better estimate, but it would still miss some of the object overhead inside the `coo_matrix`:
```python
import sys

dict_size = sys.getsizeof(formula_images)                        # the dict itself (hash table + pointers)
key_size = sum(sys.getsizeof(k) for k in formula_images.keys())
value_size = sum(img.data.nbytes + img.row.nbytes + img.col.nbytes  # data/row/col arrays of each coo_matrix
                 for img in formula_images.values())
total_size = dict_size + key_size + value_size
```
> we need to find a solution for `process_centr_segment()` actions which don't finish their job within 10min
The amount of work for each job is currently determined by the size of each database segment, configured in `define_centr_segments`. Specifically, these lines.

The obvious fix is to tweak this logic so that it doesn't create DB segments that are too big. The difficult part is figuring out which part of the logic needs to change...
This code assumes that the DS data and the DB data have similar distributions along the m/z spectrum, and that 50MB of dataset or 10k peaks will take roughly the same amount of time to process for all datasets. I feel that these are relatively reasonable assumptions, but I haven't tested them thoroughly.
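As an illustration of those assumptions (not the actual `define_centr_segments` code, which is only linked above), the sizing could look roughly like this:

```python
from math import ceil

def estimate_centr_segm_n(ds_size_mb, total_centroid_peaks):
    # One segment per ~50MB of dataset or per ~10k centroid peaks,
    # whichever produces more segments.
    by_ds_size = ceil(ds_size_mb / 50)
    by_peaks = ceil(total_centroid_peaks / 10_000)
    return max(by_ds_size, by_peaks, 1)

print(estimate_centr_segm_n(ds_size_mb=3_500, total_centroid_peaks=2_000_000))  # -> 200
```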
Do you have some idea of how much variability there is between action durations? If they're all relatively consistent for this dataset (e.g. 8 minutes +/- 4 minutes), then we just need to figure out what makes this dataset run each segment slower than the other datasets, and use that factor to increase `centr_segm_n` by the correct amount. If they're extremely inconsistent, then I'll need to rethink how `centr_segm_n` is calculated, because one of the above assumptions could be wrong.
The alternative fix, which I would prefer not to do due to its complexity, is to rewrite `create_process_segment` in such a way that it can save its partial results when it hits a time limit, then resume processing in another action. This would be the most robust option, but it would be really hacky...
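A rough sketch of that idea (hypothetical `process_one`/`save_partial` callbacks, not the real `create_process_segment`):

```python
import time

ACTION_TIME_BUDGET = 8 * 60  # seconds - leave a margin before the 10-minute hard limit

def process_segment_resumable(formula_is, process_one, save_partial, start_i=0):
    started = time.time()
    results = {}
    for i in range(start_i, len(formula_is)):
        if time.time() - started > ACTION_TIME_BUDGET:
            save_partial(results)   # persist what we have so far
            return i                # index for a follow-up action to resume from
        results[formula_is[i]] = process_one(formula_is[i])
    save_partial(results)
    return None                     # finished, nothing to resume
```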
> when we have 7K actions
Is this through queueing, or are you experimenting with launching >1K actions at once in IBM Cloud?
> I set the default segment size to 100MB instead of 5MB to reduce the number of COS object reads in `annotation()` - when we have 7K actions that each read ~10 objects, it adds up to ~70K COS requests, which is too much to handle in parallel
Coincidentally, @intsco is currently working on a change that might help here. It will load segments lazily, which would mean the action would start by loading 1-2 segments, then spread the rest of the loads across the duration of the action. If the change succeeds in Serverful METASPACE, I'll port it across to Serverless METASPACE.
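Something along these lines (a generic prefetching sketch with a hypothetical `load_segment` function, not @intsco's actual change):

```python
from concurrent.futures import ThreadPoolExecutor

def iter_segments_lazily(segment_keys, load_segment, prefetch=2):
    # Keep only `prefetch` downloads in flight instead of loading every segment upfront.
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        futures = [pool.submit(load_segment, key) for key in segment_keys[:prefetch]]
        next_i = len(futures)
        for i in range(len(segment_keys)):
            segment = futures[i].result()        # block only on the segment needed next
            if next_i < len(segment_keys):
                futures.append(pool.submit(load_segment, segment_keys[next_i]))
                next_i += 1
            yield segment
```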
@LachlanStuart
> Do you have some idea of how much variability there is between action durations?
I tested `annotate()` on the `huge2` dataset with 500 max workers (a new feature - `map()` has a `workers` parameter that defines its maximum number of workers), and this is what I got until the "raised timeout of 10min" exception:

[plot of `annotate()` action durations]

These results support the assumption that `annotate()` action durations are fairly consistent.
Last week I experienced unexpected behaviour with "out of memory" errors inside CF - actions don't finish when they run out of memory; it seems that this has now been fixed. In addition, PyWren isn't able to detect "out of memory" errors from CF, since the action is killed entirely - we are still working on a PyWren-side feature that will be able to do so. Because of these details, I suspect that the actions don't finish their job due to "out of memory", but I need to test a bit more to make sure that this is the cause. I'll keep you updated.
> Is this through queueing, or are you experimenting with launching >1K actions at once in IBM Cloud?
Previously, PyWren had a retry mechanism that launched actions by timestamps a couple of times (this is what I used when I launched these actions); now we have a new invoker that runs in a separate process and is also able to invoke actions with a maximum number of workers.
> Coincidentally, @intsco is currently working on a change that might help here. It will load segments lazily, which would mean the action would start by loading 1-2 segments, then spread the rest of the loads across the duration of the action.
It would be a great addition if we could integrate it into this project.
@omerb01 Thanks for that plot. The action durations vary a lot more than I expected, but it seems that they generally stay within 10 minutes.
> Because of these details, I suspect that the actions don't finish their job due to "out of memory"
I think you're probably right. Based on that distribution of times, it looks unlikely that an action was legitimately running for >10 minutes, given that it looks like only ~3 actions on that chart took longer than 5 minutes.
> > Coincidentally, @intsco is currently working on a change that might help here. It will load segments lazily, which would mean the action would start by loading 1-2 segments, then spread the rest of the loads across the duration of the action.
>
> It would be a great addition if we could integrate it into this project.
It also should reduce peak memory consumption by loading fewer segments at once. I'll prioritize copying the code here as soon as it's ready.
@LachlanStuart I reverted from `zstd` back to `pickle` because I found that it uses too much memory when compressing objects, so I think `pickle` is better since we have a memory manager.
I validated this PR with the `big` dataset and it passes successfully; I haven't checked it with the `huge2` dataset yet.

Ready for a review (:
@LachlanStuart ready for a second review.

Main changes:

- changed the sort order from `mz` ascending to `formula_id` ascending (to control generated memory).
- `formula_images` is now memory-limited: when the max memory limit is reached, it automatically saves all images and clears `formula_images`.
- images are produced with the `cloudobject` class, without the need for keys or prefixes.
- updates to the `Pipeline` class.
- added a `clean()` method to the `Pipeline` class to clean up all PyWren temporary data (a rough sketch of the idea is below).
- updated `get_images()` accordingly.
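For context, a very rough sketch of what the `clean()` idea amounts to (hypothetical `storage` client and `temp_prefix`, not the actual PR code):

```python
class Pipeline:
    def __init__(self, storage, temp_prefix="tmp/"):
        self.storage = storage          # some COS client wrapper
        self.temp_prefix = temp_prefix  # where PyWren temporary objects live

    def clean(self):
        # Delete every temporary object created during the run.
        for key in self.storage.list_keys(self.temp_prefix):
            self.storage.delete_object(key)
```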