@LachlanStuart I have a couple of updates regarding this PR:

Firstly, I set the default segment size to 100MB instead of 5MB to reduce the number of COS object reads in `annotation()` - when we have 7K actions that each read ~10 objects, it adds up to ~70K COS requests, which is too much to handle in parallel, so we experience READ TIMEOUT errors (this change solves this).
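For a rough sense of the arithmetic (illustrative numbers only - the 7K actions and ~10 reads per action come from the observation above, the rest is assumed):

```python
from math import ceil

n_actions = 7_000                 # parallel annotate() actions (from the observation above)
data_read_per_action_mb = 50      # assumed: roughly 10 segments x 5MB each

for segment_size_mb in (5, 100):
    reads_per_action = ceil(data_read_per_action_mb / segment_size_mb)
    total_requests = n_actions * reads_per_action
    print(f"{segment_size_mb}MB segments -> {reads_per_action} reads/action, "
          f"~{total_requests:,} COS requests in total")
```

With 5MB segments this gives ~70K requests in total; with 100MB segments it drops to ~7K.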
Secondly, another thing I noticed is that when we pickle `formula_images` at the end of each action, its size gets significantly bigger. For example, I tested a `formula_images` dict containing output images with a total size of 1.5MB, and it turned into 2.2GB after pickling. My solution was to use the `npz` format instead - the same 1.5MB dict turns into 1.1GB after `npz` compression.
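A minimal way to reproduce this kind of comparison (synthetic data, so the absolute sizes won't match the numbers above):

```python
import io
import pickle

import numpy as np
from scipy.sparse import coo_matrix

# Small synthetic stand-in for formula_images: a dict of sparse images.
rng = np.random.default_rng(0)
formula_images = {
    i: coo_matrix(np.where(rng.random((200, 200)) > 0.95, rng.random((200, 200)), 0))
    for i in range(50)
}

pickled = pickle.dumps(formula_images, pickle.HIGHEST_PROTOCOL)

npz_buf = io.BytesIO()
# savez_compressed can't do anything special with a dict of coo_matrix -
# it stores it as a pickled object array and applies zlib/Deflate on top.
np.savez_compressed(npz_buf, formula_images=np.array(formula_images, dtype=object))

print(f"pickle: {len(pickled):,} bytes, npz: {len(npz_buf.getvalue()):,} bytes")
```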
With these changes, on the `huge2` dataset, some actions don't finish their job within 10 minutes (the max runtime for serverless functions). After debugging, I found that some actions didn't finish generating all of their output images, and others generated their images successfully but took too long to compress the output into `npz` format.
My questions for the next steps are:

- `npz` compression is much slower than pickle but much more space-efficient - is there any other package you're familiar with that I could test that does the job faster?
- Why does a 1.5MB dict turn into 2.2GB or 1.1GB? Is it related to the `coo_matrix` representation?

@omerb01 It's not possible to accurately predict the number of output images per segment. `process_centr_segment` already tries to use evenly-sized input segments to keep the processing time consistent between segments. I don't think we can improve things on the segmentation side...
Which dataset are you using, and is 2.2GB the total size, or the size of one segment's images? I can't remember any specific stats, but only a very small dataset would have 1.5MB of uncompressed images at this stage. I would expect multiple gigabytes, especially for the huge datasets. Normally we store the images as 16-bit PNGs.
`numpy.savez_compressed` actually just pickles & compresses `formula_images`, because it isn't able to do anything special with `dict`s or `coo_matrix`es. The difference you see is entirely due to the "Deflate" (also called "zlib") compression that `numpy.savez_compressed` uses. Deflate/zlib are actually quite slow - I suggest switching to the Zstandard library, because it compresses data better and is 3-10x faster. Here's how to use it:
```python
import pickle

from zstandard import ZstdCompressor, ZstdDecompressor

def dump_zstd(f, obj):
    # level=1 favours speed; threads=-1 uses all available cores
    with ZstdCompressor(level=1, threads=-1).stream_writer(f) as compressor:
        pickle.dump(obj, compressor, pickle.HIGHEST_PROTOCOL)

def load_zstd(f):
    with ZstdDecompressor().stream_reader(f) as decompressor:
        return pickle.load(decompressor)
```
You can get the Python bindings with `pip install zstandard`, but it uses a C library, so it may need to be added to the Docker image.
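For example, a minimal round trip using the two helpers above (writing to a local temp file here - any binary file-like object works the same way):

```python
obj = {"formula_i": 123, "image_data": list(range(1000))}

with open("/tmp/images.pickle.zst", "wb") as f:
    dump_zstd(f, obj)           # pickles while streaming compressed bytes into f

with open("/tmp/images.pickle.zst", "rb") as f:
    assert load_zstd(f) == obj  # decompresses and unpickles
```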
As another option, it would also be possible to move the saving of images into a separate action that runs after FDR calculation. This would give us the opportunity to repartition the annotations into evenly-sized segments, and it would also mean fewer total images to save (annotations with FDR > 0.5 can be discarded), but it would require adding an extra stage to the pipeline, and fully re-reading the dataset segments, which may take longer than saving/loading the images.
@LachlanStuart
> Which dataset are you using, and is 2.2GB the total size, or the size of one segment's images?
I am working on the `huge2` dataset, and this 1.5MB `formula_images` dict is the size for one segment's images. I used `sys.getsizeof(formula_images)` to measure its memory - maybe it only measures the pointers' memory rather than the actual data. I was struggling to find another way to do this; are you familiar with any other way to measure its memory? If we know the actual data size, we can probably save the images in batches and reduce memory usage.
> I suggest switching to the Zstandard library, because it compresses data better and is 3-10x faster.
I tested this package a bit, and it serialises the `formula_images` dict (the one that pickled into 2.2GB) into 1.06GB, much faster than `npz` (a couple of seconds).
> As another option, it would also be possible to move the saving of images into a separate action that runs after FDR calculation.
I think that for now we don't necessarily need that part, because we can save the images in batches.
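A rough sketch of what saving in batches could look like (hypothetical `compute_image`/`save_batch` callbacks - not the actual PR code):

```python
def coo_nbytes(img):
    # Approximate memory of a coo_matrix: its data, row and col arrays.
    return img.data.nbytes + img.row.nbytes + img.col.nbytes

def generate_and_save_images(formula_is, compute_image, save_batch,
                             max_batch_bytes=512 * 1024 ** 2):
    batch, batch_bytes, batch_i = {}, 0, 0
    for formula_i in formula_is:
        img = compute_image(formula_i)
        batch[formula_i] = img
        batch_bytes += coo_nbytes(img)
        if batch_bytes >= max_batch_bytes:
            save_batch(batch_i, batch)               # flush the current batch to COS
            batch, batch_bytes, batch_i = {}, 0, batch_i + 1
    if batch:
        save_batch(batch_i, batch)                   # flush whatever is left
```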
@LachlanStuart we need to find a solution for the `process_centr_segment()` actions which don't finish their job within 10 minutes - do you have any suggestions? I see this behaviour on some actions of the `huge2` dataset.
> I used `sys.getsizeof(formula_images)` to measure its memory - maybe it only measures the pointers' memory rather than the actual data
That's right - the 1.5MB is just the size of the `dict` and the pointers it contains. It doesn't include the actual size of the keys or values. You could do something like this to get a better estimate, but it would still miss some of the object overhead inside the `coo_matrix`:
```python
import sys

dict_size = sys.getsizeof(formula_images)                        # the dict itself (hash table + pointers)
key_size = sum(sys.getsizeof(k) for k in formula_images.keys())
value_size = sum(img.data.nbytes + img.row.nbytes + img.col.nbytes  # data/row/col arrays of each coo_matrix
                 for img in formula_images.values())
total_size = dict_size + key_size + value_size
```
> we need to find a solution for `process_centr_segment()` actions which don't finish their job within 10min
The amount of work for each job is currently determined by the size of each database segment, configured in `define_centr_segments`. Specifically, these lines.

The obvious fix is to tweak this logic so that it doesn't create DB segments that are too big. The difficult part is figuring out which part of the logic needs to change...
This code assumes that the DS data and the DB data have similar distributions along the m/z spectrum, and that 50MB of dataset or 10k peaks will take roughly the same amount of time to process for all datasets. I feel that these are relatively reasonable assumptions, but I haven't tested them thoroughly.
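As an illustration of those assumptions (not the actual `define_centr_segments` code, which is only linked above), the sizing could look roughly like this:

```python
from math import ceil

def estimate_centr_segm_n(ds_size_mb, total_centroid_peaks):
    # One segment per ~50MB of dataset or per ~10k centroid peaks,
    # whichever produces more segments.
    by_ds_size = ceil(ds_size_mb / 50)
    by_peaks = ceil(total_centroid_peaks / 10_000)
    return max(by_ds_size, by_peaks, 1)

print(estimate_centr_segm_n(ds_size_mb=3_500, total_centroid_peaks=2_000_000))  # -> 200
```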
Do you have some idea of how much variability there is between action durations? If they're all relatively consistent for this dataset (e.g. 8 minutes +/- 4 minutes), then we just need to figure out what makes this dataset run each segment slower than the other datasets, and use that factor to increase `centr_segm_n` by the correct amount. If they're extremely inconsistent, then I'll need to rethink how `centr_segm_n` is calculated, because one of the above assumptions could be wrong.
The alternative fix, which I would prefer not to do due to its complexity, is to rewrite `create_process_segment` in such a way that it can save its partial results when it hits a time limit, then resume processing in another action. This would be the most robust option, but it would be really hacky...
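A rough sketch of that idea (hypothetical `process_one`/`save_partial` callbacks, not the real `create_process_segment`):

```python
import time

ACTION_TIME_BUDGET = 8 * 60  # seconds - leave a margin before the 10-minute hard limit

def process_segment_resumable(formula_is, process_one, save_partial, start_i=0):
    started = time.time()
    results = {}
    for i in range(start_i, len(formula_is)):
        if time.time() - started > ACTION_TIME_BUDGET:
            save_partial(results)   # persist what we have so far
            return i                # index for a follow-up action to resume from
        results[formula_is[i]] = process_one(formula_is[i])
    save_partial(results)
    return None                     # finished, nothing to resume
```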
> when we have 7K actions
Is this through queueing, or are you experimenting with launching >1K actions at once in IBM Cloud?
> I set the default segment size to 100MB instead of 5MB to reduce the number of COS object reads in `annotation()` - when we have 7K actions that each read ~10 objects, it adds up to ~70K COS requests, which is too much to handle in parallel
Coincidentally, @intsco is currently working on a change that might help here. It will load segments lazily, which would mean the action would start by loading 1-2 segments, then spread the rest of the loads across the duration of the action. If the change succeeds in Serverful METASPACE, I'll port it across to Serverless METASPACE.
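Something along these lines (a generic prefetching sketch with a hypothetical `load_segment` function, not @intsco's actual change):

```python
from concurrent.futures import ThreadPoolExecutor

def iter_segments_lazily(segment_keys, load_segment, prefetch=2):
    # Keep only `prefetch` downloads in flight instead of loading every segment upfront.
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        futures = [pool.submit(load_segment, key) for key in segment_keys[:prefetch]]
        next_i = len(futures)
        for i in range(len(segment_keys)):
            segment = futures[i].result()        # block only on the segment needed next
            if next_i < len(segment_keys):
                futures.append(pool.submit(load_segment, segment_keys[next_i]))
                next_i += 1
            yield segment
```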
@LachlanStuart
> Do you have some idea of how much variability there is between action durations?
I tested `annotate()` on the `huge2` dataset with 500 max workers (a new feature - `map()` has a `workers` parameter that defines its maximum number of workers), and this is what I got until the "raised timeout of 10min" exception:

[plot of `annotate()` action durations]

These results support the assumption that `annotate()` action durations are fairly consistent.
Last week I experienced unexpected behaviour with "out of memory" errors inside CF - actions don't finish when they run out of memory; it seems that this has now been fixed. In addition, PyWren isn't able to detect "out of memory" errors from CF, since the action is killed entirely - we are still working on a PyWren-side feature that will be able to do so. Because of these details, I suspect that the actions don't finish their job due to "out of memory", but I need to test a bit more to make sure that this is the cause. I'll keep you updated.
> Is this through queueing, or are you experimenting with launching >1K actions at once in IBM Cloud?
Previously, PyWren had a retry mechanism that launched actions by timestamps a couple of times (this is what I used when I launched these actions); now we have a new invoker that runs in a separate process and is also able to invoke actions with a maximum number of workers.
> Coincidentally, @intsco is currently working on a change that might help here. It will load segments lazily, which would mean the action would start by loading 1-2 segments, then spread the rest of the loads across the duration of the action.
It would be a great addition if we could integrate it into this project.
@omerb01 Thanks for that plot. The action durations vary a lot more than I expected, but it seems that they generally stay within 10 minutes.
> Because of these details, I suspect that the actions don't finish their job due to "out of memory"
I think you're probably right. Based on that distribution of times, it looks unlikely that an action was legitimately running for >10 minutes, given that it looks like only ~3 actions on that chart took longer than 5 minutes.
> > Coincidentally, @intsco is currently working on a change that might help here. It will load segments lazily, which would mean the action would start by loading 1-2 segments, then spread the rest of the loads across the duration of the action.
>
> It would be a great addition if we could integrate it into this project.
It also should reduce peak memory consumption by loading fewer segments at once. I'll prioritize copying the code here as soon as it's ready.
@LachlanStuart I reverted from `zstd` back to `pickle` because I found that it uses too much memory when compressing objects, so I think `pickle` is better since we have a memory manager.
I validated this PR with the `big` dataset and it passes successfully; I haven't checked it with the `huge2` dataset yet.

Ready for a review (:
@LachlanStuart ready for a second review.

Main changes:

- changed the sort order from `mz` ascending to `formula_id` ascending (to control generated memory).
- `formula_images` is now memory-limited: when the max memory limit is reached, it automatically saves all images and clears `formula_images`.
- images are produced with the `cloudobject` class, without the need for keys or prefixes.
- updates to the `Pipeline` class.
- added a `clean()` method to the `Pipeline` class to clean up all PyWren temporary data (a rough sketch of the idea is below).
- updated `get_images()` accordingly.
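For context, a very rough sketch of what the `clean()` idea amounts to (hypothetical `storage` client and `temp_prefix`, not the actual PR code):

```python
class Pipeline:
    def __init__(self, storage, temp_prefix="tmp/"):
        self.storage = storage          # some COS client wrapper
        self.temp_prefix = temp_prefix  # where PyWren temporary objects live

    def clean(self):
        # Delete every temporary object created during the run.
        for key in self.storage.list_keys(self.temp_prefix):
            self.storage.delete_object(key)
```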