o-smirnov opened this issue 4 years ago
```
shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:phase --cmin -1 --cmax 1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow
```
1065s, RAM touched 300Gb but I'll take it, it's a tough plot.
Same settings but 16 colors. Note this is 4e+10 points!
```
shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:phase --cmin -1 --cmax 1 --cnum 16 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow
```
707s (the box was under some contention, though). RAM ~200Gb.
2e+10 points. I think I've broken correlation selection though, as it plotted all four (but why not 4e+10 then? Odd):
```
shadems ms-4k-cal.ms -x FREQ -y CORRECTED_DATA:XX:amp -c ANTENNA1 --cnum 64 -z 1000
```
830s. Didn't notice the RAM. :) OK time to put a proper profiler in.
> 1065s, RAM touched 300Gb but I'll take it, it's a tough plot.
I think it may be possible to recover a couple of factors in RAM usage. Running the numbers:
2 × 1000 × 4096 × 4 × 16 bytes (broadcasted MS data) + 1024 × 1024 × 64 × 8 bytes (image) ~ 1012MB per thread.
Multiply by 64 threads ~ 64GB. Let's say a fudge factor of 2 ~ 128GB.
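As a sanity check, the back-of-envelope estimate above can be reproduced in a few lines. Everything here comes from the numbers quoted in this thread (nothing is measured):

```python
# Memory model sketch: broadcasted MS chunk + per-thread category image.
row_chunk, nchan, ncorr = 1000, 4096, 4
complex_bytes = 16                  # complex128 visibilities
npix, ncat, float_bytes = 1024, 64, 8

ms_chunk = 2 * row_chunk * nchan * ncorr * complex_bytes  # broadcasted MS data
image = npix * npix * ncat * float_bytes                  # 1024^2 x 64-category image
per_thread = ms_chunk + image

print(per_thread / 2**20)           # 1012.0 MiB per thread
print(64 * per_thread * 2 / 2**30)  # 64 threads, fudge factor of 2: 126.5 GiB
```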
My instinct would be to check the broadcast. The fix @jskenyon put into ragavi avoided a lot of extraneous communication in the graph.
Updated estimate to include categories in image
Updated again to cater for 64 categories
OK, systematically now. This is a 1e+10 point plot with minimal I/O (only UVW) and no colors. Ran for chunk sizes of 1000, 5000 and 10000 like so:
```
shadems ms-4k-cal.ms -x U -y V -z 10000 -s "z{row_chunk_size}.tree" --profile
```
Tree vs original reduction (chunk sizes 1000, 5000, 10000):
tree reduction | original reduction |
---|---|
```
shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 -z 10000 -s "z{row_chunk_size}.tree" --profile
```
Now with 16 colours, chunk size 1000, 5000, 10000. Note how the original reduction code breaks down at lower chunk sizes:
tree reduction | original reduction |
---|---|
The task ordering looks pretty good for 10,000 rows on minsize.ms. Colour indicates priority (in the round nodes), with red prioritised first and blue last. There are four independent colour streams, corresponding to four MS chunks.
Here's another view of the same graph, with task names in the round nodes.
I'm fiddling to try and get both names and priority in the same plot; it's a bit difficult to decipher both at the same time.
Generated by calling the visualize method on a dask array/dataframe collection:

```python
R.visualize("graph.pdf")
R.visualize("graph.png", order="color")
```
OK, now we make it harder. 5e+9 points, but they have to come from disk.
```
shadems ms-4k-cal.ms -x FREQ -y CORRECTED_DATA:XX:amp -c ANTENNA1 --cnum 64 -z 10000 -s "z{row_chunk_size}.tree" --profile
```
Chunk size 1000, 5000, 10000. Same picture really:
tree reduction | original reduction |
---|---|
Another very realistic case. 5e+9 points in total, two fields iterated over.
```
shadems ms-4k-cal.ms -x CORRECTED_DATA:I:real -y CORRECTED_DATA:I:imag -c ANTENNA1 --cnum 64 --iter-field -z 10000 -s z{row_chunk_size}.tree --profile
```
Chunk size 10000, tree reduction (separate profiles for two fields): | |
---|---|
Another realistic case, two fields, UV-plot coloured by phase.
```
shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:I:phase --cmin -2 --cmax 2 --cnum 64 --iter-field -z 10000 -s "z{row_chunk_size}.tree" --profile
```
Chunk size 10000, 1000, tree reduction:

0408 | gaincal |
---|---|
Chunk size 10000, original reduction. Faster but hungrier:
0408 | gaincal |
---|---|
Chunk size 1000, original reduction blew out my 512G RAM so I gave up.
@o-smirnov These look really cool. Are these last ones also from the tree reduction? Also, what is the importance of colouring by phase, if I may ask?
The top set is for tree reduction (sorry, editing the comment under you!), bottom set original reduction.
Phase should be ==0 on properly corrected calibrator data, so a good plot of this kind is a bland plot. The stripy pattern in the left column suggests a problem in 0408 -- most likely an unmodelled field source contributing a fringe.
Ooh I see now. Would you know what is causing that weird peak in memory towards the end in those original reductions?
No, but @sjperkins has also been wondering...
> Ooh I see now. Would you know what is causing that weird peak in memory towards the end in those original reductions?
@Mulan-94 Are you referring to this kind of plot (taken from https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618532281)?
If so, the climb in memory at the end is almost certainly the `np.stack` in the combine function of the original reduction.
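To make the `np.stack` point concrete, here is a toy illustration (not datashader's actual code): a flat combine materialises all chunk aggregates in one big stacked array, while a tree-style combine with a small batch size never holds more than a handful of aggregates at a time.

```python
import numpy as np

k, npix = 64, 256
aggs = [np.ones((npix, npix)) for _ in range(k)]  # stand-ins for per-chunk images

# Flat combine: np.stack allocates a (k, npix, npix) array all at once,
# i.e. k * npix^2 * 8 bytes live simultaneously -- the end-of-run memory spike.
flat = np.stack(aggs).sum(axis=0)

# Tree-style combine, batches of 4: each step stacks at most 4 aggregates.
level = aggs
while len(level) > 1:
    level = [np.stack(level[i:i + 4]).sum(axis=0)
             for i in range(0, len(level), 4)]
tree = level[0]

assert np.array_equal(flat, tree)   # same result, very different peak memory
```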
Having said that, I'm bothered by this kind of pattern in the tree reduction (https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618645310).
I would've hoped for a flatter heartbeat pattern, without those two peaks. I'll try to block off some time next week to look at this.
As a weird check, could someone try running a test using dask==2.9.1 as opposed to the latest version? While the ordering plot Simon included looks correct, I would be interested to see if the ordering changes introduced after 2.9.1 are affecting the results.
Oh, and for ordering plots, I was informed of the following useful invocation:
```python
dask.visualize(dsk,
               color='order',
               cmap='autumn',
               filename='output.pdf',
               node_attr={'penwidth': '10'})
```
It just makes the colours more obvious.
So, in a weird coincidence, I was doing some profiling of CubiCalV2 this morning and noticed some very familiar features - specifically those beautiful mountainous ramps in memory usage followed by a precipitous drop. Here are some plots:

Now this genuinely bothered me, as CubiCalV2 is supposed to be clean - it is almost completely functional and all results are written to disk as part of the graph. This was also distressingly similar to #359. As I did in the other issue, I tried embedding a manual garbage collector call in the graph. Behold the results:

Of course, I cannot guarantee that this is the same as the features seen in some of the plots above, but it might be worth checking.
Yes, I'd think the next step would be to try embedding `gc.collect` calls in the tree reduction `wrapped_combine` method -- no need to hang on to MS data or leaf images beyond the `chunk` or `wrapped_combine` stages.
@o-smirnov, or @Mulan-94 would you be willing to pursue the above? Otherwise I'll look at it next week when I've cleared my stack a bit.
I dislike the idea of embedding `gc.collect()` calls in the graph (although this is not a criticism of the necessity to do so and figure this problem out). It may be better to tune this with `gc.set_threshold`, especially the thresholds for the older generations. My working hypothesis for this memory ramping behaviour is now:
I have absolutely no idea what you just said, but that won't stop me from giving it a try anyway!
@JSKenyon, garbage collection is really the answer to everything, isn't it. :)
The only time I got a flat heartbeat was in this case, top left: https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618532281
> I have absolutely no idea what you just said, but that won't stop me from giving it a try anyway!
Argh apologies, was assuming familiarity on the subject matter.
The generational garbage collection bits of this article might explain it quickly: https://stackify.com/python-garbage-collection/.
I have played around with the threshold settings and I actually think that having manual GC calls is safer/better for applications which don't do much allocation. There are three threshold levels corresponding to different generations - let's call them `t0`, `t1`, and `t2`. The GC is triggered when the number of allocations minus the number of deallocations exceeds `t0` (700 by default on my installation). Objects which survive collection are bumped into an older generation. When `t0` has been triggered `t1` times (10 by default on my installation), the GC is triggered again and examines both generation 0 and generation 1. Again, objects which survive are bumped into the third and final generation. Finally, when `t1` has been triggered `t2` times (also 10 by default), a complete GC is triggered. This will clear up any lingering objects. However, if you have a very low number of allocations/deallocations, big arrays on which we do lots of compute will almost certainly end up in the oldest generation. And if we don't do much allocation/deallocation, they will stay there basically indefinitely. While it is plausible to manhandle the threshold settings to alleviate the problem, I would argue that it is more brittle and more prone to failure than having a single GC call at the point where cleanup is a necessity, e.g. at the end of processing a chunk.
Yes, I think applications such as cubicalv2 and shadems have more leeway in using the garbage collector as they wish.
Unfortunately, in this case, I think the optimal place to put the collect calls is in the datashader tree reduction, which is an internal API. Publishing code like that in an API is a hard no for me.
Of course this is Python so we can monkeypatch everything as a workaround, within reason ;-)
Ah that does make sense - if the goal is taming a dependency, then the thresholds are probably the way to go.
I'm happy to monkeypatch it in for now, and if it works and solves the problem, then we discuss how and if to get it into datashader properly.
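The monkeypatching idea could look something like the sketch below: wrap the internal combine function so `gc.collect()` runs after every call, without publishing the GC call into datashader itself. The module path in the comment is purely illustrative, not datashader's real internal layout:

```python
import functools
import gc

def with_gc(fn):
    """Wrap fn so a full garbage collection runs after each call,
    releasing leaf images / MS chunks promptly."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        finally:
            gc.collect()
    return wrapper

# Then, somewhere in shadems startup code, something along the lines of
# (hypothetical module path shown only as a placeholder):
#   import datashader.some_internal_module as ds_internal
#   ds_internal.combine = with_gc(ds_internal.combine)
```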
@sjperkins, where is this stack() call you speak of happening?
> @sjperkins, where is this stack() call you speak of happening?
In datashader's `combine` function; there was more detail on this here:
https://github.com/ratt-ru/shadeMS/issues/29#issuecomment-616470343
The tree reduction still calls combine but in batches with far fewer arrays (roughly, the split_every parameter in dask.array.reduction).
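A minimal sketch of that mechanism, using the public `dask.array.reduction` API (this is an illustration of how `split_every` bounds the batch size in each combine step, not shadems/datashader code):

```python
import dask.array as da

# 16 chunks along axis 0; split_every=4 means each intermediate combine
# task only ever sees (at most) 4 chunk aggregates at a time.
x = da.ones((16, 100), chunks=(1, 100))

r = da.reduction(
    x,
    chunk=lambda b, axis=None, keepdims=False, **kw:
        b.sum(axis=axis, keepdims=keepdims),
    aggregate=lambda b, axis=None, keepdims=False, **kw:
        b.sum(axis=axis, keepdims=keepdims),
    axis=0,
    dtype=x.dtype,
    split_every=4,     # batch size per combine step
)

out = r.compute()
print(out.shape)       # (100,), each element the sum of 16 ones
```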
I added `gc.collect()` to both entry and exit of `wrapped_combine`, but it doesn't seem to make much of a difference (compare to the top two plots in https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618645310):
@sjperkins, if you've re-stocked your beer supplies, tonight would be a good night to open one. Here's a memory profile with dataframe_factory:

df-factory | previous version |
---|---|
Ah great, that looks like a factor of 4 improvement
Also possibly: Big-O space complexity for the win: https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618360238
> I think it may be possible to recover a couple of factors in RAM usage. Running the numbers:
>
> 2 × 1000 × 4096 × 4 × 16 bytes (broadcasted MS data) + 1024 × 1024 × 64 × 8 bytes (image) ~ 1012MB per thread. Multiply by 64 threads ~ 64GB. Let's say a fudge factor of 2 ~ 128GB.
I'm not sure if the above figures are right for this:
df-factory | previous version |
---|---|
but if they are it's pretty close. It looks like shadems is running 72 threads. On average they're using 500MB each (36GB total), but at peak memory usage they're using ~1GB (75GB total).
@o-smirnov, was the above run done with 1000 rows, 4096 channels ~1024^2 image and 64 categories?
That double-peak in the memory pattern is retained in the new version. I wonder what it is? Images being aggregated to form a final value? Does datashader run through the data twice?
/cc'ing @rubyvanrooyen, who may also find https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-638012083 useful.
@sjperkins: it was 722610 rows, 4k channels, 4 corrs, 16 categories.
The double-peak is there because there's a first run-through to determine the data ranges. We can eliminate that if we fix the plot limits, but we don't always know what to fix them to. #55 will help.
> it was 722610 rows, 4k channels, 4 corrs, 16 categories.
Ah, but what was the -z option, 1000, 10000?
10000
Then maybe we're doing quite well. Let's remove the factor of 2 on the visibilities, because I think numpy broadcasting functionality doesn't actually expand the underlying array to full resolution: it uses 0 strides to give the impression of it. Then:
10,000 x 4,096 x 4 x 16 + 1024 x 1024 x 16 x 8 ~ max 2.5GB per thread
The memory profile is suggesting ~36GB over 72 threads in the average case (i.e. an average of ~500MB per thread) and ~75GB over 72 threads in the peak case (i.e. an average of ~1GB per thread).
I guess the visibility data doesn't stay in memory all that long -- it gets converted to an image before the tree reduction step.
All speculation, but useful to start with some sort of model and refine it.
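The zero-stride behaviour mentioned above is easy to verify directly with `np.broadcast_to`: the broadcast view advertises the full shape but keeps the original one-row buffer, with stride 0 along the repeated axis.

```python
import numpy as np

# One row of 4096 complex128 visibilities: 4096 * 16 bytes = 64 KiB.
x = np.ones((1, 4096), dtype=np.complex128)

# Broadcast to 10000 rows: no copy is made.
y = np.broadcast_to(x, (10000, 4096))

print(y.shape)    # (10000, 4096)
print(y.strides)  # (0, 16) -- stride 0 along rows: every row is the same buffer
print(x.nbytes)   # 65536: the underlying storage is still just one row
```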
I did a few more benchmarks with and without tree reduction ("tree" versus "master") for varying problem sizes and dask chunk sizes: https://www.dropbox.com/sh/m0fch390vliqkkf/AACBcAHkHCZyzsFW3U7dnXcOa?dl=0
Observations:
Tree reduction is always better or equivalent (both in terms of RAM footprint and runtime)
On small problems, there's not much in the comparison
With small chunks (chunk size 1000), performance deteriorates sharply in all cases (but tree reduction still does better). The long problem did not complete for the "master" version; I killed it after 20 minutes (it was using >200Gb and only a single core).
RAM footprint increases with chunk size, while runtime decreases slightly. At a chunk size of 20000, RAM usage is comparable (with or without tree reduction), but with smaller chunks, tree reduction always wins.
The above use 16 categories. I also did a test with a big problem and 64 categories. This is where the non-tree-reduction version breaks apart. The chunk 1000 test ran out of memory. The rest ran, but very memory hungry:
master version | tree reduction |
---|---|
Continuing on from #29, just more systematically.
With tree reduction, 1e+10 points.
Tops out at ~250Gb, runs in 145s.
Blows out my RAM.
Tops out at ~70Gb, runs in 245s (but the run had competition on the box). Going to get more exact numbers from ResourceProfiler shortly.
So @sjperkins, first take: mission accomplished insofar as getting the RAM use under control. Memory-parched users can just dial their chunk size down.