o-smirnov opened this issue 4 years ago
```
shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:phase --cmin -1 --cmax 1 --cnum 64 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow
```
1065s, RAM touched 300Gb but I'll take it, it's a tough plot.
Same settings but 16 colors. Note this is 4e+10 points!
```
shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:phase --cmin -1 --cmax 1 --cnum 16 --xmin -42000 --xmax 42000 --ymin -42000 --ymax 42000 -z 1000 --bmap rainbow
```
707s (the box was under some contention, though). RAM ~200Gb.
2e+10 points. I think I've broken correlation selection though, as it plotted all four (but why not 4e+10 then? Odd):
```
shadems ms-4k-cal.ms -x FREQ -y CORRECTED_DATA:XX:amp -c ANTENNA1 --cnum 64 -z 1000
```
830s. Didn't notice the RAM. :) OK time to put a proper profiler in.
> 1065s, RAM touched 300Gb but I'll take it, it's a tough plot.
I think it may be possible to recover a couple of factors in RAM usage. Running the numbers:
2 × 1000 × 4096 × 4 × 16 bytes (broadcasted MS data) + 1024 × 1024 × 64 × 8 bytes (image) ~ 1012MB per thread.
Multiply by 64 threads ~ 64GB. Let's say a fudge factor of 2 ~ 128GB.
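As a sanity check, the back-of-envelope estimate above can be reproduced in a few lines. Everything here comes from the numbers quoted in this thread (nothing is measured):

```python
# Memory model sketch: broadcasted MS chunk + per-thread category image.
row_chunk, nchan, ncorr = 1000, 4096, 4
complex_bytes = 16                  # complex128 visibilities
npix, ncat, float_bytes = 1024, 64, 8

ms_chunk = 2 * row_chunk * nchan * ncorr * complex_bytes  # broadcasted MS data
image = npix * npix * ncat * float_bytes                  # 1024^2 x 64-category image
per_thread = ms_chunk + image

print(per_thread / 2**20)           # 1012.0 MiB per thread
print(64 * per_thread * 2 / 2**30)  # 64 threads, fudge factor of 2: 126.5 GiB
```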
My instinct would be to check the broadcast. The fix @jskenyon put into ragavi avoided a lot of extraneous communication in the graph.
Updated estimate to include categories in image
Updated again to cater for 64 categories
OK, systematically now. This is a 1e+10 point plot with minimal I/O (only UVW) and no colors. Ran for chunk sizes of 1000, 5000 and 10000 like so:
```
shadems ms-4k-cal.ms -x U -y V -z 10000 -s "z{row_chunk_size}.tree" --profile
```
Tree vs original reduction (chunk sizes 1000, 5000, 10000):
tree reduction | original reduction |
---|---|
```
shadems ms-4k-cal.ms -x U -y V -c ANTENNA1 -z 10000 -s "z{row_chunk_size}.tree" --profile
```
Now with 16 colours, chunk size 1000, 5000, 10000. Note how the original reduction code breaks down at lower chunk sizes:
tree reduction | original reduction |
---|---|
The task ordering looks pretty good for 10,000 rows on minsize.ms. Colour indicates priority (in the round nodes), with red prioritised first and blue last. There are four independent colour streams, corresponding to four MS chunks.
Here's another view of the same graph, with task names in the round nodes.
I'm fiddling to try and get both names and priority in the same plot; it's a bit difficult to decipher both at the same time.
Generated by calling the visualize method on a dask array/dataframe collection:

```python
R.visualize("graph.pdf")
R.visualize("graph.png", order="color")
```
OK, now we make it harder. 5e+9 points, but they have to come from disk.
```
shadems ms-4k-cal.ms -x FREQ -y CORRECTED_DATA:XX:amp -c ANTENNA1 --cnum 64 -z 10000 -s "z{row_chunk_size}.tree" --profile
```
Chunk size 1000, 5000, 10000. Same picture really:
tree reduction | original reduction |
---|---|
Another very realistic case. 5e+9 points in total, two fields iterated over.
```
shadems ms-4k-cal.ms -x CORRECTED_DATA:I:real -y CORRECTED_DATA:I:imag -c ANTENNA1 --cnum 64 --iter-field -z 10000 -s z{row_chunk_size}.tree --profile
```
Chunk size 10000, tree reduction (separate profiles for two fields): | |
---|---|
Another realistic case, two fields, UV-plot coloured by phase.
```
shadems ms-4k-cal.ms -x U -y V -c CORRECTED_DATA:I:phase --cmin -2 --cmax 2 --cnum 64 --iter-field -z 10000 -s "z{row_chunk_size}.tree" --profile
```
Chunk size 10000, 1000, tree reduction:

0408 | gaincal |
---|---|
Chunk size 10000, original reduction. Faster but hungrier:
0408 | gaincal |
---|---|
Chunk size 1000, original reduction blew out my 512G RAM so I gave up.
@o-smirnov These look really cool. Are these last ones also from the tree reduction? Also, what is the importance of colouring by phase, if I may ask?
The top set is for tree reduction (sorry, editing the comment under you!), bottom set original reduction.
Phase should be ==0 on properly corrected calibrator data, so a good plot of this kind is a bland plot. The stripy pattern in the left column suggests a problem in 0408 -- most likely an unmodelled field source contributing a fringe.
Ooh I see now. Would you know what is causing that weird peak in memory towards the end in those original reductions?
No, but @sjperkins has also been wondering...
> Ooh I see now. Would you know what is causing that weird peak in memory towards the end in those original reductions?
@Mulan-94 Are you referring to this kind of plot (taken from https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618532281)?
If so, the climb in memory at the end is almost certainly the `np.stack` in the combine function of the original reduction.
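To make the `np.stack` point concrete, here is a toy illustration (not datashader's actual code): a flat combine materialises all chunk aggregates in one big stacked array, while a tree-style combine with a small batch size never holds more than a handful of aggregates at a time.

```python
import numpy as np

k, npix = 64, 256
aggs = [np.ones((npix, npix)) for _ in range(k)]  # stand-ins for per-chunk images

# Flat combine: np.stack allocates a (k, npix, npix) array all at once,
# i.e. k * npix^2 * 8 bytes live simultaneously -- the end-of-run memory spike.
flat = np.stack(aggs).sum(axis=0)

# Tree-style combine, batches of 4: each step stacks at most 4 aggregates.
level = aggs
while len(level) > 1:
    level = [np.stack(level[i:i + 4]).sum(axis=0)
             for i in range(0, len(level), 4)]
tree = level[0]

assert np.array_equal(flat, tree)   # same result, very different peak memory
```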
Having said that, I'm bothered by this kind of pattern in the tree reduction (https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618645310).
I would've hoped for a flatter heartbeat pattern, without those two peaks. I'll try to block off some time next week to look at this.
As a weird check, could someone try running a test using dask==2.9.1 as opposed to the latest version? While the ordering plot Simon included looks correct, I would be interested to see if the ordering changes introduced after 2.9.1 are affecting the results.
Oh, and for ordering plots, I was informed of the following useful invocation:
```python
dask.visualize(dsk,
               color='order',
               cmap='autumn',
               filename='output.pdf',
               node_attr={'penwidth': '10'})
```
It just makes the colours more obvious.
So, in a weird coincidence, I was doing some profiling of CubiCalV2 this morning and noticed some very familiar features - specifically those beautiful mountainous ramps in memory usage followed by a precipitous drop. Here are some plots:

Now this genuinely bothered me, as CubiCalV2 is supposed to be clean - it is almost completely functional and all results are written to disk as part of the graph. This was also distressingly similar to #359. As I did in the other issue, I tried embedding a manual garbage collector call in the graph. Behold the results:

Of course, I cannot guarantee that this is the same as the features seen in some of the plots above, but it might be worth checking.
Yes, I'd think the next step would be to try embedding `gc.collect` calls in the tree reduction `wrapped_combine` method -- no need to hang on to MS data or leaf images beyond the `chunk` or `wrapped_combine` stages.
@o-smirnov, or @Mulan-94 would you be willing to pursue the above? Otherwise I'll look at it next week when I've cleared my stack a bit.
I dislike the idea of embedding `gc.collect()` calls in the graph (although this is not a criticism of the necessity to do so and figure this problem out). It may be better to tune this with `gc.set_threshold`, especially the thresholds for the older generations. My working hypothesis for this memory ramping behaviour is now:
I have absolutely no idea what you just said, but that won't stop me from giving it a try anyway!
@JSKenyon, garbage collection is really the answer to everything, isn't it. :)
The only time I got a flat heartbeat was in this case, top left: https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618532281
> I have absolutely no idea what you just said, but that won't stop me from giving it a try anyway!
Argh apologies, was assuming familiarity on the subject matter.
The generational garbage collection bits of this article might explain it quickly: https://stackify.com/python-garbage-collection/.
I have played around with the threshold settings and I actually think that having manual GC calls is safer/better for applications which don't do much allocation. There are three threshold levels corresponding to different generations - let's call them `t0`, `t1`, and `t2`. The GC is triggered when the number of allocations minus the number of deallocations exceeds `t0` (700 by default on my installation). Objects which survive collection are bumped into an older generation. When `t0` has been triggered `t1` times (10 by default on my installation), the GC is triggered again and examines both generation 0 and generation 1. Again, objects which survive are bumped into the third and final generation. Finally, when `t1` has been triggered `t2` times (also 10 by default), a complete GC is triggered. This will clear up any lingering objects. However, if you have a very low number of allocations/deallocations, big arrays on which we do lots of compute will almost certainly end up in the oldest generation. And if we don't do much allocation/deallocation, they will stay there basically indefinitely. While it is plausible to manhandle the threshold settings to alleviate the problem, I would argue that it is more brittle and more prone to failure than having a single GC call at the point where cleanup is a necessity, e.g. at the end of processing a chunk.
Yes, I think applications such as cubicalv2 and shadems have more leeway in using the garbage collector as they wish.
Unfortunately, in this case, I think the optimal place to put the collect calls is in the datashader tree reduction, which is an internal API. Publishing code like that in an API is a hard no for me.
Of course this is Python so we can monkeypatch everything as a workaround, within reason ;-)
Ah that does make sense - if the goal is taming a dependency, then the thresholds are probably the way to go.
I'm happy to monkeypatch it in for now, and if it works and solves the problem, then we discuss how and if to get it into datashader properly.
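The monkeypatching idea could look something like the sketch below: wrap the internal combine function so `gc.collect()` runs after every call, without publishing the GC call into datashader itself. The module path in the comment is purely illustrative, not datashader's real internal layout:

```python
import functools
import gc

def with_gc(fn):
    """Wrap fn so a full garbage collection runs after each call,
    releasing leaf images / MS chunks promptly."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        finally:
            gc.collect()
    return wrapper

# Then, somewhere in shadems startup code, something along the lines of
# (hypothetical module path shown only as a placeholder):
#   import datashader.some_internal_module as ds_internal
#   ds_internal.combine = with_gc(ds_internal.combine)
```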
@sjperkins, where is this stack() call you speak of happening?
> @sjperkins, where is this stack() call you speak of happening?
In datashader's `combine` function; there was more detail on this here:
https://github.com/ratt-ru/shadeMS/issues/29#issuecomment-616470343
The tree reduction still calls combine but in batches with far fewer arrays (roughly, the split_every parameter in dask.array.reduction).
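A minimal sketch of that mechanism, using the public `dask.array.reduction` API (this is an illustration of how `split_every` bounds the batch size in each combine step, not shadems/datashader code):

```python
import dask.array as da

# 16 chunks along axis 0; split_every=4 means each intermediate combine
# task only ever sees (at most) 4 chunk aggregates at a time.
x = da.ones((16, 100), chunks=(1, 100))

r = da.reduction(
    x,
    chunk=lambda b, axis=None, keepdims=False, **kw:
        b.sum(axis=axis, keepdims=keepdims),
    aggregate=lambda b, axis=None, keepdims=False, **kw:
        b.sum(axis=axis, keepdims=keepdims),
    axis=0,
    dtype=x.dtype,
    split_every=4,     # batch size per combine step
)

out = r.compute()
print(out.shape)       # (100,), each element the sum of 16 ones
```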
I added `gc.collect()` to both entry and exit of `wrapped_combine`, but it doesn't seem to make much of a difference (compare to the top two plots in https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618645310):
@sjperkins, if you've re-stocked your beer supplies, tonight would be a good night to open one. Here's a memory profile with dataframe_factory:

df-factory | previous version |
---|---|
Ah great, that looks like a factor of 4 improvement
Also possibly: Big-O space complexity for the win: https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-618360238
> I think it may be possible to recover a couple of factors in RAM usage. Running the numbers:
>
> 2 × 1000 × 4096 × 4 × 16 bytes (broadcasted MS data) + 1024 × 1024 × 64 × 8 bytes (image) ~ 1012MB per thread. Multiply by 64 threads ~ 64GB. Let's say a fudge factor of 2 ~ 128GB.
I'm not sure if the above figures are right for this:
df-factory | previous version |
---|---|
but if they are it's pretty close. It looks like shadems is running 72 threads. On average they're using 500MB each (36GB total), but at peak memory usage they're using ~1GB (75GB total).
@o-smirnov, was the above run done with 1000 rows, 4096 channels ~1024^2 image and 64 categories?
That double-peak in the memory pattern is retained in the new version. I wonder what it is? Images being aggregated to form a final value? Does datashader run through the data twice?
/cc'ing @rubyvanrooyen, who may also find https://github.com/ratt-ru/shadeMS/issues/34#issuecomment-638012083 useful.
@sjperkins: it was 722610 rows, 4k channels, 4 corrs, 16 categories.
The double-peak is there because there's a first run-through to determine the data ranges. We can eliminate that if we fix the plot limits, but we don't always know what to fix them to. #55 will help.
> it was 722610 rows, 4k channels, 4 corrs, 16 categories.
Ah, but what was the -z option, 1000, 10000?
10000
Then maybe we're doing quite well. Let's remove the factor of 2 on the visibilities, because I think numpy broadcasting functionality doesn't actually expand the underlying array to full resolution: it uses 0 strides to give the impression of it. Then:
10,000 x 4,096 x 4 x 16 + 1024 x 1024 x 16 x 8 ~ max 2.5GB per thread
The memory profile is suggesting ~36GB over 72 threads in the average case (i.e. an average of ~500MB per thread) and ~75GB over 72 threads in the peak case (i.e. an average of ~1GB per thread).
I guess the visibility data doesn't stay in memory all that long -- it gets converted to an image before the tree reduction step.
All speculation, but useful to start with some sort of model and refine it.
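The zero-stride behaviour mentioned above is easy to verify directly with `np.broadcast_to`: the broadcast view advertises the full shape but keeps the original one-row buffer, with stride 0 along the repeated axis.

```python
import numpy as np

# One row of 4096 complex128 visibilities: 4096 * 16 bytes = 64 KiB.
x = np.ones((1, 4096), dtype=np.complex128)

# Broadcast to 10000 rows: no copy is made.
y = np.broadcast_to(x, (10000, 4096))

print(y.shape)    # (10000, 4096)
print(y.strides)  # (0, 16) -- stride 0 along rows: every row is the same buffer
print(x.nbytes)   # 65536: the underlying storage is still just one row
```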
I did a few more benchmarks with and without tree reduction ("tree" versus "master") for varying problem sizes and dask chunk sizes: https://www.dropbox.com/sh/m0fch390vliqkkf/AACBcAHkHCZyzsFW3U7dnXcOa?dl=0
Observations:
Tree reduction is always better or equivalent (both in terms of RAM footprint and runtime)
On small problems, there's not much in the comparison
With small chunks (chunk size 1000), performance deteriorates sharply in all cases (but tree reduction still does better). The long problem did not complete for the "master" version; I killed it after 20 minutes (it was using >200Gb and only a single core).
RAM footprint increases with chunk size, while runtime decreases slightly. At a chunk size of 20000, RAM usage is comparable (with or without tree reduction), but with smaller chunks, tree reduction always wins.
The above use 16 categories. I also did a test with a big problem and 64 categories. This is where the non-tree-reduction version breaks apart. The chunk 1000 test ran out of memory. The rest ran, but very memory hungry:
master version | tree reduction |
---|---|
Continuing on from #29, just more systematically.
With tree reduction, 1e+10 points.
Tops out at ~250Gb, runs in 145s.
Blows out my RAM.
Tops out at ~70Gb, runs in 245s (but the run had competition on the box). Going to get more exact numbers from ResourceProfiler shortly.
So @sjperkins, first take: mission accomplished insofar as getting the RAM use under control. Memory-parched users can just dial their chunk size down.