Closed: gerritholl closed this pull request 2 years ago.
Merging #418 (8520670) into main (70f7fbd) will increase coverage by 0.00%. The diff coverage is 100.00%.
@@           Coverage Diff           @@
##            main     #418    +/-  ##
=======================================
  Coverage   93.95%   93.96%
=======================================
  Files          65       65
  Lines       11269    11287     +18
=======================================
+ Hits        10588    10606     +18
  Misses        681      681
Flag | Coverage Δ |
---|---|
unittests | 93.96% <100.00%> (+<0.01%) :arrow_up: |
Flags with carried forward coverage won't be shown.
Impacted Files | Coverage Δ |
---|---|
pyresample/bucket/__init__.py | 97.95% <100.00%> (+0.10%) :arrow_up: |
pyresample/test/test_bucket.py | 100.00% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
I don't understand why this fails. It works in practice.
It also seems to be very slow ☹
Computing br.get_abs_max() for a full disc image takes 20.7 seconds on my machine. Even just constructing the dask graph (calling br.get_abs_max() without computing) takes 24.9 seconds, so in total it takes around 45 seconds to get the absolute maximum for a full disc image… not great.
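A rough way to split those two costs is a timing sketch like the one below; it is not self-contained by design and assumes `br` is an existing BucketResampler and `data` the dask array passed to get_abs_max, as elsewhere in this thread:

```python
import time

t0 = time.perf_counter()
result = br.get_abs_max(data)      # only builds the dask graph
t1 = time.perf_counter()
values = result.compute()          # actually executes the graph
t2 = time.perf_counter()
print(f"graph construction: {t1 - t0:.1f} s, compute: {t2 - t1:.1f} s")
```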
I think this was @zxdawn's problem in the other PR. It is hard to get this stuff to perform well.
@djhoese Yes, #368. That's the slow process of calculating the min/max of group results. Maybe @dcherian has some ideas? He's developing flox, which supports dask for groupby().
I'm using it in the parallax correction (https://github.com/pytroll/satpy/pull/1904), which, as it stands now using .get_abs_max(), is too slow to be used operationally (but I still need to check whether that has causes other than .get_abs_max() being slow).
@gerritholl Here's the question I asked on StackOverflow. The root cause is probably the slow pandas groupby(). You can test how much time it costs.
It's hard to see what's happening here, but you could try something like:

flox.groupby_reduce(data, by=???, expected_groups=((0, 1, 2, 3),), isbin=True, func="nanmax")

Passing isbin=True treats expected_groups like bin edges. It also uses pd.cut, whose defaults are different from np.digitize, so be careful there.
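For reference, a self-contained sketch of that call on made-up data; it assumes flox.groupby_reduce takes the array first, then the array to group by, and returns the reduced values together with the bins it used:

```python
import numpy as np
import flox

values = np.array([0.5, -2.0, 1.5, 3.2, 2.7])   # data to reduce
coord = np.array([0.1, 0.9, 1.2, 2.5, 2.6])     # what each value is binned on
edges = (0, 1, 2, 3)                            # bin edges, as in the call above

result, groups = flox.groupby_reduce(
    values, coord, expected_groups=(edges,), isbin=True, func="nanmax"
)
# One nanmax per bin; note that pd.cut bins are right-closed by default,
# unlike np.digitize, so values on a bin edge may land in a different bin.
print(result, groups)
```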
By default flox implements whatever dask.dataframe does for dask.array. So if that's working, this should work too (and hopefully better).
A minimal example would be useful. I've been trying to collect problematic usecases and eventually hope to write some "user story" like blogposts so we can do some community-wide brainstorming.
The really cool thing about flox here is that you can do "custom aggregations", which stuff a lot of work into a single task (you can do this with dask.dataframe.groupby too). So you can accumulate max and min at the same time (2x fewer tasks!) and do your funny where business in the finalize step. See https://flox.readthedocs.io/en/latest/custom.html for some documentation (but really it's best to just look at how things work in flox/aggregations.py).
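To make that concrete, here is a rough sketch of what such a custom aggregation for "signed value with the largest magnitude" could look like. It mimics the pattern in the flox custom-aggregation docs, but the exact Aggregation arguments used here are an assumption and should be checked against the current flox API:

```python
import numpy as np
from flox import Aggregation

# Accumulate max and min per group in one pass; pick whichever has the
# larger absolute value in the finalize step (the "funny where business").
abs_max = Aggregation(
    name="abs_max",
    chunk=("nanmax", "nanmin"),     # reductions applied per chunk
    combine=("nanmax", "nanmin"),   # how intermediate results are merged
    finalize=lambda mx, mn: np.where(np.abs(mn) > np.abs(mx), mn, mx),
    fill_value=(-np.inf, np.inf),
    final_fill_value=np.nan,
)

# Hypothetical usage, mirroring the groupby_reduce call sketched earlier:
# result, groups = flox.groupby_reduce(values, coord, expected_groups=(edges,),
#                                      isbin=True, func=abs_max)
```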
Thanks @dcherian for the reply! I don't have a minimal example yet. In my non-minimal real use case, I have 3712×3712 ≅ 13.7M bins. 99% of the bins have exactly one value. About 0.5% have 2 values and less than 0.1% have 3 or more values, with the largest number of values in the image being 5. Slightly less than 0.5% have 0 values. The chunks are much larger than the bins. I have no time dimension. My aim is to get the largest absolute value in each bin, or nan for bins without any values.
I suspect that my current approach is wildly suboptimal, regardless of whether I use dask or not. For the 99% of bins with a single value I should just keep that value; only bins with 2 or more values should need any calculations.
I should certainly look at flox. Unfortunately, I'm not very well versed in either dask or pandas, so I'm not sure yet how to approach optimising my implementation while keeping it flexible and generic enough to fit in pyresample.
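Purely to illustrate those semantics (the signed value with the largest magnitude per bin, NaN for empty bins), here is a tiny numpy-only sketch with made-up bin indices; it ignores dask and the real bucketing machinery entirely:

```python
import numpy as np

n_bins = 6
bin_idx = np.array([0, 0, 2, 3, 3, 3, 5])              # bin of each input pixel
values = np.array([1.0, -4.0, 2.5, -1.0, 3.0, -2.0, 0.5])

maxes = np.full(n_bins, -np.inf)
mins = np.full(n_bins, np.inf)
np.maximum.at(maxes, bin_idx, values)                   # running max per bin
np.minimum.at(mins, bin_idx, values)                    # running min per bin

abs_max = np.where(np.abs(mins) > np.abs(maxes), mins, maxes)
abs_max[np.isinf(maxes)] = np.nan                       # bins that received no values
# abs_max -> [-4. , nan, 2.5, 3. , nan, 0.5]
```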
Thanks @gerritholl, what are your chunk sizes then?
> My aim is to get the largest absolute value in each bin, or nan for bins without any values.
It sounds like this is mostly reindexing and a tiny amount of groupby-reduction. Are the lat, lon for the input image stored as numpy arrays, i.e. do you know ahead of time which x, y pixels belong to which bin?
I'm guessing these are dask arrays (based on the documentation for bucketing). In that case, any general solution (like the one in flox) will effectively rechunk to one chunk along x,y for the output grid. Can you start with that chunking scheme instead?
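As a small illustration of that rechunking suggestion (the shape and chunk sizes here are made up to match the ones mentioned later in the thread):

```python
import dask.array as da

# Tiled chunks, as they might come from the file reader
data = da.zeros((3712, 3712), chunks=(1024, 1024))

# One chunk for the whole image, the layout a general groupby solution
# would effectively end up with anyway
data_single = data.rechunk(data.shape)
print(data.chunks, "->", data_single.chunks)
```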
> I should certainly look at flox.
It's not worth it right now; grouping by multiple dask variables has not been implemented yet.
It would help if you provided some example code; I can take a look next week, but no promises. I'd need the array to group, the arrays that we are grouping by (lat, lon I'm guessing), and the output grid. It's OK if the stuff is being read from a file.
Each of the arrays for data, lat, and lon has shape (3712, 3712) and chunksize (1024, 1024), all with dtype float64.

I really appreciate your offer of assistance! I will work on some example code.
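A possible starting point for that example code is sketched below; the create_area_def and BucketResampler calls are written from memory of the pyresample API (and get_abs_max is the method from this PR), so argument names and order should be double-checked:

```python
import numpy as np
import dask.array as da
from pyresample import create_area_def
from pyresample.bucket import BucketResampler

shape, chunks = (3712, 3712), (1024, 1024)

# Made-up stand-ins with the shapes, chunks, and dtype described above
data = da.random.random(shape, chunks=chunks) - 0.5
lon = da.from_array(np.tile(np.linspace(-60, 60, shape[1]), (shape[0], 1)),
                    chunks=chunks)
lat = da.from_array(np.tile(np.linspace(60, -60, shape[0])[:, None], (1, shape[1])),
                    chunks=chunks)

target = create_area_def("target", {"proj": "latlong"},
                         area_extent=(-60, -60, 60, 60), shape=shape)

br = BucketResampler(target, lon, lat)
abs_max = br.get_abs_max(data)     # the method added in this PR
print(abs_max.compute())
```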
Likely related to the slowness is the observation that the unit test passes only with CustomScheduler(max_computes=3), not with 2 or 1.
> Likely related to the slowness is the observation that the unit test passes only with CustomScheduler(max_computes=3), not with 2 or 1.
I suppose that ideally, they should pass with CustomScheduler(max_computes=0), but that's not the case for .get_max() or .get_min() either.
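For context, CustomScheduler is the test helper that limits how many dask computations may happen inside a block. A minimal sketch of how it is typically used; the import path and the exception type are assumptions and should be checked against pyresample/test/utils.py:

```python
import dask
import dask.array as da
from pyresample.test.utils import CustomScheduler  # assumed import path

arr = da.arange(10, chunks=5)

# Passes: the block triggers exactly one compute, within the allowed budget.
with dask.config.set(scheduler=CustomScheduler(max_computes=1)):
    arr.sum().compute()

# max_computes=0 forbids any compute at all inside the block, which is the
# ideal the bucket resampler methods should eventually meet.
try:
    with dask.config.set(scheduler=CustomScheduler(max_computes=0)):
        arr.sum().compute()
except RuntimeError:
    print("a compute happened even though none were allowed")
```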
Seeing that the slowness is probably a consequence of .get_max() and .get_min() being slow (they share almost all code between them), which may be due to the premature calculations, is this something I should fix here, or may I defer it to another PR (such as #368)?
I think that's up to @pnuu or @zxdawn and you to agree on. I'd be OK with a future PR, but would also be OK if it was part of this PR. And yes, ideally there would be zero computes inside the resampler. If there have to be any, then we at least want to make them intentional and retained, so they only have to happen a minimal number of times and preferably don't require the user to compute the same tasks again when they compute their final result.
I would like to get this PR merged, so that the tests on https://github.com/pytroll/satpy/pull/1904 can pass and a first version of parallax correction might be merged into Satpy main. If this PR gets merged, I will open a pyresample issue noting that:

- get_abs_max should work with 0 computes, and (related)
- get_abs_max should be faster.

For both, it would be very helpful to complete https://github.com/pytroll/pyresample/pull/368.
@zxdawn do you have time to give a formal review of this PR?
@djhoese sorry, I'm on vacation in Iceland and will be back on 10th April. If I find any free time, I will check it.
> I'm on vacation
Get off of GitHub!
@gerritholl Just one small comment: when I see "absolute max", the maximum of the absolute values is what comes to mind. Actually, your function returns the original value which has the largest absolute value. Maybe it's better to clarify that in the function?
> @gerritholl Just one small comment: when I see "absolute max", the maximum of the absolute values is what comes to mind. Actually, your function returns the original value which has the largest absolute value. Maybe it's better to clarify that in the function?
Right! Clarified in the method docstring.
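A tiny example of that distinction, with made-up values (plain numpy, just to pin down the naming):

```python
import numpy as np

vals = np.array([-5.0, 3.0])
# "abs max" as implemented here: the signed value with the largest magnitude
print(vals[np.argmax(np.abs(vals))])   # -5.0, not 5.0
```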
The latest tests pass if and only if the latest version of https://github.com/pytroll/pyresample/pull/368 is merged first.
The test passes only with #368, due to the requirement for there to be no calculations. So it would be better to merge #368 first.
Implement get_abs_max on the BucketResampler class.
git diff origin/main **/*py | flake8 --diff