Closed: gerritholl closed this pull request 2 years ago.
Merging #418 (8520670) into main (70f7fbd) will increase coverage by 0.00%. The diff coverage is 100.00%.
@@           Coverage Diff           @@
##            main     #418    +/-  ##
=======================================
  Coverage   93.95%   93.96%
=======================================
  Files          65       65
  Lines       11269    11287     +18
=======================================
+ Hits        10588    10606     +18
  Misses        681      681
Flag | Coverage Δ |
---|---|
unittests | 93.96% <100.00%> (+<0.01%) :arrow_up: |
Flags with carried forward coverage won't be shown.
Impacted Files | Coverage Δ |
---|---|
pyresample/bucket/__init__.py | 97.95% <100.00%> (+0.10%) :arrow_up: |
pyresample/test/test_bucket.py | 100.00% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
I don't understand why this fails. It works in practice.
It also seems to be very slow ☹
Computing br.get_abs_max() for a full disc image takes 20.7 seconds on my machine. Even just constructing the dask graph (calling br.get_abs_max() without computing) takes 24.9 seconds, so in total it takes around 45 seconds to get the absolute maximum for a full disc image… not great.
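A rough way to split those two costs is a timing sketch like the one below; it is not self-contained by design and assumes `br` is an existing BucketResampler and `data` the dask array passed to get_abs_max, as elsewhere in this thread:

```python
import time

t0 = time.perf_counter()
result = br.get_abs_max(data)      # only builds the dask graph
t1 = time.perf_counter()
values = result.compute()          # actually executes the graph
t2 = time.perf_counter()
print(f"graph construction: {t1 - t0:.1f} s, compute: {t2 - t1:.1f} s")
```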
I think this was @zxdawn's problem in the other PR. It is hard to get this stuff to perform well.
@djhoese Yes, #368. That's the slow process of calculating the min/max of group results. Maybe @dcherian has some ideas? He's developing flox, which supports dask for groupby().
I'm using it in the parallax correction (https://github.com/pytroll/satpy/pull/1904), which, as it stands now using .get_abs_max(), is too slow to be used operationally (but I still need to check whether that has causes other than .get_abs_max() being slow).
@gerritholl Here's the question I asked on StackOverflow. The root cause is probably the slow pandas groupby(). You can test how much time it costs.
It's hard to see what's happening here, but you could try something like:

flox.groupby_reduce(data, by=???, expected_groups=((0, 1, 2, 3),), isbin=True, func="nanmax")

Passing isbin=True treats expected_groups like bin edges. It also uses pd.cut, whose defaults are different from np.digitize, so be careful there.
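For reference, a self-contained sketch of that call on made-up data; it assumes flox.groupby_reduce takes the array first, then the array to group by, and returns the reduced values together with the bins it used:

```python
import numpy as np
import flox

values = np.array([0.5, -2.0, 1.5, 3.2, 2.7])   # data to reduce
coord = np.array([0.1, 0.9, 1.2, 2.5, 2.6])     # what each value is binned on
edges = (0, 1, 2, 3)                            # bin edges, as in the call above

result, groups = flox.groupby_reduce(
    values, coord, expected_groups=(edges,), isbin=True, func="nanmax"
)
# One nanmax per bin; note that pd.cut bins are right-closed by default,
# unlike np.digitize, so values on a bin edge may land in a different bin.
print(result, groups)
```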
By default flox implements whatever dask.dataframe does for dask.array. So if that's working, this should work too (and hopefully better).
A minimal example would be useful. I've been trying to collect problematic usecases and eventually hope to write some "user story" like blogposts so we can do some community-wide brainstorming.
The really cool thing about flox here is that you can do "custom aggregations", which stuff a lot of work into a single task (you can do this with dask.dataframe.groupby too). So you can accumulate max and min at the same time (2x fewer tasks!) and do your funny where business in the finalize step. See https://flox.readthedocs.io/en/latest/custom.html for some documentation (but really it's best to just look at how things work in flox/aggregations.py).
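To make that concrete, here is a rough sketch of what such a custom aggregation for "signed value with the largest magnitude" could look like. It mimics the pattern in the flox custom-aggregation docs, but the exact Aggregation arguments used here are an assumption and should be checked against the current flox API:

```python
import numpy as np
from flox import Aggregation

# Accumulate max and min per group in one pass; pick whichever has the
# larger absolute value in the finalize step (the "funny where business").
abs_max = Aggregation(
    name="abs_max",
    chunk=("nanmax", "nanmin"),     # reductions applied per chunk
    combine=("nanmax", "nanmin"),   # how intermediate results are merged
    finalize=lambda mx, mn: np.where(np.abs(mn) > np.abs(mx), mn, mx),
    fill_value=(-np.inf, np.inf),
    final_fill_value=np.nan,
)

# Hypothetical usage, mirroring the groupby_reduce call sketched earlier:
# result, groups = flox.groupby_reduce(values, coord, expected_groups=(edges,),
#                                      isbin=True, func=abs_max)
```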
Thanks @dcherian for the reply! I don't have a minimal example yet. In my non-minimal real use case, I have 3712×3712 ≅ 13.7M bins. 99% of the bins have exactly one value. About 0.5% have 2 values and less than 0.1% have 3 or more values, with the largest number of values in the image being 5. Slightly less than 0.5% have 0 values. The chunks are much larger than the bins. I have no time dimension. My aim is to get the largest absolute value in each bin, or nan for bins without any values.
I suspect that my current approach is wildly suboptimal, regardless of whether I use dask or not. For the 99% of bins with a single value I should just keep that value; only bins with 2 or more values should need any calculations.
I should certainly look at flox. Unfortunately, I'm not very well versed in either dask or pandas, so I'm not sure yet how to approach optimising my implementation while keeping it flexible and generic enough to fit in pyresample.
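Purely to illustrate those semantics (the signed value with the largest magnitude per bin, NaN for empty bins), here is a tiny numpy-only sketch with made-up bin indices; it ignores dask and the real bucketing machinery entirely:

```python
import numpy as np

n_bins = 6
bin_idx = np.array([0, 0, 2, 3, 3, 3, 5])              # bin of each input pixel
values = np.array([1.0, -4.0, 2.5, -1.0, 3.0, -2.0, 0.5])

maxes = np.full(n_bins, -np.inf)
mins = np.full(n_bins, np.inf)
np.maximum.at(maxes, bin_idx, values)                   # running max per bin
np.minimum.at(mins, bin_idx, values)                    # running min per bin

abs_max = np.where(np.abs(mins) > np.abs(maxes), mins, maxes)
abs_max[np.isinf(maxes)] = np.nan                       # bins that received no values
# abs_max -> [-4. , nan, 2.5, 3. , nan, 0.5]
```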
Thanks @gerritholl, what are your chunk sizes then?
> My aim is to get the largest absolute value in each bin, or nan for bins without any values.
It sounds like this is mostly reindexing and a tiny amount of groupby-reduction. Are the lat, lon for the input image stored as numpy arrays, i.e. do you know ahead of time which x, y pixels belong to which bin?
I'm guessing these are dask arrays (based on the documentation for bucketing). In that case, any general solution (like the one in flox) will effectively rechunk to one chunk along x,y for the output grid. Can you start with that chunking scheme instead?
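As a small illustration of that rechunking suggestion (the shape and chunk sizes here are made up to match the ones mentioned later in the thread):

```python
import dask.array as da

# Tiled chunks, as they might come from the file reader
data = da.zeros((3712, 3712), chunks=(1024, 1024))

# One chunk for the whole image, the layout a general groupby solution
# would effectively end up with anyway
data_single = data.rechunk(data.shape)
print(data.chunks, "->", data_single.chunks)
```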
> I should certainly look at flox.
It's not worth it right now; grouping by multiple dask variables has not been implemented yet.
It would help if you provided some example code; I can take a look next week, but no promises. I'd need the array to group, the arrays that we are grouping by (lat, lon I'm guessing), and the output grid. It's OK if the stuff is being read from a file.
Each of the arrays for data, lat, and lon has shape (3712, 3712) and chunksize (1024, 1024), all with dtype float64.

I really appreciate your offer of assistance! I will work on some example code.
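A possible starting point for that example code is sketched below; the create_area_def and BucketResampler calls are written from memory of the pyresample API (and get_abs_max is the method from this PR), so argument names and order should be double-checked:

```python
import numpy as np
import dask.array as da
from pyresample import create_area_def
from pyresample.bucket import BucketResampler

shape, chunks = (3712, 3712), (1024, 1024)

# Made-up stand-ins with the shapes, chunks, and dtype described above
data = da.random.random(shape, chunks=chunks) - 0.5
lon = da.from_array(np.tile(np.linspace(-60, 60, shape[1]), (shape[0], 1)),
                    chunks=chunks)
lat = da.from_array(np.tile(np.linspace(60, -60, shape[0])[:, None], (1, shape[1])),
                    chunks=chunks)

target = create_area_def("target", {"proj": "latlong"},
                         area_extent=(-60, -60, 60, 60), shape=shape)

br = BucketResampler(target, lon, lat)
abs_max = br.get_abs_max(data)     # the method added in this PR
print(abs_max.compute())
```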
Likely related to the slowness is the observation that the unit test passes only with CustomScheduler(max_computes=3), not with 2 or 1.
> Likely related to the slowness is the observation that the unit test passes only with CustomScheduler(max_computes=3), not with 2 or 1.
I suppose that ideally, they should pass with CustomScheduler(max_computes=0), but that's not the case for .get_max() or .get_min() either.
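For context, CustomScheduler is the test helper that limits how many dask computations may happen inside a block. A minimal sketch of how it is typically used; the import path and the exception type are assumptions and should be checked against pyresample/test/utils.py:

```python
import dask
import dask.array as da
from pyresample.test.utils import CustomScheduler  # assumed import path

arr = da.arange(10, chunks=5)

# Passes: the block triggers exactly one compute, within the allowed budget.
with dask.config.set(scheduler=CustomScheduler(max_computes=1)):
    arr.sum().compute()

# max_computes=0 forbids any compute at all inside the block, which is the
# ideal the bucket resampler methods should eventually meet.
try:
    with dask.config.set(scheduler=CustomScheduler(max_computes=0)):
        arr.sum().compute()
except RuntimeError:
    print("a compute happened even though none were allowed")
```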
Seeing that the slowness is probably a consequence of .get_max() and .get_min() being slow (they share almost all code between them), which may be due to the premature calculations, is this something I should fix here, or may I defer it to another PR (such as #368)?
I think that's up to @pnuu or @zxdawn and you to agree on. I'd be OK with a future PR, but would also be OK if it was part of this PR. And yes, ideally there would be zero computes inside the resampler. If there have to be any, then we at least want to make them intentional and retained, so they only have to happen a minimal number of times and preferably don't require the user to compute the same tasks again when they compute their final result.
I would like to get this PR merged, so that the tests on https://github.com/pytroll/satpy/pull/1904 can pass and a first version of parallax correction might be merged into Satpy main. If this PR gets merged, I will open a pyresample issue noting that:

- get_abs_max should work with 0 computes, and (related)
- get_abs_max should be faster.

For both, it would be very helpful to complete https://github.com/pytroll/pyresample/pull/368.
@zxdawn do you have time to give a formal review of this PR?
@djhoese sorry, I'm on vacation in Iceland and will be back on 10th April. If I find any free time, I will check it.
> I'm on vacation
Get off of GitHub!
@gerritholl Just one small comment: when I see "absolute max", the maximum of the absolute values is what comes to mind. Actually, your function returns the original value which has the largest absolute value. Maybe it's better to clarify that in the function?
> @gerritholl Just one small comment: when I see "absolute max", the maximum of the absolute values is what comes to mind. Actually, your function returns the original value which has the largest absolute value. Maybe it's better to clarify that in the function?
Right! Clarified in the method docstring.
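A tiny example of that distinction, with made-up values (plain numpy, just to pin down the naming):

```python
import numpy as np

vals = np.array([-5.0, 3.0])
# "abs max" as implemented here: the signed value with the largest magnitude
print(vals[np.argmax(np.abs(vals))])   # -5.0, not 5.0
```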
The latest tests pass if and only if the latest version of https://github.com/pytroll/pyresample/pull/368 is merged first.
The test passes only with #368, due to the requirement for there to be no calculations. So it would be better to merge #368 first.
Implement get_abs_max on the BucketResampler class.
git diff origin/main **/*py | flake8 --diff