Improve loc_trajectory() function performance

open-AIMS / ADRIA.jl

ADRIA: Adaptive Dynamic Reef Intervention Algorithms. A multi-criteria decision support platform for informing reef restoration and adaptation interventions.

MIT License

14 stars 5 forks source link

Improve loc_trajectory() function performance #788

Closed BG-AIMS closed 1 week ago

BG-AIMS commented 3 months ago

Currently the ADRIA.loc_trajectory() function takes a long amount of time to run, particularly for GBR wide analyses. Possible speed up method is to use mapslices() function and specify the dimension to be reduced (:scenarios).

Zapiano commented 1 week ago

Quick comment about this: I don't know how much time I'm supposed to wait, but I'm trying to use ADRIA.metrics.loc_trajectory with 64 scenarios and only Moore cluster locations and it never finishes running. That's actually a problem with ADRIA.metrics.summarize.

(Update: it finished running after more than 10 mins).

But I also had a weird experience with mapslices. If, inside summarize I do

mapslices(metric, data; dims=(:scenarios,))

it works fine, but if I try

mapslices(metric, read(data); dims=(3,)

it also never finishes running.

The reason why I was thinking about doing the latter is because when I do the former, and data is a YAXArray{Float64}, the result is actually a YAXArray{Union{Float64,Missing}} (although there is no missing values).

I don't know if the problem is with JuliennedArrays, but using mapslices with YAXArarys seems to work fine (the only thing is that we need to convert the result to the right type in the end, which is annoying).

BG-AIMS commented 1 week ago

The reason why I was thinking about doing the latter is because when I do the former, and data is a YAXArray{Float64}, the result is actually a YAXArray{Union{Float64,Missing}} (although there is no missing values).

I don't know if the problem is with JuliennedArrays, but using mapslices with YAXArarys seems to work fine (the only thing is that we need to convert the result to the right type in the end, which is annoying).

Yeah that is true for mapslices() and I'm not sure why. I have been doing Float64.(mapslices(function, YAXArray, dims=[:dim])) to create YAXArray{Float64} output with one line.

ConnectedSystems commented 1 week ago

I wonder if it was the change to YAXArray that caused the performance regression - I don't recall having an issue.

That said, one of the reasons why JuliennedArrays was being used, if I remember correctly, was to lazily process higher dimensional data (e.g., 4 or 5).

I think mapslices can do this as well, but just wanted to confirm that this is true?

Zapiano commented 1 week ago

I wonder if it was the change to YAXArray that caused the performance regression - I don't recall having an issue.

Actually, I think you might be right. The current approach (that is not working) is:

data_slices = JuliennedArrays.Slices(data.data, alongs...)
summarized_data = map(metric, data_slices)

But if we change to:

data_slices = JuliennedArrays.Slices(read(data), alongs...)
summarized_data = map(metric, data_slices)

it's actually waaaay better than the mapslices approach !

ConnectedSystems commented 1 week ago

mapslices with YAXArarys seems to work fine (the only thing is that we need to convert the result to the right type in the end, which is annoying).

There's a mapslices method defined for YAXArray types, so likely it's optimised.

https://juliadatacubes.github.io/YAXArrays.jl/dev/UserGuide/compute.html#mapslices

it's actually waaaay better than the mapslices approach !

Hmmmmm.... the issue then is if we happen to be working with a very large dataset. It might take longer (or crash with out of memory error) because you're reading the entire dataset into memory.

Is this likely ever an issue?

An alternative is to do make the code smartly switch between read() + map or just mapslices - get the estimated data size from the YAXArray and read() if it's below available memory, or use mapslices on data.data if not...