ks905383 / xagg

Aggregating gridded data (xarray) to polygons
https://xagg.readthedocs.io/
GNU General Public License v3.0

Improve performance #48

Open masawdah opened 1 year ago

masawdah commented 1 year ago

Thank you for providing this valuable package; we are currently integrating it into our workflow. We would like to suggest the following functionalities and improvements to enhance its speed and efficiency, particularly for larger datasets. These enhancements would also enable the package to handle NaN values effectively and provide support for a wider range of statistics.

  1. Introduce an automatic ESMF environment setup upon importing the 'xagg' package. This would mitigate ImportError occurrences related to ESMF.
  2. Optimize the 'subset_find' function by using a dictionary comprehension to identify common subsets. This significantly improves performance, especially on big datasets, reducing processing time from hours to a few seconds.
  3. Expand the statistical capabilities by incorporating additional statistics such as median, sum, etc.
  4. Enhance NaN handling by introducing interpolation techniques for filling missing values.
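To illustrate the idea behind point 2, here is a minimal, hypothetical sketch (not xagg's actual `subset_find` code) of replacing a nested linear search over coordinates with a dictionary built by comprehension, so each membership test becomes an O(1) hash lookup instead of an O(n) scan:

```python
# Hypothetical sketch of the dictionary-comprehension idea: names and data
# shapes are illustrative, not taken from xagg's internals.
def find_common_subset(coords_a, coords_b):
    """Return (index_in_a, index_in_b) pairs for coordinates present in both lists."""
    # Build the lookup table once with a dict comprehension...
    index_of = {pair: i for i, pair in enumerate(coords_b)}
    # ...then each membership test is a hash lookup, not a scan over coords_b.
    return [(i, index_of[pair]) for i, pair in enumerate(coords_a) if pair in index_of]

pairs = find_common_subset([(0.0, 0.0), (0.5, 1.0), (2.0, 3.0)],
                           [(2.0, 3.0), (0.5, 1.0)])
# pairs -> [(1, 1), (2, 0)]
```

For m and n coordinates this is O(m + n) overall, versus O(m * n) for the nested-loop search, which is where the hours-to-seconds speedup on large grids would come from.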

Overall, these proposed functionalities are aimed at augmenting the package's efficiency, allowing it to handle big datasets with greater speed and accuracy while providing a broader range of statistical capabilities.

Let us know if you would like to merge the proposed functionalities.

ks905383 commented 1 year ago

Thanks for the PR and your work on this! I'm especially excited for the performance updates, and I'm glad that this package is useful to folks.

Just a couple of quick questions on my first read through the proposed changes:

> New statistical capabilities

> Enhanced NaN handling

> Introduce an automatic ESMF environment setup upon importing the 'xagg' package. This implementation would mitigate the occurrence of ImportError related to ESMF.

That's great! From the discussion in #47 I think it would still be good to better isolate ESMF to only when it's needed (and only import it if regridding is actually used) - but this is definitely a good fix until then, and probably after as well.

Also, you should be able to have PRs run through the tests now. I think the reason they've failed is the additional variable names from the new statistical options. That's probably another thing to discuss - changing the way that variable names are output is going to affect existing workflows, so maybe there's a way to adapt that through the existing structure - maybe keeping the old workflow of running one statistical calculation per xa.aggregate() run? (Though I understand why you'd want the ease of multiple calculations easily put into the same output).

Once we've converged on what's best to include, we can work on writing new tests for the new capabilities.

Those are my first thoughts on this - I'm about to be a bit hard to reach for a few days, but I'll keep thinking about it.

masawdah commented 1 year ago

Thanks for your valid points. We would like to clarify the following:

Statistical capabilities

`SUPPORTED_STATISTICS_METHODS = ["weighted_mean", "weighted_sum", "weighted_median", "mean", "sum", "max", "min", "median", "count", "std"]`

We've now added "weighted_sum" and "weighted_median" in addition to the other statistics.
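For reference, here is a hedged numpy sketch of how area-weighted statistics like these can be computed while skipping NaN pixels; the function names and signatures are illustrative, not xagg's internal API:

```python
import numpy as np

# Illustrative implementations of weighted statistics over pixel values,
# where `weights` would be the pixel/polygon overlap areas. NaN values are
# masked out, in the spirit of the NaN handling proposed in this PR.
def weighted_mean(values, weights):
    v, w = np.asarray(values, float), np.asarray(weights, float)
    mask = ~np.isnan(v)
    return np.sum(v[mask] * w[mask]) / np.sum(w[mask])

def weighted_sum(values, weights):
    v, w = np.asarray(values, float), np.asarray(weights, float)
    mask = ~np.isnan(v)
    return np.sum(v[mask] * w[mask])

def weighted_median(values, weights):
    v, w = np.asarray(values, float), np.asarray(weights, float)
    mask = ~np.isnan(v)
    order = np.argsort(v[mask])
    # Value at which the cumulative normalized weight first reaches 0.5.
    cdf = np.cumsum(w[mask][order]) / np.sum(w[mask])
    return v[mask][order][np.searchsorted(cdf, 0.5)]

print(weighted_mean([1.0, 3.0, np.nan], [0.25, 0.75, 0.5]))  # -> 2.5
```

The unweighted variants ("mean", "sum", "median", etc.) would follow from the same masking with uniform weights.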

Output structure

The PR won't pass the tests for now until the test code is adjusted, so we've added a notebook that shows the tests passing locally.

Let us know what you think.

masawdah commented 1 year ago

Hi again. We've made some modifications to the code: the default option for calculations is now `weighted_mean`, in alignment with the current output workflow. Additionally, we've included test_core and test_export in the notebook. After modifying the attribute names to be consistent with 'weighted_mean', the code successfully passed all tests.

carolinegoehner commented 1 year ago

Hello, we realised that the weighted median could not be calculated for a single time step, so we changed the function accordingly. Regarding the count, it divides the sum of the pixel area covered by the polygon by the maximum pixel size, giving an accurate count of covered pixels as a float. Sorry for the many comments over the last two days; since we are out of the office next week, we improved as much as we could. Take your time to review our additions! We are happy to answer any questions and comments as soon as we're back in the office!
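The count described above can be sketched in a few lines; this is an illustrative reconstruction of the described formula, with hypothetical names, not the PR's actual code:

```python
import numpy as np

# Fractional pixel count: total overlap area between a polygon and the grid,
# divided by the (full) pixel area, per the description in this thread.
def covered_pixel_count(overlap_areas, pixel_area):
    """Return the number of grid pixels a polygon covers, as a float."""
    return np.asarray(overlap_areas, float).sum() / pixel_area

# Two fully covered pixels plus one half-covered pixel of unit area:
print(covered_pixel_count([1.0, 1.0, 0.5], pixel_area=1.0))  # -> 2.5
```

Using the maximum pixel size as the divisor keeps the count meaningful even when edge pixels are only partially inside the polygon.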

masawdah commented 1 year ago

Hi @ks905383 , I hope you are doing well. I understand that everyone's schedule can be quite busy, and I appreciate your time. If you've had a chance to review our changes to the PR, we'd be happy to hear your thoughts and feedback.

Thanks :)