ks905383 / xagg

Aggregating gridded data (xarray) to polygons
https://xagg.readthedocs.io/
GNU General Public License v3.0

Improve performance #48

Open masawdah opened 1 year ago

masawdah commented 1 year ago

Thank you for providing this valuable package; we are currently integrating it into our workflow. We would like to suggest the following functionalities and improvements to enhance its speed and efficiency, particularly for larger datasets. These enhancements would also enable the package to handle NaN values effectively and provide support for a wider range of statistics.

  1. Introduce an automatic ESMF environment setup upon importing the 'xagg' package. This would mitigate ImportError occurrences related to ESMF.
  2. Optimize the 'subset_find' function by using a dictionary comprehension to identify common subsets. This significantly improves performance, especially on big datasets, reducing processing time from hours to a few seconds.
  3. Expand the statistical capabilities by incorporating additional statistics such as median, sum, etc.
  4. Enhance NaN handling by introducing interpolation techniques for filling missing values.
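To illustrate the idea behind point 2, here is a minimal, hypothetical sketch (not xagg's actual `subset_find` code) of replacing a nested linear search over coordinates with a dictionary built by comprehension, so each membership test becomes an O(1) hash lookup instead of an O(n) scan:

```python
# Hypothetical sketch of the dictionary-comprehension idea: names and data
# shapes are illustrative, not taken from xagg's internals.
def find_common_subset(coords_a, coords_b):
    """Return (index_in_a, index_in_b) pairs for coordinates present in both lists."""
    # Build the lookup table once with a dict comprehension...
    index_of = {pair: i for i, pair in enumerate(coords_b)}
    # ...then each membership test is a hash lookup, not a scan over coords_b.
    return [(i, index_of[pair]) for i, pair in enumerate(coords_a) if pair in index_of]

pairs = find_common_subset([(0.0, 0.0), (0.5, 1.0), (2.0, 3.0)],
                           [(2.0, 3.0), (0.5, 1.0)])
# pairs -> [(1, 1), (2, 0)]
```

For m and n coordinates this is O(m + n) overall, versus O(m * n) for the nested-loop search, which is where the hours-to-seconds speedup on large grids would come from.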

Overall, these proposed functionalities are aimed at augmenting the package's efficiency, allowing it to handle big datasets with greater speed and accuracy while providing a broader range of statistical capabilities.

Let us know if you would like to merge the proposed functionalities.

ks905383 commented 1 year ago

Thanks for the PR and your work on this! I'm especially excited for the performance updates, and I'm glad that this package is useful to folks.

Just a couple of quick questions on my first read through the proposed changes:

> New statistical capabilities

> Enhanced NaN handling

> Introduce an automatic ESMF environment setup upon importing the 'xagg' package. This implementation would mitigate the occurrence of ImportError related to ESMF.

That's great! From the discussion in #47 I think it would still be good to better isolate ESMF to only when it's needed (and only import it if regridding is actually used) - but this is definitely a good fix until then, and probably after as well.

Also, you should be able to have PRs run through the tests now. I think the reason they've failed is the additional variable names from the new statistical options. That's probably another thing to discuss - changing the way that variable names are output is going to affect existing workflows, so maybe there's a way to adapt that through the existing structure - maybe keeping the old workflow of running one statistical calculation per xa.aggregate() run? (Though I understand why you'd want the ease of multiple calculations easily put into the same output).

Once we've converged on what's best to include, we can work on writing new tests for the new capabilities.

Those are my first thoughts on this - I'm about to be a bit hard to reach for a few days, but I'll keep thinking about it.

masawdah commented 1 year ago

Thanks for your valid points. We would like to clarify the following:

Statistical capabilities

`SUPPORTED_STATISTICS_METHODS = ["weighted_mean", "weighted_sum", "weighted_median", "mean", "sum", "max", "min", "median", "count", "std"]`

We've now added "weighted_sum" and "weighted_median" in addition to the other statistics.
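For reference, here is a hedged numpy sketch of how area-weighted statistics like these can be computed while skipping NaN pixels; the function names and signatures are illustrative, not xagg's internal API:

```python
import numpy as np

# Illustrative implementations of weighted statistics over pixel values,
# where `weights` would be the pixel/polygon overlap areas. NaN values are
# masked out, in the spirit of the NaN handling proposed in this PR.
def weighted_mean(values, weights):
    v, w = np.asarray(values, float), np.asarray(weights, float)
    mask = ~np.isnan(v)
    return np.sum(v[mask] * w[mask]) / np.sum(w[mask])

def weighted_sum(values, weights):
    v, w = np.asarray(values, float), np.asarray(weights, float)
    mask = ~np.isnan(v)
    return np.sum(v[mask] * w[mask])

def weighted_median(values, weights):
    v, w = np.asarray(values, float), np.asarray(weights, float)
    mask = ~np.isnan(v)
    order = np.argsort(v[mask])
    # Value at which the cumulative normalized weight first reaches 0.5.
    cdf = np.cumsum(w[mask][order]) / np.sum(w[mask])
    return v[mask][order][np.searchsorted(cdf, 0.5)]

print(weighted_mean([1.0, 3.0, np.nan], [0.25, 0.75, 0.5]))  # -> 2.5
```

The unweighted variants ("mean", "sum", "median", etc.) would follow from the same masking with uniform weights.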

Output structure

The PR won't pass the tests for now until the test code is adjusted, so we've added a notebook that shows the tests passing locally.

Let us know what you think.

masawdah commented 1 year ago

Hi again. We've made some modifications to the code: the default option for calculations is now `weighted_mean`, in alignment with the current output workflow. Additionally, we've included test_core and test_export in the notebook. After modifying the attribute names to be consistent with 'weighted_mean', the code successfully passed all tests.

carolinegoehner commented 1 year ago

Hello, we realised that the weighted median could not be calculated for a single time step, so we changed the function accordingly. Regarding the count, it divides the sum of the pixel area covered by the polygon by the maximum pixel size, giving an accurate count of covered pixels as a float. Sorry for the many comments over the last two days; since we are out of the office next week, we improved as much as we could. Take your time to review our additions! We are happy to answer any questions and comments as soon as we're back in the office!
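The count described above can be sketched in a few lines; this is an illustrative reconstruction of the described formula, with hypothetical names, not the PR's actual code:

```python
import numpy as np

# Fractional pixel count: total overlap area between a polygon and the grid,
# divided by the (full) pixel area, per the description in this thread.
def covered_pixel_count(overlap_areas, pixel_area):
    """Return the number of grid pixels a polygon covers, as a float."""
    return np.asarray(overlap_areas, float).sum() / pixel_area

# Two fully covered pixels plus one half-covered pixel of unit area:
print(covered_pixel_count([1.0, 1.0, 0.5], pixel_area=1.0))  # -> 2.5
```

Using the maximum pixel size as the divisor keeps the count meaningful even when edge pixels are only partially inside the polygon.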

masawdah commented 1 year ago

Hi @ks905383 , I hope you are doing well. I understand that everyone's schedule can be quite busy, and I appreciate your time. If you've had a chance to review our changes to the PR, we'd be happy to hear your thoughts and feedback.

Thanks :)