tobac-project / tobac

Tracking and object-based analysis of clouds
BSD 3-Clause "New" or "Revised" License

v1.x and v2.x harmony and the transition to xarray #143

Open freemansw1 opened 2 years ago

freemansw1 commented 2 years ago

I'm opening this as an overarching issue for resolving the differences between v1.x and v2.x and allowing codebases to work with either without modification. I encourage discussion of the current incompatibilities and any ideas on how to link things better. Below is a checklist of everything currently on my mind that causes incompatibility, with the intent of checking items off as we resolve them:

freemansw1 commented 2 years ago

I should note that I'm not a regular v2.x user, so for anyone who is, please report other incompatibility issues here. The ultimate goal is to have code work either way, and to provide a transition path even if we ultimately decide to change the structure of v2.x.

fsenf commented 2 years ago

@freemansw1 Thanks for bringing up this discussion point. We might need to decide which methods should run in both versions and which should not.

For one of our tutorial notebooks, I started to test a relative import header like this:


# relative imports
try:  # version 2.0 structure
    from tobac.themes.tobac_v1 import feature_detection_multithreshold, segmentation, linking_trackpy
    from tobac.plot import map_tracks, animation_mask_field, plot_lifetime_histogram_bar
    print('...imported tobac in version 2.x structure')
except ImportError:  # version 1.3 structure
    from tobac import feature_detection_multithreshold, segmentation, linking_trackpy
    from tobac import map_tracks, animation_mask_field, plot_lifetime_histogram_bar
    print('...imported tobac in version 1.x structure')

from tobac.utils import get_spacings

see here: https://github.com/fsenf/tobac-tutorials/blob/version_compatibility/tutorials/Example_OLR_Tracking_model/Example_OLR_Tracking_model.ipynb

It is not working properly yet ... more thought is needed, but I guess it will take much more than adjusting the import statements anyway :-(

JuliaKukulies commented 2 years ago

The issue raised in #145 is related to this and should be taken into account, I guess?

JuliaKukulies commented 2 years ago

@fsenf Yes, this only addresses the first point but could definitely be a good start to make the tutorials work for both versions. Your code works fine for me using either v1.3 or v2.0-dev. Hm, the ImportError you get looks a bit like your tobac/__init__.py file is from version 1, but you somehow have the version 2 structure (which does not contain tobac.analysis.cog_cell()) in the directory of your tobac installation. Have you checked that you did not accidentally mix up elements of the two versions in your tobac directory?

fsenf commented 2 years ago

Hmm, you are right! Again, it is the difference between pip install . and pip install -e . that matters (a regular pip install . places a copy in site-packages that does not reflect later edits to the source tree, while pip install -e . links back to the working directory) ...

Now, it fails here:

# Determine temporal and spatial sampling:

In [10] : dxy,dt = get_spacings(OLR)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [10], in <cell line: 2>()
      1 # Determine temporal and spatial sampling:
----> 2 dxy,dt = get_spacings(OLR)

File ~/TROPOS/proj/2022-06-tobac-devel/tobac/tobac/utils.py:480, in get_spacings(field_in, grid_spacing, time_spacing)
    476 from copy import deepcopy
    478 # set horizontal grid spacing of input data
    479 # If cartesian x and y corrdinates are present, use these to determine dxy (vertical grid spacing used to transfer pixel distances to real distances):
--> 480 coord_names = [coord.name() for coord in field_in.coords()]
    482 if (
    483     "projection_x_coordinate" in coord_names
    484     and "projection_y_coordinate" in coord_names
    485 ) and (grid_spacing is None):
    486     x_coord = deepcopy(field_in.coord("projection_x_coordinate"))

TypeError: 'DataArrayCoordinates' object is not callable

which is somehow expected because version 1.3 expects Iris cubes, right?

An option for the tutorial notebook would be that the user selects version = 1.x or version = 2.x at the beginning (by hand), and then all critical parts go through an if/else decision, as sketched below.
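For illustration, a hypothetical sketch of such a manual switch (variable and file names are placeholders, not taken from the actual tutorial):

import xarray as xr

tobac_version = "1.x"  # chosen by the user at the top of the notebook: "1.x" or "2.x"

olr_xr = xr.open_dataset("olr_file.nc")["olr"]  # placeholder input file and variable

if tobac_version == "1.x":
    olr = olr_xr.to_iris()  # v1.x functions expect iris cubes
else:
    olr = olr_xr  # v2.x functions can work on xarray objects directly

# downstream calls such as get_spacings(olr) then receive the matching data model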

@freemansw1 & @JuliaKukulies : What do you think? Would it be an option for a rather painless update for version compatibility?

freemansw1 commented 2 years ago

I am more inclined to fix the incompatibilities on the tobac side than on the tutorial notebook side, personally. I think the best way to resolve that issue could be to start wrapping functions in the @xarray_to_iris decorators (or something similar) from v2.x, so that if xarray data is passed, it is automatically converted to iris (and if iris data is passed, it is left alone).
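For illustration, a minimal sketch of the idea (assuming xarray's built-in to_iris() converter; this is not the actual v2.x implementation, which also handles outputs and further data models):

import functools

import xarray as xr


def xarray_to_iris(func):
    """Sketch of a decorator that lets an iris-based function accept xarray input."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # convert any xarray.DataArray arguments to iris cubes; leave iris cubes untouched
        args = tuple(a.to_iris() if isinstance(a, xr.DataArray) else a for a in args)
        kwargs = {
            k: (v.to_iris() if isinstance(v, xr.DataArray) else v)
            for k, v in kwargs.items()
        }
        return func(*args, **kwargs)

    return wrapper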

freemansw1 commented 2 years ago

I've broken up the xarray/iris/pandas checkboxes into input and output above.

JuliaKukulies commented 2 years ago

I am more inclined to fix the incompatibilities on the tobac side than on the tutorial notebook side, personally. I think the best way to resolve that issue could be to start wrapping functions in the @xarray_to_iris decorators (or something similar) from v2.x, so that if xarray data is passed, it is automatically converted to iris (and if iris data is passed, it is left alone).

I agree with @freemansw1, fixing the incompatibilities on the tobac side may be smoother and a bit less "just fixing the symptoms". This would keep the code in the example notebooks much cleaner and help users use old code or write new code that works in either of the two versions. I think the suggestion above is good. Just remind me: was the plan in the long run to completely move to xarray, meaning removing the iris dependency and rewriting the code in the old functions that are based on iris?

which is somehow expected because version 1.3 expects Iris cubes, right?

Yes, exactly. So that is maybe one good example that shows that it would be smoother to solve this within tobac rather than only in the tutorial, because we would not have to load the data in two different ways in the tutorials.

I've broken up the xarray/iris/pandas checkboxes into input and output above.

That makes sense, thanks!

freemansw1 commented 2 years ago

Was the plan in the long run to completely move to xarray, meaning removing the iris dependency and rewriting the code in the old functions that are based on iris?

I believe that is the plan, based on the roadmap document and previous discussions. Iris isn't well supported in the current ecosystem and can be pretty inflexible. Xarray also has better dask support, allowing for better use with large datasets.
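For example (file name and chunk sizes here are hypothetical), xarray can load data lazily with dask by specifying chunks, which is what makes larger-than-memory tracking workflows feasible:

import xarray as xr

# open lazily; the variable becomes a dask array split into 24-time-step chunks
olr = xr.open_dataset("olr_large.nc", chunks={"time": 24})["olr"]
print(olr.chunks)  # data is only read from disk when a computation needs it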

JuliaKukulies commented 2 years ago

Agreed with that. But yes, working with the decorators first as you suggested is probably a good solution, as it may take a longer time to completely migrate the code inside the functions. And we probably would like to keep the one that converts an iris cube to an xarray.DataArray.

snilsn commented 2 years ago

If everybody is ok with that I would like to work on a PR that copies the decorators from v2 to v1.x and adds some tests.

Since we talked about this in the last meeting I played around with them for a while and they seem to be working just fine. I wouldn't decorate any function in the code yet.

freemansw1 commented 2 years ago

@snilsn I think we would all be interested in that pull request.

As we make this transition, we will have to be careful with user communication. Not all xarray Datasets are Iris-compatible. I think porting over the decorators is a great first step, though.
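As an illustration of that caveat, one way to surface such problems early could be a round-trip check along these lines (hypothetical helper, not part of tobac):

import xarray as xr


def check_iris_compatible(da: xr.DataArray) -> bool:
    """Return True if the DataArray survives an xarray -> iris -> xarray round trip."""
    try:
        cube = da.to_iris()
        xr.DataArray.from_iris(cube)
    except Exception as err:
        print(f"DataArray is not iris-compatible: {err}")
        return False
    return True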

fsenf commented 2 years ago

Full support also from my side. And yes, some fields need to be treated with special care when the internal transformation from xarray to Iris is done.

freemansw1 commented 1 year ago

With #191 merged, we've now addressed another item on this list. I suspect that will be the only item fixed for v1.5. I think a good goal for v1.6 would be to switch much of the internal functionality over to xarray such that the xarray goal can be addressed as well.

fsenf commented 1 year ago

Minimally invasive workflow for xarray transition

Inspired by the interesting discussions we had over the last couple of days, I have been re-thinking how we could reach the goal of transitioning to xarray as fast as possible.

My proposal would be a test-driven development approach that sticks to the structure, functions, and call schemes we already have, and does the code refactoring a bit later.

Unit testing

I propose the following test-driven workflow:

  1. select a function you would like to work on and communicate this, e.g. say that you would like to work on def feature_detection_multithreshold_timestep(...) in tobac/feature_detection.py

  2. add data_type testing in the unit test for this function, e.g. in tobac/tests/test_feature_detection.py

    
    @pytest.mark.parametrize(
        "test_threshs, n_min_threshold, dxy, wavelength_filtering, data_type",
        [
            ([1.5], 2, -1, None, "iris"),
            ([1.5], 2, -1, None, "xarray"),
            ([1, 1.5, 2], 2, 10000, (100 * 1000, 500 * 1000), "iris"),
            ([1, 2, 1.5], [3, 1, 2], -1, None, "iris"),
            ([1, 1.5, 2], {1.5: 2, 1: 3, 2: 1}, -1, None, "iris"),
        ],
    )
    def test_feature_detection_multithreshold_timestep(
        test_threshs,
        n_min_threshold,
        dxy,
        wavelength_filtering,
        data_type,
    ):
    
    ...
        test_data_iris = tbtest.make_dataset_from_arr(test_data, data_type)
    ...
    
  3. Decorate the function you would like to work on

    --- a/tobac/feature_detection.py
    +++ b/tobac/feature_detection.py
    @@ -22,6 +22,14 @@ import numpy as np
     import pandas as pd
     from .utils import internal as internal_utils
     from tobac.utils.general import spectral_filtering
    +from tobac.utils.internal import (
    +    xarray_to_iris,
    +    iris_to_xarray,
    +    xarray_to_irispandas,
    +    irispandas_to_xarray,
    +)
    +
    +
     import warnings
    
    @@ -522,7 +530,7 @@ def feature_detection_threshold(
    
         return features_threshold, regions
    
    -
    +@xarray_to_iris
     def feature_detection_multithreshold_timestep(
    ...
    
  4. invert the decorator to @iris_to_xarray, update the function, and run the tests frequently with python -m pytest until everything is fine again (a minimal sketch of such a decorator follows below)
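A minimal sketch of the inverted decorator mentioned in step 4 (assuming xarray's built-in from_iris() converter; the actual tobac.utils.internal implementation may differ):

import functools

import iris.cube
import xarray as xr


def iris_to_xarray(func):
    """Sketch of a decorator that lets an xarray-based function accept iris input."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # convert iris cube arguments to xarray.DataArray; pass everything else through
        args = tuple(
            xr.DataArray.from_iris(a) if isinstance(a, iris.cube.Cube) else a
            for a in args
        )
        kwargs = {
            k: (xr.DataArray.from_iris(v) if isinstance(v, iris.cube.Cube) else v)
            for k, v in kwargs.items()
        }
        return func(*args, **kwargs)

    return wrapper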

Real data testing

We need more real data testing opportunities for the xarray transition.

Ideas

@freemansw1 @JuliaKukulies @w-k-jones @kelcyno @deeplycloudy: What do you think about this approach, and how can we expand our real data testing options?

fsenf commented 1 year ago

I created a branch exp_iris2xarray (branched from main & merged RC_v1.5) where you can see the details in action.

fsenf commented 1 year ago

BTW: I did not change protection and review rules for this experimental branch. Happy if somebody can point me to the place where this can be done ...

w-k-jones commented 1 year ago

I like this plan. I think the first step should be to make a set of really robust (and well tested!) conversion routines between xarray and iris/pandas that go beyond the existing xarray routines and ensure consistency in e.g. dimension names and handle the proleptic Gregorian time weirdness. After that we should be able to test consistently for iris vs xarray input, then change the internals and ensure the results are the same.
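For illustration, a sketch of the kind of dimension-name normalisation such conversion routines could include (the alias mapping and helper name are hypothetical, not tobac's actual code):

import xarray as xr

# hypothetical mapping of common dimension-name variants to canonical names
_DIM_ALIASES = {
    "t": "time",
    "lat": "latitude",
    "lats": "latitude",
    "lon": "longitude",
    "lons": "longitude",
}


def normalise_dim_names(da: xr.DataArray) -> xr.DataArray:
    """Rename known dimension-name variants to canonical names; leave others untouched."""
    renames = {dim: _DIM_ALIASES[dim] for dim in da.dims if dim in _DIM_ALIASES}
    return da.rename(renames)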

Also we definitely need more real data for larger scale tests. Will also come in useful for the new analysis cookbooks!

kelcyno commented 1 year ago

I also like this plan and I agree with Will's note above. For the larger scale tests and more real data, I have several gridded radar cases that are anywhere from 4-24 hrs, depending on how big we might need going forward.

-Kelcy


JuliaKukulies commented 1 year ago

BTW: I did not change protection and review rules for this experimental branch. Happy if somebody can point me to the place where this can be done ...

The protection/review rules for a particular branch can be changed here: https://github.com/tobac-project/tobac/settings/branches

JuliaKukulies commented 1 year ago

Thank you for the well-developed and detailed plan on how to proceed on this issue @fsenf! I agree with @w-k-jones, and I guess to develop more advanced conversion routines we will need real data to learn about all the naming conventions and time oddities, right? Can we collect those somewhere? One issue with the dimension names was discussed here, for example.

JuliaKukulies commented 1 year ago

I also like this plan and I agree with Will's note above. For the larger scale tests and more real data, I have several gridded radar cases that are anywhere from 4-24 hrs, depending on how big we might need going forward. -Kelcy

How big is your data, @kelcyno? Will you be able to upload this data to our zenodo repository?

I think in addition to the specific data sets each of us is working with, it would also be nice to add more test cases with commonly used, openly accessible datasets (e.g. ERA5, more satellite obs?).

freemansw1 commented 1 year ago

@fsenf I had a thought this weekend that I described briefly in #337. I think another step we can take in concert with the test-driven development approach that you propose here is to add type hints as we start this transition... I found that it really helped me understand where xarray and iris were used, and it helps us with our goal on #75 as well. What do you think?
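For illustration, a hypothetical type-hinted stub (not the actual tobac signature) showing how hints document which data models a function accepts during the transition:

from typing import Optional, Tuple, Union

import iris.cube
import xarray as xr


def get_spacings_hinted(
    field_in: Union[iris.cube.Cube, xr.DataArray],
    grid_spacing: Optional[float] = None,
    time_spacing: Optional[float] = None,
) -> Tuple[float, float]:
    """Stub: the hints make explicit that both iris cubes and DataArrays are accepted."""
    ...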

fsenf commented 1 year ago

@freemansw1: Thank you for the suggestion! I actually have no experience with type hints and therefore cannot make a meaningful contribution to the discussion. However, I am not against it, but we should not break backward compatibility.

freemansw1 commented 1 year ago

@freemansw1: Thank you for the suggestion! I actually have no experience with type hints and therefore cannot make a meaningful contribution to the discussion. However, I am not against it, but we should not break backward compatibility.

Agreed. I don't think adding type hints should break anything, especially after I made the change to #337. We had to switch to Python >=3.7 first, but we have now done this.

All part of an integrated strategy :)