podaac / l2ss-py

Level 2 subsetter with Harmony integration
https://podaac.github.io/l2ss-py/
Apache License 2.0
11 stars 11 forks source link

Feature/issue-142: duplicated dimension error with TEMPO ozone profile #141

Closed danielfromearth closed 1 year ago

danielfromearth commented 1 year ago

Github Issue: #142

Description

Modified the remove_duplicate_dims() function so that the subsetter pipeline does not fail when encountering the duplicated "layer" dimension in TEMPO ozone profile data (variables: support_data/ozone_averaging_kernel and support_data/ozone_noise_correlation_matrix)

Overview of work done

Added checks in remove_duplicate_dims() for whether the Dimension and Variable corresponding to the duplicated dimension already exist in the NetCDF. The function directly writes the new variable with no duplicated dimensions rather than keeping it with an altered name that needs to be renamed later in the subsetting procedure.

Overview of verification done

Added a new unit test for TEMPO ozone profile data. Checked that all automated tests passed successfully.

Overview of integration done

Explain how this change was integration tested. Provide screenshots or logs if appropriate. An example of this would be a local Harmony deployment.

PR checklist:

See Pull Request Review Checklist for pointers on reviewing this pull request

danielfromearth commented 1 year ago

@nlenssen2013, @frankinspace, I've made some edits to the remove_duplicate_dims() function in this branch in order to allow the subsetter to work with TEMPO Ozone proxy data files, which have duplicated dimensions in two of the variables. This change seems to be working well without the need to hold off variable renaming (with rename_dup_vars() using xarray) later in the subset pipeline. I'm a bit unsure whether this will work for all cases though. Are there additional files we can do regression tests with, besides the ones used by the automated tests in the tests/data directory?

nlenssen2013 commented 1 year ago

@nlenssen2013, @frankinspace, I've made some edits to the remove_duplicate_dims() function in this branch in order to allow the subsetter to work with TEMPO Ozone proxy data files, which have duplicated dimensions in two of the variables. This change seems to be working well without the need to hold off variable renaming (with rename_dup_vars() using xarray) later in the subset pipeline. I'm a bit unsure whether this will work for all cases though. Are there additional files we can do regression tests with, besides the ones used by the automated tests in the tests/data directory?

To my knowledge, GESDISC only has two collections with this duplicate dimension issue in SNDR and Tropomi. Both of those files get tested in unit test.

danielfromearth commented 1 year ago

Hey all, just checking back on this... @nlenssen2013, @jamesfwood, are you still planning to do additional checks with GESDISC or other data products before we move this from the fork into the base repository? (or perhaps you've already done additional checks?)

nlenssen2013 commented 1 year ago

Hey all, just checking back on this... @nlenssen2013, @jamesfwood, are you still planning to do additional checks with GESDISC or other data products before we move this from the fork into the base repository? (or perhaps you've already done additional checks?)

Yeah, it checks out with our datasets fine, the second function is still needed for one of our collections so the merged branch into develop is solid

danielfromearth commented 1 year ago

Hey all, just checking back on this... @nlenssen2013, @jamesfwood, are you still planning to do additional checks with GESDISC or other data products before we move this from the fork into the base repository? (or perhaps you've already done additional checks?)

Yeah, it checks out with our datasets fine, the second function is still needed for one of our collections so the merged branch into develop is solid

@nlenssen2013, you mean rename_dup_vars() is still needed? Just double-checking with you, since the latest version of this branch no longer uses that function at all (actually, I was thinking the function could now be deleted too) — is that okay?

If it's okay, I'll just go ahead and squash+merge this. If not, we might need to make a few more tweaks to use that rename_dup_vars() function again.

nlenssen2013 commented 1 year ago

Hey all, just checking back on this... @nlenssen2013, @jamesfwood, are you still planning to do additional checks with GESDISC or other data products before we move this from the fork into the base repository? (or perhaps you've already done additional checks?)

Yeah, it checks out with our datasets fine, the second function is still needed for one of our collections so the merged branch into develop is solid

@nlenssen2013, you mean rename_dup_vars() is still needed? Just double-checking with you, since the latest version of this branch no longer uses that function at all (actually, I was thinking the function could now be deleted too) — is that okay?

If it's okay, I'll just go ahead and squash+merge this. If not, we might need to make a few more tweaks to use that rename_dup_vars() function again.

I'm not getting all the pytest tests/ to pass without it currently, not sure if you are. Not a problem for me if it gets merged though - was able to run an individual subset on that collection and it looks good.

danielfromearth commented 1 year ago

@nlenssen2013. hmmm, I just reran using pytest, and all of the tests in tests/ passed on my laptop from this branch (danielfromearth:feature/currently-no-issue-duplicated-layer-dim). If they aren't passing for you, are you definitely using the same branch and pulled the latest?

Perhaps we merge this into the podaac:feature branch, and then let the automated test suite — run via GitHub actions — serve as the more authoritative check before merging it into develop.

nlenssen2013 commented 1 year ago

@nlenssen2013. hmmm, I just reran using pytest, and all of the tests in tests/ passed on my laptop from this branch (danielfromearth:feature/currently-no-issue-duplicated-layer-dim). If they aren't passing for you, are you definitely using the same branch and pulled the latest?

Perhaps we merge this into the podaac:feature branch, and then let the automated test suite — run via GitHub actions — serve as the more authoritative check before merging it into develop.

Yep, I'm fine with that