hainegroup / oceanspy

A Python package to facilitate ocean model data analysis and visualization.
https://oceanspy.readthedocs.io
MIT License

Compute Integral of Monthly-sampled Survey Stations #321

Closed jbonil11 closed 1 year ago

jbonil11 commented 1 year ago

Description

I am trying to integrate the V velocity with respect to Z using a survey-station dataset with a time frequency of "1M" (so there should be 12 snapshots).

When using od_surv.compute.integral(varNameList="V", axesList=["Z"]), the computed I(V)dZ does not include the time dimension; instead there is only a single array with the "station" dimension. I expected an array with both "time" and "station" dimensions, since I had 12 different time snapshots. However, if I change the time frequency to "2D" when extracting the survey stations, compute.integral works as expected and the resulting I(V)dZ includes the "time" dimension.

Also, the same thing happens when using EGshelfIIseas2km_ASR_full instead of IGP.

What I Did

import oceanspy as ospy
od = ospy.open_oceandataset.from_catalog("IGPyearlong",'/home/idies/workspace/Storage/jonjonmanuel/persistent/catalog_xarray.yaml')

# Kjetil's information
lats_Kjetil = [71.15,71.15]
lons_Kjetil = [-21.7,-16]
depth_Kjetil = [0, -2000]

# Select time range:
# one year of output, extracting one snapshot per month ("1M", set below)
timeRange = ["2017-09-01T00", "2018-08-31T18"]

timeFreq = "1M"
# Spacing between interpolated stations
delta_Kjetil = 5  # km

# Extract survey stations
# Reduce dataset to speed things up:
od_surv = od.subsample.survey_stations(
    Xsurv=lons_Kjetil,
    Ysurv=lats_Kjetil,
    delta=delta_Kjetil,
    ZRange=depth_Kjetil,
    timeRange=timeRange,
    timeFreq=timeFreq,
    varList=["Temp","S","U","V","drC", "drF","HFacC","HFacW","HFacS"],
)

od_surv = od_surv.compute.potential_density_anomaly()
# Integrate along Z
od_surv = od_surv.compute.integral(varNameList="V", axesList=["Z"])
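
A quick way to see the problem is to inspect the dimensions of the integrated variable (the variable name I(V)dZ follows the description above; the comment shows expected vs. observed):

# Expected dims: ("time", "station"); observed with timeFreq="1M": ("station",)
print(od_surv.dataset["I(V)dZ"].dims)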


jbonil11 commented 1 year ago

It also doesn't work if you choose the time frequency to be "1W".

Mikejmnez commented 1 year ago

Thank you Joan @jbonil11 for raising the issue. I am already working on some oceanspy stuff, so I will look at this today and hopefully figure it out.

ThomasHaine commented 1 year ago

@jbonil11 has this been resolved now? I think you said in the group meeting that it was a coding bug?

jbonil11 commented 1 year ago

@ThomasHaine

Yes, I believe it is a bug. When we ran the Kogur notebook and extracted the stations using a "1M" time frequency, the compute.integral function would eliminate the time dimension from the output. I just did a quick check and it is still happening. I would like to point out that this only happens when I do not include sampMethod='mean' in the extraction code, i.e. when extracting snapshots of the model.

However, when I extract the survey stations using sampMethod='mean', compute.integral works perfectly fine even with a "1M" time frequency. Hope this helps.

ThomasHaine commented 1 year ago

@jbonil11 Please send me the yaml file for the IGPyearlong dataset you use here. I can't read your directory on SciServer.

ThomasHaine commented 1 year ago

@jbonil11 I've run your test code above (slightly modified to fix a couple of errors). I used EGshelfIIseas2km_ASR_full on SciServer because I don't have access to IGPyearlong. Specifically:

import oceanspy as ospy
!pip list | grep oceanspy

od = ospy.open_oceandataset.from_catalog("EGshelfIIseas2km_ASR_full")

# Kjetil's information
lats_Kjetil = [71.15,71.15]
lons_Kjetil = [-21.7,-16]
depth_Kjetil = [0, -2000]

# Select time range:
timeRange = ["2007-09-01T00", "2008-08-31T18"]

timeFreq = "1M"     # This works as expected, yielding a dataset with a time dimension upon integration.
# Spacing between interpolated stations
delta_Kjetil = 5  # km

# Extract survey stations
# Reduce dataset to speed things up:
od_surv = od.subsample.survey_stations(
    Xsurv=lons_Kjetil,
    Ysurv=lats_Kjetil,
    delta=delta_Kjetil,
    ZRange=depth_Kjetil,
    timeRange=timeRange,
    timeFreq=timeFreq,
    varList=["Temp","S","U","V","drC", "drF","HFacC","HFacW","HFacS"],
)

od_surv = od_surv.compute.potential_density_anomaly()

# Integrate along Z
od_surv = od_surv.compute.integral(varNameList="V", axesList=["Z"])
od_surv.dataset

This doesn't give the error. I.e., the integral has a time dimension, as expected. Please take a look and let me know if I've missed the point (and if so, how to reproduce the error).

Mikejmnez commented 1 year ago

@ThomasHaine The issue is still there.

Below are two minimal examples: one that reproduces the issue and, following @jbonil11, another that doesn't. Both use Kogur.ipynb as a template:

import oceanspy as ospy

# ==========
# Start client perhaps not necessary
from dask.distributed import Client
client = Client()
# ==========

od = ospy.open_oceandataset.from_catalog("EGshelfIIseas2km_ASR_full")

# define survey/cutout properties:

lats_Kjetil = [71.15,71.15]
lons_Kjetil = [-21.7,-16]
depth_Kjetil = [0, -2000]

timeRange = ["2007-09-01", "2008-08-31T18"]
timeFreq = "1M"
delta_Kjetil = 5  # km

# 1) First survey - defines explicitly `sampMethod="mean"`:

od_surv_samp_method = od.subsample.survey_stations(
    Xsurv=lons_Kjetil,
    Ysurv=lats_Kjetil,
    delta=delta_Kjetil,
    ZRange=depth_Kjetil,
    timeRange=timeRange,
    timeFreq=timeFreq,
    varList=["U","V","drC", "drF","HFacC","HFacW","HFacS"],
    sampMethod='mean',
)

# 2) Another survey, now without explicitly defining `sampMethod`.
# Instead, it uses the default `sampMethod="snapshot"`:

od_surv = od.subsample.survey_stations(
    Xsurv=lons_Kjetil,
    Ysurv=lats_Kjetil,
    delta=delta_Kjetil,
    ZRange=depth_Kjetil,
    timeRange=timeRange,
    timeFreq=timeFreq,
    varList=["U","V","drC", "drF","HFacC","HFacW","HFacS"],
)

# compute vertical integrals of the velocity V (downsampled):

od_surv_samp_method = od_surv_samp_method.compute.integral(varNameList="V", axesList=["Z"])

od_surv = od_surv.compute.integral(varNameList="V", axesList=["Z"])

The following screenshot shows a comparison between the dimensions of the variable V before and after being integrated in Z, for the two OceanDatasets.

[Screenshot, 2023-04-12: dimensions of V and I(V)dZ for od_surv_samp_method vs. od_surv]

As @jbonil11 mentions, in the case of od_surv, vertically integrating V results in a new variable I(V)dZ that does not have time as a dimension.

In turn, explicitly passing sampMethod="mean" when carrying out the original survey (od_surv_samp_method) avoids the issue.
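
For reference, the comparison in the screenshot amounts to a check like this (a sketch; the dims in the comments are those reported in this thread):

print(od_surv_samp_method.dataset["I(V)dZ"].dims)  # time is preserved
print(od_surv.dataset["I(V)dZ"].dims)              # ("station",) -- time is lost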

Quick fix:

This seems to me like an issue of using inconsistent arguments when carrying out the survey (or cutout).

That is, the default is:

sampMethod='snapshot'

So when downsampling to timeFreq='1M', or any other timeFreq that leads to downsampling, it is more consistent to explicitly define and pass:

sampMethod='mean'

I do think this is a bug though, since as @jbonil11 mentioned, using

timeFreq="1W" 

yields the same behavior, but defining

timeFreq="3D"

does not (time remains a dimension of the vertically integrated V, despite using the default sampMethod). The Kogur.ipynb notebook has timeFreq="3D" and uses the default sampMethod, which is why the notebook runs well...

Mikejmnez commented 1 year ago

Just to add to the discussion. I did a quick experiment and defined:

timeFreq_days = "7D"
timeFreq_week = "1W"

When using timeFreq_days there is no problem, but with timeFreq_week there is... The same thing happens with timeFreq="30D" versus timeFreq="1M".

It seems that when the frequency is defined in days ("D") there is no problem, but when using units of "W" (weeks) or "M" (months), there is an issue.
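
A minimal pandas-only sketch of why this might happen (this mirrors, but is not, oceanspy's snapshot selection; a hypothetical 6-hourly time axis is resampled and the first time in each bin is kept):

import numpy as np
import pandas as pd

# Hypothetical 6-hourly model times over one year (mirroring timeRange above)
idx = pd.date_range("2007-09-01", "2008-08-31T18", freq="6H")
model_times = pd.Series(idx, index=idx)

for freq in ["3D", "7D", "1W", "1M"]:
    # Keep the first model time in each resampling bin
    snapshots = model_times.resample(freq).first()
    intervals = np.unique(np.diff(snapshots.values))
    print(freq, "->", len(intervals), "unique interval(s)")

# "D" frequencies count bins from the start of the series, so the selected
# snapshots are evenly spaced; "W" and "M" anchor bins to calendar boundaries
# (Sundays, month ends), so for this start date the spacing is uneven.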

ThomasHaine commented 1 year ago

Interesting. I reproduced your results. It sounds like there's some subtle bug in the pandas code that parses timeFreq.

MaceKuailv commented 1 year ago

The root cause is that the grid variable drF is broadcast with a time dimension during cutout.

[Screenshot: drF carrying a spurious time dimension after cutout]
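
A quick way to confirm this on the surveys from the examples above (a sketch of the check; drF is the vertical grid spacing, so it should not depend on time):

# If the bug triggers, the grid variable gains a spurious "time" dimension:
print("time" in od_surv.dataset["drF"].dims)  # True when the bug is present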

MaceKuailv commented 1 year ago

[Screenshot: the cutout branch for sampMethod="mean"]

This is the operation for when sampMethod is "mean": it explicitly separates the time-dependent variables from those that do not depend on time, which explains why there is no error when sampMethod is "mean".

[Screenshot: the cutout branch for sampMethod="snapshot"]

Even when sampMethod is "snapshot", there is still a chance that the intervals are all the same (falling into the if branch). However, if it falls into the else branch, everything, including grid variables, gets concatenated.

This is not an issue with compute or survey, but with subsample.cutout. If I weren't facing the environment issue I have now, I would have just opened a PR...

Mikejmnez commented 1 year ago

Very nicely done @MaceKuailv. Is it true that this only happens with drF and no other grid coordinate (e.g. dxC)? Or does drF just happen to be picked up before all the other grid coordinates, so that if drF were absent from the dataset, the error (concatenation) would happen with some other grid coordinate?

MaceKuailv commented 1 year ago

My understanding is that every variable ends up with a time dimension. Within the xr.concat statement, ds.sel(time=time) results in a dataset without a time dimension, which means the time-dependent variables are indistinguishable from the time-independent ones.
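
A self-contained toy example of that mechanism (hypothetical dataset, not oceanspy's actual cutout code):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "V": (("time", "Z"), np.random.rand(4, 3)),   # time-dependent
        "drF": ("Z", np.array([10.0, 20.0, 30.0])),   # time-independent grid spacing
    },
    coords={"time": np.arange(4), "Z": np.arange(3)},
)

# Selecting a single time drops the time dimension from everything, so the
# pieces carry no record of which variables depended on time. Concatenating
# them back along "time" therefore re-attaches time to *every* variable:
out = xr.concat([ds.sel(time=t) for t in ds["time"].values[::2]], dim="time")
print(out["drF"].dims)  # ('time', 'Z') -- the grid variable now carries time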

malmans2 commented 1 year ago

Hey! A couple of tips:

  1. It looks like looping and concatenating is not really needed. You can probably just do something like ds = ds.sel(time=newtime) (see the sketch after this list).
  2. xarray added the argument keep_attrs to many functions; it definitely was not available when we started working on OceanSpy. To preserve attributes, you can also use a context manager instead. For example:

    with xr.set_options(keep_attrs=True):
        ds = ds * 1
        ...
    # Here all attributes should be preserved
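
For example, tip 1 on a toy dataset (a hypothetical sketch mirroring the toy example in the previous comment, not OceanSpy code):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"V": (("time", "Z"), np.random.rand(4, 3)), "drF": ("Z", np.ones(3))},
    coords={"time": np.arange(4), "Z": np.arange(3)},
)
newtime = ds["time"].values[::2]

# One vectorized selection instead of a loop + concat:
out = ds.sel(time=newtime)
print(out["V"].dims)    # ('time', 'Z')  -- still time-dependent
print(out["drF"].dims)  # ('Z',)         -- no spurious time dimension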

(I'm not really following the development though, so feel free to ignore if they don't make sense!)

MaceKuailv commented 1 year ago

This is correct. I used isel for both of the different cases and didn't concat, which automatically preserves the attributes, so this is not really an issue for these changes.

I agree it would be very useful to use the keep_attrs option, and we do have a lot of code working on the attrs. I guess we just have to simplify it when time allows.

MaceKuailv commented 1 year ago

closed with PR #348