spedas / pyspedas

Python-based Space Physics Environment Data Analysis Software
https://pyspedas.readthedocs.io/
MIT License
147 stars 58 forks source link

Data Quality Issue Arises When Using tinterpol #966

Open bpwrs opened 1 month ago

bpwrs commented 1 month ago

Following mostly the example mms-examples/advanced/Plasma calculations with PlasmaPy.ipynb, when applying tinterpol to electron temperture it sometimes created negative values.

I was able to track the issue to a specific FPI data quality flag bit 7 (low electron number density). This data flag is used when (after subtracting the spacecraft photoelectron signal) the ambient number density is extremely low (i.e. nonphysical). This low density, when used to produce the electron temperature produces very low temperature values. Than applying tinterpol to these very low tempertures can produce negative (very near zero) values. These negative values crash plasmapy.formulary.dimensionless.beta in calculating the plasma beta.

Easiest way to fix this would be include a check in tinterpol that takes the absolute value of any negative (very near zero) values. Could also make them exactly zero but I am unsure as to if plasmapy.formulary.dimensionless.beta can handle zero. Also should probably include some type of warning if this is preformed.

jameswilburlewis commented 1 month ago

Thanks for letting us know! Is there a time range I can use with that notebook to reproduce the issue?

Are you saying that tinterpol is returning negative values, even when the inputs are all non-negative? I suppose this could happen even with linear interpolation, if data is being extrapolated outside the time range of the input data.

We are currently working on merging several of the pyspedas/pytplot interpolation methods into a single routine,. and allowing the user to specify the desired behavior via keyword arguments. It sounds like this example would make a good test case.

bpwrs commented 1 month ago

I am able to recreate the error with MMS FPI data in the range ["2018-07-01/01:00", "2018-07-02/00:10"]

Yes, all inputs are nonnegative (albeit looking at it again now it appears most of the issue data points are zeros) and tinterpol returns a negative value.

jameswilburlewis commented 1 month ago

Hmm, actually I'm seeing some large negative values in the Tpara and Tperp variables. I've changed the time range to this:

trange = ["2018-07-01/01:00", "2018-07-02/00:10"]

Then, after the "Extract data values" step, I added a cell with:

tpy = des_Tpara.y
tperpy = des_Tperp.y
import numpy as np
print(np.min(tperpy))
print(np.min(tpy))

And the result I'm seeing is:

-26065392781.662598 eV
-71296377078.7998 eV

There seem to be about ten(ish) negative values at various indices in both the Tpara and Tperp input variables. So I'm not surprised that there are negative values in the interpolated data.

Is it possible we're looking at different data...are you still working with MMS1 in your version of the notebook?

bpwrs commented 1 month ago

Looking at mine again I am also getting some very large negative numbers and I was looking at fast data not burst. If I look at the noninterpolated temps from burst data I see the following.

t=0
for i in des_tperp.y:
    if i < 0:
        t=t+1

print(t)
print(np.min(des_tperp.y))
23
-26065394000.0
t=0
for i in des_tpara.y:
    if i < 0:
        t=t+1

print(t)
print(np.min(des_tpara.y))
87
-71296380000.0
bpwrs commented 1 month ago

I've managed to recreate my original issue (with number density instead of temperature). With the same time range and the following code

tinterpol('mms1_des_numberdensity_brst', 'mms1_dis_numberdensity_brst')
des_numb=get_data('mms1_des_numberdensity_brst')
des_numb_itrp=get_data('mms1_des_numberdensity_brst-itrp')

t=0
for i in des_numb.y:
    if i < 0:
        print(i)
        t=t+1

print(t) 

t=0
for i in des_numb_itrp.y:
    if i < 0:
        print(i)
        t=t+1

print(t)

and looking at my output

0

-4.336808689942018e-19
-4.336808689942018e-19
-3.469446951953614e-18
-1.734723475976807e-18
-2.168404344971009e-19
-2.168404344971009e-19
-2.168404344971009e-19
-1.734723475976807e-18
-1.734723475976807e-18
-1.0842021724855044e-19
10

Every time it happens the corresponding term in the non-interpolated (des_numb.y) data is a zero.

jameswilburlewis commented 1 month ago

Yes, I can reproduce the issue with the density variables. It's weird...as you say, the input data value is 0, the time it's interpolating to is an exact match between the two variables, yet the interpolated value is ever so slightly negative. The actual interpolation is being done with the xarray.interp() method (which itself seems to be calling scipy.interp1d).

I'll have to step through it a few more times, to see if anything weird is happening to the times or data values before they get handed off to xarray.interp(). The timestamps are being passed as np.datetime64[ns] objects -- maybe xarray is doing some sort of lossy conversion?

Meanwhile, until we find a solution (or at least an explanation) for the weird interpolation results, I think it might be a good idea to implement a sanitization step (perhaps using pytplot.clip() or pyspedas.yclip()) to ensure no negative values are being passed to other libraries that can't handle them. But I don't think this is something that should happen within the interpolation routine itself.

jameswilburlewis commented 1 month ago

I think I might have found the issue! I have the mms-examples repo loaded into a separate virtual environment from my pyspedas development environment. When I tried turning the notebook example into a unit test in my pyspedas environment, I couldn't reproduce the negative values in mms1_des_numberdensity_brst-itrp...the minimum values were 0.0, just as we would have expected.

The mms-examples environment I was using had some older versions of some of the package dependencies....in particular, the one I think was causing the problem was the cdflib package. cdflib had a major version release a while back, but there were some breaking changes, so we kept the cdflib version pinned at < 1.0.0 until the pyspedas and pytplot-mpl-temp code could be updated. Now pyspedas and pytplot-mpl-temp can work with the latest cdflib versions, so the cdflib < 1.0.0 restriction has been removed. Updating pyspedas or pytplot-mpl-temp versions probably would not have automatically updated the cdflib package.

So, I suspect you still might have cdflib 0.4.9 in your environment. More recent versions of cdflib have changed the handling of CDF timestamps to have nanosecond rather than microsecond precision.

Could you make a note of what versions of pyspedas, pytplot-mpl-temp, and cdflib you're running in the environment where you're seeing the interpolation problem? Then, try updating to the latest available versions of all three packages, and see if the issue goes away (and I'm pretty sure it will). If not, let me know, and we'll keep digging.

bpwrs commented 1 month ago

Apologies I was on vacation and off my computer for a few days. My packages are as follows

pyspedas 1.5.17 pytplot-mpl-temp 2.2.39 cdflib 1.3.1

I still updated all packages just to be safe and the issue is still occuring. Adding another wrinkle to the issue, earlier you mentioned that the times are an exact match between the two variables, that is true for the "fast" data. However, that is not true for the "brst" data and the interp issue still arises. The "brst" data times are slightly separated and there are also more of these negative values occuring. Interestingly too is that the negative values are not occuring at the same indices in the 'brst' data as they are in the 'fast' data.

time_range = ["2018-07-01/01:00", "2018-07-02/00:10"]
des_numb=get_data(prefix+'des_numberdensity_brst')
dis_numb=get_data(prefix+'dis_numberdensity_brst')
des_numb_itrp=get_data(prefix+'des_numberdensity_brst-itrp')

for i in range(0,len(des_numb_itrp.y)):
    if des_numb_itrp.y[i] < 0:
        # print(i)
        print(dis_numb.times[i])
        print(des_numb.times[i])
        print('\n')

and

1530450174.872731
1530449423.337674

1530450248.373229
1530449498.068179

1530450272.82339
1530449502.958206

1530450292.9235241
1530449506.978234

...

1530481294.228234
1530450035.221795

1530481304.2782989
1530450037.2318091

1530481454.4292748
1530450067.26201
jameswilburlewis commented 1 month ago

Thanks for the additional info -- guess I'll reopen this an take another look.

jameswilburlewis commented 1 month ago

I can't reproduce the issue. Here's my test case:

    def test_tinterpol_nonnegative(self):
        import pytplot
        trange = ["2018-07-01/01:00", "2018-07-02/00:10"]

        # Test that tinterpol does not introduce negative values when input contains
        # zeroes.  (This was actually an issue with cdflib 0.4.9 and older pytplot-mpl-temp timestamp handling, not tinterpol itself
        pyspedas.mms.fpi(datatype=['dis-moms', 'des-moms'], trange=trange, data_rate='brst', time_clip=True,
                         center_measurement=True)
        tinterpol('mms1_des_numberdensity_brst', 'mms1_dis_numberdensity_brst')
        des_n_before = pytplot.get('mms1_des_numberdensity_brst')
        des_n = pytplot.get('mms1_des_numberdensity_brst-itrp')
        self.assertTrue((des_n_before.y >=  0.0).all)
        self.assertTrue((des_n.y >=  0.0).all)
        prefix='mms1_'
        des_numb = get_data(prefix + 'des_numberdensity_brst')
        dis_numb = get_data(prefix + 'dis_numberdensity_brst')
        des_numb_itrp = get_data(prefix + 'des_numberdensity_brst-itrp')

        for i in range(0, len(des_numb_itrp.y)):
            if des_numb_itrp.y[i] < 0:
                # print(i)
                self.assertTrue(1==0). # If this is reached the test will fail
                print(dis_numb.times[i])
                print(des_numb.times[i])
                print('\n')

This test passes, with no negative values found.

Are we both still looking at MMS1, loaded with the same options?

Here's more info about my environment:

Python: 3.9.18 Platform: Apple Silicon (M2) Mac, Ventura 13.6.9 xarray: 2024.7.0 scipy: 1.11.2 numpy: 1.25.2

bpwrs commented 1 month ago

xarray 2023.6.0 numpy 1.23.2 python 3.10.13 scipy 1.11.3

Funny, the issue disappeared by including the

center_measurement=True

line in the FPI load statement. Can't seem to see any differences that may cause that. The times are still slightly different and I believe that center measurement only affects the time point of the measurements, correct?

jameswilburlewis commented 1 month ago

Thanks for checking! I can reproduce the issue if I set center_measurement to False. I changed the debugging loop to this:

        des_numb = pytplot.get(prefix + 'des_numberdensity_brst')
        dis_numb = pytplot.get(prefix + 'dis_numberdensity_brst')
        des_numb_itrp = pytplot.get(prefix + 'des_numberdensity_brst-itrp')

        for i in range(0, len(des_numb_itrp.y)):
            if des_numb_itrp.y[i] < 0:
                # print(i)
                #self.assertTrue(1==0)
                print("DIS time at negative result:",dis_numb.times[i])
                print("Interpolated DES time/val at negative result:",des_numb_itrp.times[i],des_numb_itrp.y[i])
                print("DIS timess around negative result:",dis_numb.times[i-2:i+3])
                print("Interpolated DES times/vals around negative result:", des_numb_itrp.times[i-2:i+3],des_numb_itrp.y[i-2:i+3])
                input_index = np.searchsorted(des_numb.times,des_numb_itrp.times[i])
                print("Index of interpolated time into input array:", input_index)
                print("Input DES times/vals around negative result:", des_numb.times[input_index-2:input_index+3],
                      des_numb.y[input_index-2:input_index+3])
                print('\n')

and here's the output I'm getting for the first negative value:

DIS time at negative result: 2018-07-01T13:02:54.872731000
Interpolated DES time/val at negative result: 2018-07-01T13:02:54.872731000 -4.336808689942018e-19 1 / cm3
DIS timess around negative result: ['2018-07-01T13:02:54.572731000' '2018-07-01T13:02:54.722731000'
 '2018-07-01T13:02:54.872731000' '2018-07-01T13:02:55.022738000'
 '2018-07-01T13:02:55.172738000']
Interpolated DES times/vals around negative result: ['2018-07-01T13:02:54.572731000' '2018-07-01T13:02:54.722731000'
 '2018-07-01T13:02:54.872731000' '2018-07-01T13:02:55.022738000'
 '2018-07-01T13:02:55.172738000'] [ 4.12966423e-02  1.98426768e-02 -4.33680869e-19  3.69412014e-02
  3.04065924e-02] 1 / cm3
Index of interpolated time into input array: 18393
Input DES times/vals around negative result: ['2018-07-01T13:02:54.812731000' '2018-07-01T13:02:54.842731000'
 '2018-07-01T13:02:54.872731000' '2018-07-01T13:02:54.902731000'
 '2018-07-01T13:02:54.932731000'] [0.02282705 0.00357686 0.         0.01555543 0.03359179] 1 / cm3

Note that I'm using pytplot.get instead of pytplot.get_data here: this preserves the original np.datetime64 time values and the astropy units added in cdf_to_tplot, matching what xarray.interp() is working on internally. I guess either of those factors could be contributing to the unexpected result we're seeing.

Next step will be to construct a standalone example that only uses numpy, xarray, and/or scipy to reproduce the issue, and see where that leads us.

jameswilburlewis commented 4 weeks ago

OK, I'm able to reproduce the issue purely using numpy, xarray, and scipy. Here's a test case I extracted from the MMS data:

    def test_xarray_interp(self):
        import xarray as xr
        import numpy as np

        time_strings_input = np.array(['2018-07-01T13:02:16.892474880',
                                       '2018-07-01T13:02:16.922475008',
                                       '2018-07-01T13:02:16.952474880'])
        values_input = np.array([0.028584518, 0., 0.013626526],dtype=np.float32)
        time_strings_interp_to = np.array(['2018-07-01T13:02:16.922475008'])

        input_times_npdt64 = np.array([np.datetime64(t) for t in time_strings_input])
        interp_to_times_npdt64 = np.array([np.datetime64(t) for t in time_strings_interp_to])

        data_array = xr.DataArray(values_input,dims=['time'],coords={'time':('time',input_times_npdt64)})

        result = data_array.interp({"time": interp_to_times_npdt64},method='linear')
        print(result.values)

Output: [-3.46944695e-18]

A similar example calling scipy.interpolate.interp1d directly (after converting the np.datetime64 objects to float64) gives the same result. (xarray is calling interp1d internally).

If I change the type of values_input from float32 to float64, we get the expected result of 0.

scipy.interpolate.interp is called out in the scipy documentation as a "legacy" routine, so it's unlikely they'll be interested in fixing this issue. I wouldn't hold my breath waiting for xarray to switch to a different interpolator, either.

numpy.interp seems to do the right thing in this specific case, but it only does 1-D piecewise linear interpolation, and doesn't offer any nonlinear or nearest-neighbor options.

The workaround, for now, is basically what I suggested before: sanitize the interpolated data before passing it to anything that expects non-negative, non-nan, or other restricted domains.

We'll be sure to take this case into account as we develop our one-size-fits-all, general-purpose interpolation routine for pyspedas, but it's going to take a while before we get there....

jameswilburlewis commented 1 week ago

Reopening here to track the issue I filed on the xarray repo. They're going to look into using a different scipy interpolation method that doesn't suffer from this issue.

https://github.com/pydata/xarray/issues/9404