pysat / pysatNASA


BUG: GOLD metadata #110

Closed: jklenzing closed this 2 years ago

jklenzing commented 2 years ago

Description

Addresses #97

Several updates are required for the GOLD NMAX dimensions in xarray. These changes are performed after loading, and the metadata for the new coordinates was not being updated.

Fixes the loading of metadata so defaults are not used.

Type of change

How Has This Been Tested?

import pysat
gold = pysat.Instrument('ses14', 'gold', tag='nmax', strict_time_flag=False)
gold.load(2021, 1)
gold.meta.data

Inspect metadata for "Default metadata added" messages in notes.
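For example, a quick way to list the affected variables after loading (a minimal sketch; the filtering logic here is illustrative and not part of this pull):

    # List variables whose notes still carry the default-metadata message.
    notes = gold.meta.labels.notes
    flagged = [var for var in gold.meta.keys()
               if 'Default metadata added' in str(gold.meta[var, notes])]
    print(flagged)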

Test Configuration

Checklist:

jklenzing commented 2 years ago

Putting this up as a draft for discussion. Open questions:

1. Should pysat search xarray dimensions for metadata values? Currently not implemented.
2. Should we be setting dimensional values as coordinates?

aburrell commented 2 years ago

Edited above comment to change the list from itemized to enumerated.

For (1), I think that yes, we should do that. However, it should be optional, as it would require the user to map metadata values onto pysat metadata labels. Alternatively, we could just do it and, if the labels are different, add them as extra labels (see the sketch below).

For (2), the answer is usually yes. Can't think of a case where we shouldn't, but am probably wrong! 🐟
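For (1) above, a rough sketch of the optional route, assuming hypothetical attribute names and a `data`/`meta` pair like the ones discussed further down:

    # Map attribute names found on the xarray dimensions onto pysat
    # metadata labels; anything unmapped is kept as an extra label.
    label_map = {'UNITS': meta.labels.units, 'FILLVAL': meta.labels.fill_val,
                 'CATDESC': meta.labels.desc}
    for dim in data.dims:
        if dim in data.coords:
            meta[dim] = {label_map.get(akey, akey): aval
                         for akey, aval in data[dim].attrs.items()}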

rstoneback commented 2 years ago

(1) For my load_meta_proc branch, all the metadata for variables in the xarray data.variables is copied over.

    # Copy the variable attributes from the data object to the metadata
    for key in data.variables.keys():

Not sure what the fix is but if we are not including some metadata then we should include it.

(2) I think that may end up being an instrument-specific thing. I'm for making the data as good as possible.
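For context on (1), a minimal sketch of what that copy step can look like (this is illustrative, not the actual load_meta_proc code):

    # Copy each xarray variable's attributes into the pysat Meta object.
    for key in data.variables.keys():
        meta[key] = {akey: aval for akey, aval in data[key].attrs.items()}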

aburrell commented 2 years ago

In response to:

(1) For my load_meta_proc branch, all the metadata for variables in the xarray data.variables is copied over.

Should this be its own function? Seems like it would also be useful for the future when we add inst.to_xarray() functionality.

rstoneback commented 2 years ago

In response to:

(1) For my load_meta_proc branch, all the metadata for variables in the xarray data.variables is copied over.

Should this be its own function? Seems like it would also be useful for the future when we add inst.to_xarray() functionality.

In this case the meta is being transferred from xarray to meta. There is a function pysat_meta_to_xarray_attr, which I think you started, that transfers to xarray and should cover what you are thinking. I have modified it a bit in the latest branch. Duties shifted a bit.
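As a rough illustration of that direction (a sketch only, using the meta.data DataFrame shown in the test instructions; the actual function may differ):

    # Transfer pysat metadata onto the xarray variables as attributes
    # before writing the file.
    for var in xr_data.data_vars:
        if var in meta.data.index:
            xr_data[var].attrs.update(meta.data.loc[var].to_dict())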

jklenzing commented 2 years ago

Testing this branch out against load_meta_proc, it no longer loads due to the data dimension check. The time information is not stored in data.dims.

jklenzing commented 2 years ago

Another question raised by this branch: the metadata fill values for data products that are strings (channel, hemisphere, etc.) are '*', which is not allowed by pysat and is dropped. Opening an issue there.

rstoneback commented 2 years ago

Another question raised by this branch: the metadata fill values for data products that are strings (channel, hemisphere, etc.) are '*', which is not allowed by pysat and is dropped. Opening an issue there.

Thanks for the issue. netCDF4 doesn't allow me to write files with fill information for string types. I wonder how they got that in there.

pysat metadata currently only supports one type per label, and fill is set for float values.

rstoneback commented 2 years ago

Testing this branch out against load_meta_proc, it no longer loads due to the data dimension check. The time information is not stored in data.dims.

Thanks for the heads up. A fix has been uploaded.

mandrakos commented 2 years ago

Thanks for the issue. netCDF4 doesn't allow me to write files with fill information for string types. I wonder how they got that in there.

pysat metadata currently only supports one type per label, and fill is set for float values.

Is there a difference between the variable attribute FILLVAL and the reserved _FillValue attribute? I just use the IDL NCDF_ATTPUT procedure when creating these files to set FILLVAL and don't do anything with _FillValue.

rstoneback commented 2 years ago

Thanks for the issue. netCDF4 doesn't allow me to write files with fill information for string types. I wonder how they got that in there. pysat metadata currently only supports one type per label, and fill is set for float values.

Is there a difference between the variable attribute FILLVAL and the reserved _FillValue attribute?

No. pysat is borrowing from ICON's file standard a bit and, for broadest compatibility, ICON's standard is a combo of a couple standards. I think the repeated fill information is for netCDF and SPDF compatibility.

I just use the IDL NCDF_ATTPUT procedure when creating these files to set FILLVAL and don't do anything with _FillValue.

Thanks for the info. The python netcdf library has yelled at me in the past about string metadata.
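For reference, the two conventions look like this when writing with the python netCDF4 library (file and variable names are placeholders):

    import netCDF4
    import numpy as np

    with netCDF4.Dataset('example.nc', 'w') as ncfile:
        ncfile.createDimension('time', 10)
        # _FillValue is reserved and must be set when the variable is created
        var = ncfile.createVariable('nmax', 'f8', ('time',), fill_value=1.0e31)
        # FILLVAL is an ordinary attribute (SPDF/ISTP style)
        var.FILLVAL = 1.0e31
        var[:] = np.full(10, 1.0e31)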

rstoneback commented 2 years ago

Another question raised by this branch: the metadata fill values for data products that are strings (channel, hemisphere, etc.) are '*', which is not allowed by pysat and is dropped. Opening an issue there.

I can't solve the metadata support part in the near term, but in practical terms, if the '*' values are replaced with '' within the GOLD load function, then users may still get the fill message even though it isn't documented in the pysat metadata.
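A minimal sketch of that replacement inside the load routine (the loop and label access are illustrative):

    # Replace the '*' fill value with an empty string for string-typed
    # variables so pysat does not drop the fill metadata.
    fill_label = meta.labels.fill_val
    for var in meta.keys():
        if meta[var, fill_label] == '*':
            meta[var] = {fill_label: ''}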

jklenzing commented 2 years ago

It's not a breaking change, and outside the scope here. Just noting it for down the line.

jklenzing commented 2 years ago

Experiencing more issues with the load_meta_proc branch. Since the functionality has to be updated for all xarray instruments in this package, I propose we push that compatibility to a future pull. Will document problems via issues or the pysat pull.

jklenzing commented 2 years ago

@rstoneback, if I revert the change in pysat.utils.io on line 1339 to

if epoch_name in data.dims:

then this branch is currently compatible with load_meta_proc.

Since xarray doesn't load a time value into the Dataset as a coordinate, I have to use the 'nscans' index as a dummy epoch. pysat develop will load this as an integer index, while pysat load_meta_proc will load it as a datetime index starting at 1970. In either case, the 'time' coordinate is replaced by actual times calculated from the scan start time.

In pysat develop, the swap_dims command fixes the indexing problem caused by treating 'nscans' as the epoch. For load_meta_proc, this command creates no change. It could be cleaned up down the line.
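For reference, the xarray step described above looks roughly like this (scan_start_times is a placeholder for the calculated times):

    # Attach the calculated times as a coordinate along the dummy epoch,
    # then promote them to the indexing dimension.
    data = data.assign_coords(time=('nscans', scan_start_times))
    data = data.swap_dims({'nscans': 'time'})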

I think the issue was caused by us moving in different directions on the two pulls, so the two fixes canceled each other out and created a new bug.

jklenzing commented 2 years ago

TODO:

rstoneback commented 2 years ago

I get an error if the Instrument has strict times. The time actually decreases in one place:

'2020-01-01T23:55:24.000000000', '2020-01-01T23:55:22.000000000'

rstoneback commented 2 years ago

I pushed some changes to load_meta_proc. GOLD is loading on my end.

rstoneback commented 2 years ago

You may also want to look at the SPDF example in writing files. Providing expanded labels to the load_... function will improve metadata use for users.

rstoneback commented 2 years ago

You may have already responded to this suggestion before.

How about making channels 'a' and 'b' inst_id values. Wouldn't that produce a continuous monotonic time signal for each inst_id?

pandas slicing and other things break when the time index isn't monotonic and unique.

jklenzing commented 2 years ago

You may have already responded to this suggestion before.

How about making channels 'a' and 'b' inst_id values. Wouldn't that produce a continuous monotonic time signal for each inst_id?

pandas slicing and other things break when the time index isn't monotonic and unique.

That's a long-term goal. In pysat terms, I think we want to have the two channels as two separate inst_id values. But since that would involve loading data twice, I had wanted to upgrade the constellation object to load a single file and break it apart into the two instruments. Likewise, it would result in duplicate downloads, etc. See pysat/pysat#614
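In instrument-module terms that could look something like this (hypothetical inst_id names, shown only to illustrate the split):

    # Hypothetical split of the GOLD channels into separate inst_ids.
    inst_ids = {'channel_a': ['nmax'], 'channel_b': ['nmax']}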

jklenzing commented 2 years ago

Thanks for the updated metadata. It's looking good when I run locally. The downside is that this pull is no longer backwards-compatible with previous releases, but we might as well push toward the future and set a version limit. The alternative is to use a try/except around the load statement and warn users that some metadata may be dropped with the current version of pysat. Any preference?

rstoneback commented 2 years ago

Thanks for the updated metadata. It's looking good when I run locally. The downside is that this pull is no longer backwards-compatible with previous releases, but we might as well push toward the future and set a version limit. The alternative is to use a try/except around the load statement and warn users that some metadata may be dropped with the current version of pysat. Any preference?

Not strongly. I did put in a version check for sami2 in pysatModels for compatibility with both. While the old style worked for both the current and upcoming pysat, I wanted to get the new style in, so I went the compatibility route. However, since there are (or will be) more changes in pysatNASA with the upcoming pysat metadata functionality via load_meta_proc, I lean towards putting in a version limit.
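For comparison, the compatibility route would look something like this (the version threshold is a placeholder, not the actual release number):

    import pysat
    from packaging import version

    if version.parse(pysat.__version__) < version.parse('3.1.0'):
        # Fall back to the old-style metadata handling for older pysat.
        ...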

jklenzing commented 2 years ago

OK, I've cleaned up the code assuming that load_meta_proc will be in the next pysat version. This pull goes in the waiting bin for the next pysat. It still needs the version cap, which I will add after the pysat RC is up.

jklenzing commented 2 years ago

New pysat is out! Time to clean up the queue of reviews. Currently failing in some jobs due to overload at coveralls, but the tests all pass. Will rerun jobs as needed to fix the coveralls issue.

jklenzing commented 2 years ago

Double-checked that GA is using the latest xarray (2022.6), and things are working ok here. We ran into some issues at sami2py with this as well.

The bug fixed here is present in most of the other pulls. Merging in from develop should fix most of those.