nsidc / granule-metgen

Metadata generator for direct-to-Cumulus era

Populate spatial content in UMM-G files #67

Closed: juliacollins closed this issue 4 weeks ago

juliacollins commented 2 months ago

Issue #45 stubbed out the geometry section of the UMM-G output with hard-coded values. Add the necessary logic to pull geospatial information from a netCDF file and add it to the UMM-G output. The goal is to handle the geospatial metadata generation automatically, not with an override file created by the operator.
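For illustration, here is a minimal sketch of the general idea, assuming the netCDF4 and pyproj packages. The perimeter_gpolygon helper, the x/y variable names, and the EPSG:3411 CRS (NSIDC Sea Ice Polar Stereographic North) are placeholder assumptions, not MetGenC's actual implementation:

    # Sketch: derive a UMM-G GPolygon from a gridded netCDF file's projected
    # x/y coordinates. Helper name, variable names, and CRS are assumptions.
    import netCDF4
    from pyproj import Transformer

    def perimeter_gpolygon(netcdf_path, crs="EPSG:3411"):
        with netCDF4.Dataset(netcdf_path) as ds:
            x = ds.variables["x"][:]
            y = ds.variables["y"][:]
        # Walk the grid perimeter once, corner to corner, with no duplicates.
        ring = (
            [(xi, y[0]) for xi in x]
            + [(x[-1], yi) for yi in y[1:]]
            + [(xi, y[-1]) for xi in reversed(x[:-1])]
            + [(x[0], yi) for yi in reversed(y[1:-1])]
        )
        # Reproject to lon/lat. UMM-G requires counter-clockwise vertex order,
        # so the ring may need reversing depending on the grid's orientation.
        to_lonlat = Transformer.from_crs(crs, "EPSG:4326", always_xy=True)
        points = [
            {"Longitude": round(lon, 2), "Latitude": round(lat, 2)}
            for lon, lat in (to_lonlat.transform(px, py) for px, py in ring)
        ]
        points.append(dict(points[0]))  # close the polygon
        return {"Geometry": {"GPolygons": [{"Boundary": {"Points": points}}]}}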

Acceptance criteria:

afitzgerrell commented 1 month ago

[Two images attached]

juliacollins commented 1 month ago

@afitzgerrell Here are a couple of example files demonstrating the current code. I removed the ExclusiveZone from the template for now, to keep things as simple as possible. My guess is we can truncate the precision on the lat/lon values, and I'll also come up with some sort of algorithm to guide the point thinning. I did not compare the reprojected lat/lon values to the x,y values to confirm they're at least in the right hemisphere! Feel free to evaluate the spatial validity if you have time.
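(For illustration only, one simple thinning strategy is to keep every Nth vertex of the ring while preserving closure; this thin_points helper is hypothetical, not the code under review:)

    # Toy point thinning for a closed ring of vertices: keep every Nth point,
    # then re-close the ring. (Hypothetical helper, not MetGenC code.)
    def thin_points(points, every=10):
        kept = points[:-1][::every]  # drop the closing point, then subsample
        kept.append(points[0])       # restore closure
        return kept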

NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_DUCk.nc.json

NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_DUCk.nc.json

juliacollins commented 1 month ago

@afitzgerrell Two new files! My "avoid duplicate points" code still needs refining and testing, but let me know how this looks so far.

NSIDC0081_SEAICE_PS_S25km_20211101_v2.0_DUCk.nc.json

NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_DUCk.nc.json

juliacollins commented 1 month ago

Two new files for your review, @afitzgerrell! Notes:

NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_DUCk.nc.json

NSIDC0081_SEAICE_PS_S25km_20211101_v2.0_DUCk.nc.json

juliacollins commented 1 month ago

@afitzgerrell the time string tests are included below, so you can see the current code behavior.

# Unit tests exercising netcdf_reader.ensure_iso, which normalizes netCDF
# time strings to ISO 8601. Assumes netcdf_reader (the module under test)
# is imported at the top of the test file.

def test_time_defaults_to_zero():
    # A bare date gets a midnight UTC timestamp.
    result = netcdf_reader.ensure_iso('2001-01-01')
    assert result == '2001-01-01T00:00:00+00:00'

def test_some_fractional_minutes_are_modified():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59.99')
    assert result == '2001-01-01T18:59:59.990000+00:00'

def test_other_fractional_minutes_are_ignored():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59.33')
    assert result == '2001-01-01T18:59:00.330000+00:00'

def test_fractional_seconds_are_retained():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59:10.99')
    assert result == '2001-01-01T18:59:10.990000+00:00'

def test_assumes_zero_seconds_for_hhmm():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59')
    assert result == '2001-01-01T18:59:00+00:00'

def test_correctly_formats_hhmmss():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59:59')
    assert result == '2001-01-01T18:59:59+00:00'
afitzgerrell commented 1 month ago

Hi Julia. First things first: I've checked the latest UMM-G files you attached, and they plot perfectly, with the expected CCW vertex ordering.

I've also updated the UMM-G guidelines to include not only Z as a UTC indicator but also the +00:00 offset (i.e., a zero offset). Everything I see regarding CMR indicates this should be valid, since it's claimed the ISO time standard is honored, and within that standard a -/+hh:mm suffix is valid to indicate any necessary offset, as you noted in today's standup. If this causes a problem during ingest, CMR's got some 'splaining to do!

What I see in existing UMM-G metadata for "real" NSIDC-0081 granules shows the fractional-seconds field used to describe fractions, e.g., "EndingDateTime": "2024-10-22T23:59:59.999Z" (of course here it's .999 rather than just .99 😐, and I don't think that's of critical importance since the gist is simply to describe a day). Rather than delay things by chasing down an answer from someone, I suggest you do as you proposed, irrespective of Python's rendering being rather zealous with its trailing zeros. I.e., YES! to everything you said you've done in your third bullet above; don't leave fractional minutes as-is.

Regarding your second bullet, I think for now we don't worry about undoing the enthusiasm of zero-happy Python (unless, in running a test through ingest, I find it fails solely because 2024-10-22T23:59:59.990000+00:00 is present instead of 2024-10-22T23:59:59.99+00:00).
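For reference, a quick standard-library illustration of the two renderings in question (the .999 value is from the example above):

    # Python's datetime.isoformat() keeps full microseconds and a numeric
    # offset; trimming to milliseconds with a "Z" designator takes a little
    # extra formatting. Both forms are valid ISO 8601.
    from datetime import datetime, timezone

    dt = datetime(2024, 10, 22, 23, 59, 59, 999000, tzinfo=timezone.utc)
    print(dt.isoformat())                                  # 2024-10-22T23:59:59.999000+00:00
    print(dt.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + 'Z')  # 2024-10-22T23:59:59.999Z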

I like your thinking in the fourth bullet. If the global attributes contain a time_coverage_duration attribute, and if I'm able to determine for certain that the time coordinate always defines the data start time, would it be easier overall to calculate the UMM-G EndingDateTime value from the file's time coordinate value?

I'm wondering if MetGenC could, as part of a nominal run, first check for the presence of a time coordinate and a time_coverage_duration attribute to calculate EndingDateTime, and if it doesn't find one or the other (for whatever reason), skip along to check for the presence of time_coverage_start & time_coverage_end? Or would this sort of functionality have a higher dev cost?
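A minimal sketch of that fallback order, assuming the netCDF4, isodate, and python-dateutil packages; the ending_datetime helper is illustrative, not a description of MetGenC:

    # Prefer time coordinate + time_coverage_duration; otherwise fall back to
    # the time_coverage_end global attribute. (Illustrative only.)
    from dateutil import parser as date_parser
    import isodate   # parses ISO 8601 durations such as "P1D"
    import netCDF4

    def ending_datetime(netcdf_path):
        with netCDF4.Dataset(netcdf_path) as ds:
            time_var = ds.variables.get("time")
            if time_var is not None and hasattr(ds, "time_coverage_duration"):
                start = netCDF4.num2date(time_var[0], time_var.units)
                return start + isodate.parse_duration(ds.time_coverage_duration)
            if hasattr(ds, "time_coverage_end"):
                return date_parser.parse(ds.time_coverage_end)
        raise ValueError(f"No usable temporal metadata in {netcdf_path}")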

Regarding your comment noting current code behavior, this all looks sound to me!!

Apropos of nothing, I do NOT like that GitHub comments lack actual timestamps (Slack does this too, to a certain extent, until the hours and references to "yesterday" expire and it's forced to name a time and day of week). I wish comments were timestamped so that in a multi-comment response I could cite a comment's date and time rather than "the comment tagged '15 hours ago'", since that will have changed by the time I actually hit the Comment button, and changed further by the time you get around to reading this! pfffft!

juliacollins commented 1 month ago

@afitzgerrell I am in 1000% agreement with you on the timestamp issue! Drives me crazy. (Posted on Friday, October 25, 2024 at 12:36pm ;-) )

Kevin did indeed find the magic Python package that handles "fractional minutes" time a wee bit better. The unit test cases now look like:

    import pytest

    # Parametrize wrapper reconstructed for context (the test function name
    # and the import of netcdf_reader are assumed); the params are verbatim.
    @pytest.mark.parametrize("time_str,expected", [
        pytest.param('2001-01-01', '2001-01-01T00:00:00.000+00:00', id="Date and no time"),
        pytest.param('2001-01-01 18:59:59', '2001-01-01T18:59:59.000+00:00', id="Date with time"),
        pytest.param('2001-01-01 18:59.5', '2001-01-01T18:59:30.000+00:00', id="Datetime and fractional minutes"),
        pytest.param('2001-01-01 18:59.500', '2001-01-01T18:59:30.000+00:00', id="Datetime and zero padded fractional minutes"),
        pytest.param('2001-01-01 18:59.34', '2001-01-01T18:59:20.000+00:00', id="Datetime and other fractional minutes value"),
        pytest.param('2001-01-01 18:59.999', '2001-01-01T18:59:59.000+00:00', id="Datetime and other fractional minutes value"),
        pytest.param('2001-01-01 18:59:20.666', '2001-01-01T18:59:20.666+00:00', id="Datetime and fractional seconds"),
        pytest.param('2001-01-01 18:59', '2001-01-01T18:59:00.000+00:00', id="Datetime and hours/minutes"),
    ])
    def test_ensure_iso(time_str, expected):
        assert netcdf_reader.ensure_iso(time_str) == expected

The first argument of each pytest.param is the input time string, the second is the expected output, and the id describes the case. Kevin updated the test to use a slicker bulk parameter input, so it's a different format from my previous example, but hopefully the gist is clear.

Regarding your comment:

I like your thinking in the fourth bullet. If the global attributes contain a time_coverage_duration attribute, and if I'm able to determine for certain that the time coordinate always defines the data start time, would it be easier overall to calculate the UMM-G EndingDateTime value from the file's time coordinate value?

The NSIDC-0081 files do contain a time_coverage_duration global attribute along with the coverage start and end attributes. Using it isn't necessarily easier, but it wouldn't be tremendously difficult either if we decide that's a more robust strategy. For now I suggest we stick with what we have and keep the "start + duration" approach in mind for future use cases.

You also wrote:

I'm wondering if MetGenC could, as part of a nominal run, first check for the presence of a time coordinate and a time_coverage_duration attribute to calculate EndingDateTime, and if it doesn't find one or the other (for whatever reason), skip along to check for the presence of time_coverage_start & time_coverage_end? Or would this sort of functionality have a higher dev cost?

This functionality would have some cost, but I think it's something we should tackle once we're beyond the Minimum Viable Product stage. We'll need to enhance the assumption-checking process in general (OK, actually implement an assumption-checking process) and provide alerts if MetGenC can't find expected attributes. Definitely a long-term implementation goal in my mind.
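As a rough illustration of what such a check might look like (hypothetical names throughout; nothing here exists in MetGenC yet):

    import logging

    # Warn, rather than fail mid-run, when expected global attributes are
    # missing from a netCDF file. (Hypothetical sketch of future behavior.)
    EXPECTED_GLOBAL_ATTRS = ("time_coverage_start", "time_coverage_end")

    def check_assumptions(dataset):
        missing = [a for a in EXPECTED_GLOBAL_ATTRS if not hasattr(dataset, a)]
        if missing:
            logging.warning("Missing expected global attributes: %s", ", ".join(missing))
        return not missing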

The code for this issue has been merged into main and is ready to test. In general, you can run the process as before (now using metgenc rather than instameta), being sure to remove any existing UMM-G files first. More specifically, I suggest we set up a time for you, Kevin, and me to walk through testing the several features recently added to the codebase.

afitzgerrell commented 1 month ago

24 October 2024: met as a group (Kevin, Julia, myself) and tested end-to-end ingest. Since then, and into the week of 28 October 2024, end-to-end ingest was found to be successful after a fair bit of cajoling and troubleshooting to pinpoint where hiccups occurred and what the underlying causes were.

Said hiccups largely took place on the Cumulus / Cumulus Collection config side of things. On the DUCklings it was (regrettably, and we'd argue without any particular logic behind it) necessary to switch from +00:00 to Z to indicate UTC time, and to modify a "/" (deep in the code) that was being taken literally during ingest so it would be interpreted properly.

I've updated the CNM and UMM-G guideline documents back to noting that only Z will work to indicate UTC, and removed all references to +00:00 being an option.

I've reconfirmed that GPolygons are indeed added to UMM-G files ✅, that the vertex order is anti-clockwise ✅, and that the spatial representation matches existing NSIDC-0081 granules when viewed in UAT ✅ 🎉.

lisakaser commented 4 weeks ago

🎉 This end-to-end test was a big milestone. Great job, all!