@afitzgerrell Here are a couple of example files demonstrating the current code. I removed the `ExclusiveZone` from the template for now, to keep things as simple as possible. My guess is we can truncate the precision on the lat/lon values, and I'll also come up with some sort of algorithm to guide the point thinning. I did not compare the reprojected lat/lon values to the x,y values to confirm they're at least in the right hemisphere! Feel free to evaluate the spatial validity if you have time.
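If it helps, a spot check along those lines might look roughly like the sketch below (illustration only, not project code; it assumes pyproj is available, that the grid is NSIDC Sea Ice Polar Stereographic North (EPSG:3413), and uses a made-up corner coordinate):

```python
from pyproj import Transformer

# Sketch only: reproject one grid corner from NSIDC Sea Ice Polar Stereographic
# North (EPSG:3413 assumed) to lat/lon and confirm it lands in the right hemisphere.
to_lonlat = Transformer.from_crs("EPSG:3413", "EPSG:4326", always_xy=True)
lon, lat = to_lonlat.transform(-3_850_000, 5_850_000)  # example corner, in meters
assert lat > 0, "a northern-hemisphere grid point should reproject to a positive latitude"
print(round(lon, 4), round(lat, 4))  # truncated precision, along the lines mentioned above
```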
@afitzgerrell Two new files! My "avoid duplicate points" code still needs refining and testing, but let me know how this looks so far.
Two new files for your review, @afitzgerrell ! Notes:

- The time strings use `+00:00` style time zone information rather than `Z`. Both should be correct, but if the data ingest people really want `Z` we can update the time string handling (it does add complexity to the code, though, because I have yet to find a Python add-on that deals with time the way I want to deal with it).
- I could not convince `isoformat()` to generate a string that did not have a bunch of zero-padding in the microseconds value. This too can be changed at the cost of extra code complexity (unless somebody knows about the magical Python package that does this).
- One of the files has the value `'2021-11-01 23:59.99'` in the `time_coverage_end` global attribute. I added code to change the fractional minutes to instead use a seconds value of `59` and put the `99` in the microseconds field. The output time value is thus `23:59:59.99` (which gets rendered by Python's `isoformat` as `23:59:59.990000`). We could also leave it as-is, if the data ingest side of the process can handle fractional minutes.
- Alternatively, we could use the `time_coverage_duration` global attribute to calculate the ending time coverage value (making the assumption that the coordinate data represent the beginning of the temporal coverage). It boils down to what assumptions we think are safest.

@afitzgerrell the time string tests are included below, so you can see the current code behavior.
```python
def test_time_defaults_to_zero():
    result = netcdf_reader.ensure_iso('2001-01-01')
    assert result == '2001-01-01T00:00:00+00:00'

def test_some_fractional_minutes_are_modified():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59.99')
    assert result == '2001-01-01T18:59:59.990000+00:00'

def test_other_fractional_minutes_are_ignored():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59.33')
    assert result == '2001-01-01T18:59:00.330000+00:00'

def test_fractional_seconds_are_retained():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59:10.99')
    assert result == '2001-01-01T18:59:10.990000+00:00'

def test_assumes_zero_seconds_for_hhmm():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59')
    assert result == '2001-01-01T18:59:00+00:00'

def test_correctly_formats_hhmmss():
    result = netcdf_reader.ensure_iso('2001-01-01 18:59:59')
    assert result == '2001-01-01T18:59:59+00:00'
```
Hi Julia. First things first: I've checked the latest umm-g files you attached and they plot perfectly, and as expected in terms of CCW vertex ordering.
I've also updated the umm-g guidelines to include not only Z as a UTC indicator, but also the +00:00 offset (lack of offset in this case). Everything I see regarding CMR indicates this should be valid, since it's claimed the ISO time standard is honored, and within the ISO time standard -/+hh:mm is valid to indicate any offset necessary, as you noted in today's standup. If this causes a problem during ingest, CMR's got some 'splaining to do!
What I see in existing UMM-G metadata for "real" nsidc-0081 granules shows the microseconds field used to describe fractions, e.g., `"EndingDateTime": "2024-10-22T23:59:59.999Z"` (of course here it's .999 rather than just .99 😐...and I don't think that's of critical importance, since the gist is simply to describe a day). Without delaying things by my chasing down an answer from someone, I suggest you do as you proposed, irrespective of Python's rendering being rather zealous with its trailing zeros. I.e., YES! to everything you said you've done in your third bullet above, and no to leaving it as-is with fractional minutes.
Regarding your second bullet, I think for now we don't worry about you trying to undo the enthusiasm of zero-happy python (unless in running a test through ingest I find it fails solely because of 2024-10-22T23:59:59.990000+00:00 being present vs. 2024-10-22T23:59:59.99+00:00).
I like your thinking in the fourth bullet. If global attributes contain a time_coverage_duration attribute, and if I'm able to determine for certain that the time coordinate always defines the data start time, would it be easier overall to calculate the EndingDateTime value for umm-g from a file's time coordinate's value?
I'm wondering if MetGenC could as part of a nominal run check first for the presence of a time coordinate and a time_coverage_duration attribute to calculate EndingDateTime, but if it doesn't find one or the other (for whatever reason) it skips along to check for the presence of time_coverage_start & time_coverage_end? Or would this sort of functionality have a higher dev cost?
Regarding your comment noting current code behavior, this all looks sound to me!!
Apropos of nothing, I do NOT like when comments lack actual timestamps, as GitHub's do (and Slack's do to a certain extent, until hours and references to "yesterday" expire and it's forced to name a time and day of week). I wish they were timestamped so that in a multi-comment response I can name a comment's date/time rather than "the comment tagged with '15 hours ago'", since that'll change by the time I actually hit the "Comment" button, and change further by the time you get around to reading this! pfffft!
@afitzgerrell I am in 1000% agreement with you on the time stamp issue! Drives me crazy. (Posted on Friday, October 25, 2024 at 12:36pm ;-) )
Kevin did indeed find the magic python package that handles "fractional minutes" time a wee bit better. The unit test cases now look like:
```python
pytest.param('2001-01-01', '2001-01-01T00:00:00.000+00:00', id="Date and no time"),
pytest.param('2001-01-01 18:59:59', '2001-01-01T18:59:59.000+00:00', id="Date with time"),
pytest.param('2001-01-01 18:59.5', '2001-01-01T18:59:30.000+00:00', id="Datetime and fractional minutes"),
pytest.param('2001-01-01 18:59.500', '2001-01-01T18:59:30.000+00:00', id="Datetime and zero padded fractional minutes"),
pytest.param('2001-01-01 18:59.34', '2001-01-01T18:59:20.000+00:00', id="Datetime and other fractional minutes value"),
pytest.param('2001-01-01 18:59.999', '2001-01-01T18:59:59.000+00:00', id="Datetime and other fractional minutes value"),
pytest.param('2001-01-01 18:59:20.666', '2001-01-01T18:59:20.666+00:00', id="Datetime and fractional seconds"),
pytest.param('2001-01-01 18:59', '2001-01-01T18:59:00.000+00:00', id="Datetime and hours/minutes"),
```
The first argument in each line is the input time string, the second the expected output, and the third argument describes the example. He updated the test case to use a slicker bulk parameter input so it's a different format from my previous example, but hopefully the gist is clear.
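For reference, the conversion those cases imply could be sketched roughly like this; it's an illustration of the fractional-minutes handling (truncating the fraction to whole seconds), not the actual package or the merged code, and the function name is made up:

```python
import re
from datetime import datetime, timedelta, timezone

def iso_with_fractional_minutes(value: str) -> str:
    """Illustrative sketch: normalize 'YYYY-MM-DD HH:MM.fff' fractional-minute
    strings (and plain date/datetime strings) to ISO 8601 with a +00:00 offset."""
    match = re.fullmatch(r"(\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2})\.(\d+)", value)
    if match:
        base = datetime.fromisoformat(match.group(1)).replace(tzinfo=timezone.utc)
        fraction = float("0." + match.group(2))                # '.5' -> 0.5 of a minute
        stamp = base + timedelta(seconds=int(fraction * 60))   # truncate, as in the cases above
    else:
        stamp = datetime.fromisoformat(value).replace(tzinfo=timezone.utc)
    return stamp.isoformat(timespec="milliseconds")
```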
Regarding your comment:
> I like your thinking in the fourth bullet. If global attributes contain a time_coverage_duration attribute, and if I'm able to determine for certain that the time coordinate always defines the data start time, would it be easier overall to calculate the EndingDateTime value for umm-g from a file's time coordinate's value?
The NSIDC-0081 attributes do contain a `time_coverage_duration` global attribute along with the coverage start and end attributes. Using it isn't necessarily easier, but it's not tremendously difficult either if we decide that's a more robust strategy. For now I suggest we stick with what we have and keep the "start + duration" approach in mind for future use cases.
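If we do eventually go the "start + duration" route, the arithmetic itself is simple. Here's a rough sketch assuming the duration arrives as a basic ISO 8601 string such as a daily "P1D" (a real implementation would want a full duration parser, e.g. the isodate package):

```python
from datetime import datetime, timedelta, timezone

def coverage_end(start: str, duration: str) -> str:
    """Sketch: derive an ending datetime from time_coverage_start (or the time
    coordinate) plus a simple ISO 8601 duration such as 'P1D' or 'PT24H'."""
    begin = datetime.fromisoformat(start).replace(tzinfo=timezone.utc)
    if duration.startswith("PT") and duration.endswith("H"):
        delta = timedelta(hours=float(duration[2:-1]))
    elif duration.startswith("P") and duration.endswith("D"):
        delta = timedelta(days=float(duration[1:-1]))
    else:
        raise ValueError(f"unhandled duration format: {duration}")
    return (begin + delta).isoformat()

# coverage_end('2021-11-01 00:00:00', 'P1D') -> '2021-11-02T00:00:00+00:00'
```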
You also wrote:
> I'm wondering if MetGenC could as part of a nominal run check first for the presence of a time coordinate and a time_coverage_duration attribute to calculate EndingDateTime, but if it doesn't find one or the other (for whatever reason) it skips along to check for the presence of time_coverage_start & time_coverage_end? Or would this sort of functionality have a higher dev cost?
This functionality would have some cost, but I think it's something we need to tackle once we're beyond the Minimum Viable Product stage. We'll need to enhance the assumption-checking process in general (ok, actually implement an assumption-checking process) and provide alerts if MetGenC can't find expected attributes. Definitely a long-term implementation goal in my mind.
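To make the eventual shape of that concrete, the check you described might look something like the sketch below. This is purely hypothetical (not current MetGenC behavior) and assumes the granules are opened with xarray:

```python
import xarray as xr

def temporal_strategy(path: str) -> dict:
    """Hypothetical sketch of the fallback: prefer the time coordinate plus
    time_coverage_duration, otherwise use the explicit coverage attributes."""
    ds = xr.open_dataset(path)
    duration = ds.attrs.get("time_coverage_duration")
    if "time" in ds.coords and duration:
        # Assume the time coordinate marks the beginning of temporal coverage;
        # the end could then be derived as in the start + duration sketch above.
        return {"strategy": "start + duration",
                "start": str(ds["time"].values.min()),
                "duration": duration}
    # Fall back to the explicit global attributes, alerting if they're missing too.
    return {"strategy": "explicit attributes",
            "start": ds.attrs.get("time_coverage_start"),
            "end": ds.attrs.get("time_coverage_end")}
```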
The code for this issue has been merged into `main` and is ready to test. In general, you can run the process as before (now using `metgenc` rather than `instameta`), being sure to remove any existing UMM-G files first. More specifically, I suggest we set up a time for you, Kevin, and me to walk through the testing of the several different features recently added to the codebase.
24 October 2024: met as a group (Kevin, Julia, myself) and tested end-to-end ingest. Since then, and into the week of 28 October 2024, end-to-end ingest was found to be successful after a fair bit of cajoling and troubleshooting to pinpoint where hiccups occurred and what the underlying causes were.
Said hiccups largely took place on the Cumulus / Cumulus Collection config side of things. On the DUCklings' side it was (regrettably, and we'd argue not with any particular logic applied) necessary to switch from using +00:00 to Z to indicate UTC time, and to modify a "/" (deep in the code) that was being taken literally during ingest so that it would be interpreted properly.
I've updated CNM and UMM-G guideline documents back to noting that only Z will work to indicate UTC, and removed all reference to +00:00 being an option.
I've reconfirmed that gpolygons are indeed added to umm-g files ✅, that the vertex order is anti-clockwise ✅, and that they match existing nsidc-0081 granules' spatial representation when viewed in UAT ✅ 🎉.
🎉 This end-to-end test was a big milestone. Great job, all!
Issue #45 stubbed out the geometry section of the UMM-G output with hard-coded values. Add the necessary logic to pull geospatial information from a netCDF file and add it to the UMM-G output. The goal is to handle the geospatial metadata generation automatically, not with an override file created by the operator.
Acceptance criteria: