Closed pjhartzell closed 2 years ago
Thank you for the report. I'm very sorry about the issues, which did not appear in the limited set of examples that I had access to. I've now included all the examples files in the test suite, fixed the issues as described below and will commit to GitHub shortly. Please see below what the (temporary) solutions look like. We should contact NOAA for clarification. Would you do that or shall I?
Sometimes it seems the frame times are missing. I'll leave them out of the geoparquet now if they are missing. Still, we should ask NOAA whether that is intentional.
This means for some variables that reported count (e.g. in event_count) is not reflecting the number of rows in the variables (e.g. event_id). For now, I'm using the count value by default but if it doesn't match I'm falling back to the number of rows that are actually available and raise a warning. Still, we should ask NOAA whether this is intentional and what we should do in such a case.
It seems for some time offsets a None/NaN/null value is reported. For now, I'm simply writing "null" to the geoparquet file, too. I'm also adding a short description to the column so that users are aware. Still, we should ask NOAA whether that is intentional and/or has any implications.
Some files report GEOS_Test instead of GOES_West/East. The example file is from before the satellite has "finished drifting"
GOES-S was launched March 1, 2018. GOES-17 began drifting to its GOES-West operational location of 137.2 degrees west longitude on October 24, 2018. Drift was completed on November 13, 2018 and nominal operations resumed on November 15, 2018. The satellite was declared GOES-West on February 12, 2019. The change from 137.0 to 137.2 was made for operational efficiency and to minimize impacts from other geostationary satellites.
Your example is from October 2018. I'm not sure how to handle them as the geometry will likely not be the geometry that we have pre-defined for the final position of both satellites. I've added some fallback code, but that only applies for after the drifting period, so not sure how to handle the files beforehand. All code right now relies on the final positions. Something to ask NOAA about?
This happens when no events/flashed occured in the timeframe. Will be fixed.
Thanks for the quick updates! Yeah, we definitely could have supplied more examples. But this is pretty standard. Odds and ends turn up once the full dataset starts to be processed.
I'm good with the changes. I agree that the items from when the satellite was drifting (GOES_Test) are an issue. Perhaps a warning about potentially bad geometry in the Item description? Something to discuss with @gadomski later today, along with potential contact with NOAA.
Shall I re-release or do we wait for the NOAA feedback?
Let's wait until we get fixes for all the above issues, then release. If @pjhartzell needs to use this package with only some of the fixes, he can pin to a commit.
@gadomski found some docs relating to the first error: KeyError: 'flash_frame_time_offset_of_first_event'. Evidently there was a version change (the version is not indicated anywhere) that added the three variables containing ..._frame_time_offset_...
in their names. Older files will have 45 variables; newer files will have 48.
How to Determine If Data is DO.07.00.00 or a Prior Version: There are no external indications
in the file that the format has changed. One way to determine which version of the file you are
working with is to count the number of variables in the file. If it is 45, it is the OLD format, will not
have the SUVs (but will still have the “_Unsigned” designation for the non-standard unsigned
integers). If the variable count is 48, it is the DO.07.00.00 version of the file and will have the list
of SUVs (above) along with the “_Unsigned” designation for the non-standard unsigned integers.
The table above this snippet in the doc lists the names of the variable that were added in version DO.07.00.00.
parquet.py
handles the issue. But not sure if you want to change the way it is handled or update the logger warning now that we have some docs on what is going on.
Thanks a lot, yeah I will make a small change and make the tests a bit more precise!
I've just made the changes. Any news from NOAA regarding the other points? @gadomski
Also in the doc that @gadomski found are warnings about missing _Unsigned
attributes on some variables. This is the cause of the TypeError: unsupported type for timedelta seconds component: NoneType errors. There are unsigned integer bytes being incorrectly read as signed integers. Note that the netCDF4 library honors the _Unsigned
attribute if it exists.
We might need to check each variable that is supposed to have an _Unsigned
attribute to verify it exists. If it doesn't, some modifications are necessary to obtain the correct values. Here's a gist with a notebook that provides a link to a table that lists which variables should have an _Unsigned
attribute and two approaches that I've found to correctly read the variable values when the _Unsigned
attribute is missing.
I haven't found anything for the IndexError: index exceeds dimension bounds error. I think your method of overriding the provided count
with the actual count when they disagree is good for now.
Ah... interesting. I checked the values with Panoply and assumed that what I get out there is actually correct and based my fix on this, which it seems it is not because the file is effectively defect due to the missing unsigned indicator. I'll look into this. I'm wondering whether it should be our responsibility to fix defect files? Users of the netCDF files that read it with "normal software" like Panoply will also get nan values and as such may get different results based on whether they use the "fixed" geoparquet or the "defect" netCDF files. That's somehow weird. (I assume normal users would not read these guide books / PDFs from NOAA.)
I assume normal users would not read these guide books / PDFs from NOAA.
Yeah, I agree that this is a hidden bug and most users will struggle to find it and fix it. @gadomski, your thoughts on repairing the netCDF files? I hate to modify source files, but it definitely creates a disconnect between the corrected data in the parquet files and the defective source file.
In principle, we could simply provide an option for this. See #18 for details.
So going through the issues again, I think we were able to solve them somewhat and there are two that could be sent to NOAA for clarification. Thanks for the help Pete and Preston.
Agreed that we can still follow up with NOAA about the index exceeding dimension bounds.
Some thoughts on the drift period, the "GOES-Test orbital_slot
, and use of bbox for geometry:
I looked in the data holdings and no netCDF files exist for the drift time periods noted and handled in the code. There is no GOES-16 data in 2017 and there is a gap in the GOES-17 data from October 25, 2018 to November 12, 2018 (inclusive). So that logic could be removed.
All GOES-17 files prior to the drift period use the "GOES-Test" orbital slot and have constant values for the lat_field_of_view_bounds
and lon_field_of_view_bounds
variables:
lat = [ 66.56 -66.56]
lon = [-156.06 -22.94]
Since we are using the bbox for the geometry, I think we have what we need to produce Items for the GOES-17 data that is in the GOES-Test orbital slot.
All GOES-16 and GOES-17 files after their respective drift periods use the GOES-East and GOES-West orbital slots, respectively.
Thanks for verifying with the data holding that the fallback code is not needed. I removed/changed it to use the platform for identifying the actual slot in case it Goes-Test.
So this issue is basically finished except that we could still ask NOAA about the remaining questions, right?
we could still ask NOAA about the remaining questions, right?
I believe the only remaining question is the "IndexError: index exceeds dimension bounds" question. I'll tag @gadomski on whether that's worth following up with NOAA.
I opened a PR with some edits on how I think the Test orbital slot should be handled. Let me know your thoughts.
I'll review the 0.2.0 release PR once we finish this up.
Opened a new issue for the issue that reported counts are not reflecting the number of rows in the variables: #22 So now we can close this issue which got a bit hard to follow.
Seeing some item creation failures. I randomly sampled the Planetary Computer's holdings for 1000 NetCDFs and cataloged the error tracebacks in the attached file: log_summary.txt
It looks like there are some inconsistencies in the NetCDF content in up to 10% of the sampled files. I've bundled the example NetCDF files listed in the log_summary in the attached zip file: example_files.zip. The NetCDF files are also publicly available.