Closed djhoese closed 2 years ago
How about:

- `file_start_time` and `file_end_time` for those datasets with filenames containing these times. For Himawari we'd have `2022-01-31 12:00:00` as `file_start_time` and no `file_end_time`, for example.
- `scheduled_start_time` and `scheduled_end_time` for datasets that have these defined (usually, but not always, the same as the previous times I suspect). For Himawari these would be `2022-01-31 12:00:00` and `2022-01-31 12:10:00` for the start and end times respectively.
- `actual_start_time` and `actual_end_time` for the times that scanning actually began and ended. For Himawari these would be something like `2022-01-31 12:00:20.022` and `2022-01-31 12:09:54.303`.
- `forecast_time` as you suggest.

I like this idea. My suggestion would be to default to the `file_start_time` or `scheduled_start_time` rather than `actual_start_time`, as this is more likely to be consistent across all datasets in a composite. If we're making changes here I'd also suggest, as you hint at, that we use an average of the start and end times for calculating the solar angles. Using the start time alone will produce some non-negligible differences for the long scanning sensors (full orbits, or the older GEOs with 30 min scans).
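The averaging idea in this comment could be sketched roughly as below, using the Himawari example times quoted above. The helper name `midpoint_time` is made up for illustration and is not an existing Satpy function.

```python
from datetime import datetime


def midpoint_time(start: datetime, end: datetime) -> datetime:
    """Return the time halfway between start and end (hypothetical helper)."""
    return start + (end - start) / 2


# Example Himawari scanning times quoted in the comment above
actual_start = datetime(2022, 1, 31, 12, 0, 20, 22000)
actual_end = datetime(2022, 1, 31, 12, 9, 54, 303000)

# A single nominal observation time that could be used for solar angles
nominal_obs_time = midpoint_time(actual_start, actual_end)
print(nominal_obs_time)
```

For a roughly 10 minute scan the midpoint differs from the start time by about 5 minutes, which is the kind of offset that matters more for the longer scans mentioned above.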
`filename_start_time`. Otherwise people will think it is the time that is in the file. I think we've generally fallen into the practice of end times defaulting to the start time when end is not available. I don't really like "actual" too much, but I also don't know what is common for data producers to use in their files. I've always thought "observation" sounded more accurate and descriptive. Lastly, I think in a lot of readers we could assume that `filename_start_time` is `scheduled_start_time`, but then again readers don't have to define all of these things.

Regarding averaging, I should point out that PR #473 does averaging in the `combine_metadata` function for any datetime fields. This is used for multi-segment data formats like AHI HSD. I know you're talking about averaging start/end to get a nominal observation time for a scene but still wanted to mention it.
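To illustrate what averaging datetime fields across segments can look like, here is a minimal sketch. It is not Satpy's actual `combine_metadata` code; the function name `average_datetimes` and the segment times are assumptions made for this example.

```python
from datetime import datetime, timedelta


def average_datetimes(times: list[datetime]) -> datetime:
    """Average datetimes by averaging their offsets from the first one."""
    reference = times[0]
    total_offset = sum((t - reference for t in times), timedelta(0))
    return reference + total_offset / len(times)


# Hypothetical per-segment start times of one multi-segment image
segment_starts = [
    datetime(2022, 1, 31, 12, 0, 20),
    datetime(2022, 1, 31, 12, 1, 20),
    datetime(2022, 1, 31, 12, 2, 20),
]
combined_start = average_datetimes(segment_starts)
print(combined_start)
```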
in order to cover the potential use cases i give my thumbs up for various suggested start and end times. i'm ok with the suggested naming of `filename_start_time`, `observation_start_time` (which would be redundant to `start_time`, right?), and `scheduled_start_time`. also, `observation_start_time` sounds more descriptive/accurate than `actual_start_time`.
`start_time` is the generic "the user needs any time to refer to this data, it doesn't matter what one" value. It could be any of the above mentioned times and at any accuracy BUT it should be "scheduled" time when available. Or put another way, `start_time` should be whatever time is most consistent between like-datasets from the same sensor/processing suite.
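One way a reader could apply this rule is sketched below. The preference order after "scheduled" and the helper name `pick_start_time` are assumptions for illustration, not something agreed in this thread.

```python
from datetime import datetime

# Assumed preference order: the most consistent time first ("scheduled"),
# then whatever else the reader happens to have.
PREFERRED_KEYS = (
    "scheduled_start_time",
    "filename_start_time",
    "observation_start_time",
)


def pick_start_time(times: dict) -> datetime:
    """Return the generic start_time from the available time fields."""
    for key in PREFERRED_KEYS:
        value = times.get(key)
        if value is not None:
            return value
    raise KeyError("no start time available")


# Himawari example values from earlier in the thread
himawari = {
    "scheduled_start_time": datetime(2022, 1, 31, 12, 0, 0),
    "observation_start_time": datetime(2022, 1, 31, 12, 0, 20, 22000),
}
print(pick_start_time(himawari))
```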
Good initiative, thanks for starting this! I like the proposed triplet (`scheduled_start_time`, `filename_start_time`, `observation_start_time`) and a generic `start_time`.
Just a note here: if we start using cloud data, `filename` might become obsolete...
Overall I agree with the principle. I would just like to make sure that when a coarser time is used for eg angle computation that the user is clearly notified of it, with an explanation/reference/link on how to change that setting.
So in the (`scheduled_start/end_time`, `filename_start/end_time`, `observation_start/end_time`) model, how are we covering the "repeat cycle" start and end times? E.g. for SEVIRI 12:00:00-12:15:00 against 12:00:13-12:12:43 of the scanning times.
From the discussion above, I understand that those would be the `scheduled` times?
If yes, maybe `scheduled` is not the right word, as it could be interpreted as "the time of the scheduled/planned acquisition, that differs from the real acquisition because of some (small) instrument/scheduling delays".
For example in the SEVIRI header, we actually have "planned_forward_scan_end" times defined, that differ only by some milliseconds from the true observation end times. Would these times then be the `scheduled_end_time`? In that case, we don't have a time representing e.g. 12:15 anymore.
In other words, the mismatch between e.g. SEVIRI's and FCI's scanning end times and repeat cycle end times is not due to "scheduling", but is caused by the instruments needing some time to retrace the scanner back to the start position (plus calibration operations).
So, in my opinion, we need to have a `repeat_cycle`/`slot` time (maybe `slot` works best for both GEO and LEO), either replacing or alongside `scheduled`, to make it unequivocal.
@mraspaud Good point. I'm not sure if there is a good name otherwise. Like `uri_start_time` still assumes there is something in a URL or Unique Identifier that represents the time. I suppose in the case of a URL API there wouldn't be any equivalent to a `filename` time. I guess a time in the query, but I'd almost argue if that was somehow included in the result that it should be its own thing (ex. `query_start_time`).
@ameraner You'll have to forgive me, but I'm not super familiar with the difference in the terms you're using. In the SEVIRI or FCI case, if we simplify it and say the instrument should start scanning a new image every 15 minutes, I would say the `scheduled_start_time` would be 12:03:00 and `scheduled_end_time` be 12:18:00 (example times to show that it doesn't have to be at the start of the hour). However, I can see an argument for the `scheduled_end_time` being set to 12:16:00 if the actual physical scanning takes less than 15 minutes to complete. Or is this the discussion that we're having? The "slot" of time that the instrument has to make the observation versus the time range it actually takes to record that observation? Which one is "repeat_cycle"/"repeat_slot"?
> Or is this the discussion that we're having? The "slot" of time that the instrument has to make the observation versus the time range it actually takes to record that observation?
Exactly. Those two times can differ by several minutes. In the SEVIRI world, for example, most users would consider one SEVIRI image to be "valid" for 15 minutes (the time period that passes between the repetitions of the acquisition, hence commonly called "repeat cycle", or "slot" in my argumentation), even though the actual scanning takes only 12 minutes. So for 3 minutes there is an acquisition gap, where the instrument is not recording anything, while preparing to start the next acquisition.
Any visualisation tool showing SEVIRI images would keep one SEVIRI image on screen for 15 minutes, disregarding the 3 min acquisition gap. This is what Sauli was referring to on Slack, talking about SIFT:
> The newly developed `TimelineManager` in SIFT would need to know which FCI/SEVIRI data to show in the background of frequently updating LI data (every 30s or a minute). If we use the actual `end_time` there will be a gap between 12:13 and 12:15 as there is no "valid" data for this time period and no background data would be displayed.
--> hence the need for Satpy to provide these gapless `slot` start and end times.
On the other hand, any user e.g. wanting to compare precisely SEVIRI against another instrument would only consider the actual scanning period - hence the need to provide the `observation` times.
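The gap problem described here can be reproduced with a small sketch. The image list, the date, and the `find_image` helper are hypothetical; the clock times are the SEVIRI examples used in this thread, rounded to whole seconds.

```python
from datetime import datetime


def t(hour, minute, second=0):
    """Build a datetime on an arbitrary example day."""
    return datetime(2022, 1, 31, hour, minute, second)


# Two consecutive SEVIRI images with gapless slots and shorter scans
images = [
    {"slot": (t(12, 0), t(12, 15)), "observation": (t(12, 0, 13), t(12, 12, 43))},
    {"slot": (t(12, 15), t(12, 30)), "observation": (t(12, 15, 12), t(12, 27, 41))},
]


def find_image(images, when, key):
    """Return the first image whose `key` time range contains `when`."""
    for image in images:
        start, end = image[key]
        if start <= when < end:
            return image
    return None


# A frequently updating product (e.g. LI) at 12:13:30
li_time = t(12, 13, 30)
background_by_slot = find_image(images, li_time, "slot")
background_by_observation = find_image(images, li_time, "observation")
print(background_by_slot is images[0], background_by_observation)
```

With slot times the first image is still returned at 12:13:30, while the observation range finds nothing because the scan ended at 12:12:43.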
Now I'm confused about what we're confused about. In this SEVIRI discussion there are 2 time ranges, right? We have the "scheduled" or "repeat cycle" time which for SEVIRI would be 15 minutes and would include "nominal" times where by nominal I mean the timing the instrument was trying to meet (within this 15 minutes). We also have the "observation" time where the data actually represents the Earth between the first/start time and the second/end time.
So @ameraner you originally said:
> From the discussion above, I understand that those would be the scheduled times?
> If yes, maybe scheduled is not the right word, as it could be interpreted as "the time of the scheduled/planned acquisition, that differs from the real acquisition because of some (small) instrument/scheduling delays".
My answer is yes, that's exactly what scheduled time is. The observation time is this acquisition time.
Or...are you saying there is the scheduled time (the "pretty" human-readable time 12:03:00), the repeat cycle time (12:03:15.444 to 12:18:15.444), and the observation time (12:03:33.677 to 12:15:24.323)? Where repeat cycle time is the overall 15 minute time slot of the observation but observation time is the actual time range that data was being recorded (shifted from slot time because of additional hardware movement or calibration).
I feel like we're slowly converging on the concepts, with still some nomenclature misunderstandings.
Maybe with a full example what I mean becomes more clear. In the SEVIRI case, we end up with 3 times:

- 12:00:00.000-12:15:00.000 -> "repeat cycle" time or "slot" time: the time period allocated for this acquisition, marks the total validity period of the data
- 12:00:13.000-12:12:43.102 -> "scheduled" time, as synonym for planned time, the foreseen acquisition/scanning period in the satellite planning by the data provider
- 12:00:13.000-12:12:43.145 -> "observation" time, the actual recorded scanning time. Differs slightly from the scheduled time due to an instrument delay

The next image could have:

- 12:15:00.000-12:30:00.000 -> "repeat cycle" time or "slot" time: note: no gap between this start time and the end time of the previous image
- 12:15:12.100-12:27:41.223 -> "scheduled" time
- 12:15:12.103-12:27:41.225 -> "observation" time

So the only pretty time is the `slot` time, as it describes the pre-defined time frame for each acquisition cycle. The observation time is the one of scientific interest. The scheduled time, defined as the time of the planned acquisition by the data provider, is probably not of much use, as it's only related to the data provider operations.
In the first comments, I think, what I call here "slot" time was being referred to as "scheduled". My whole point is that the word "scheduled" can be misunderstood to refer to the (quite useless) planning time as described above. Using `slot` instead signalises that we are talking about a larger, predefined, abstract time period that contains the observation.
Maybe I'm overthinking this as I'm too influenced by the specific nomenclature and timing information of SEVIRI. If this still doesn't make sense I'll be able to sleep ok also with "scheduled" 😄
@ameraner Makes sense. I agree that the "planned time" is not something most of us probably deal with and was not something I had considered. I'm going to ask around among the GOES-R folks and see what other terminology is used for these kinds of things. Maybe we can land on something that is clearer than "scheduled" time.
One last question, I assume that all SEVIRI file formats provide all 3 time ranges you've talked about?
The observation start/end times and the scheduled end time are explicitly in the files. The planned start time is not given (because by the time the file is created, the "true" observation start time is known). The repeat cycle times are not there either, we always have to calculate them manually by rounding the observation times.
The repeat cycle `start_time` is in the HRIT filenames, though.
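The "rounding the observation times" step mentioned above could look roughly like this. It is a sketch that floors the observation start to a 15 minute grid counted from midnight; the function name `repeat_cycle_slot` is hypothetical, and flooring would pick the wrong slot for an instrument whose scan starts slightly before the nominal slot start.

```python
from datetime import datetime, timedelta


def repeat_cycle_slot(observation_start, repeat_cycle=timedelta(minutes=15)):
    """Floor an observation start time to its repeat cycle (slot) range."""
    midnight = observation_start.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = observation_start - midnight
    # Number of whole repeat cycles since midnight (timedelta // timedelta -> int)
    cycles = elapsed // repeat_cycle
    slot_start = midnight + cycles * repeat_cycle
    return slot_start, slot_start + repeat_cycle


# SEVIRI example from this thread: scanning started at 12:00:13
slot_start, slot_end = repeat_cycle_slot(datetime(2022, 1, 31, 12, 0, 13))
print(slot_start, slot_end)
```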
So I brought this up with some NOAA folks and some other people at SSEC and the general feeling was that "slot" by itself is a confusing word since people tend to think of orbital slots of satellites. I got generally positive feedback for "scheduled" for the human-friendly repeat cycle time. Some people pointed out that the difference between the "planned" time and the observation time is so small that it won't have an effect on anything data analysis wise (that's my phrasing/understanding). It was also brought up that the difference between a scheduled time and an observation time, even for angle generation, isn't going to make a huge difference in generating pretty pictures.
If the SEVIRI files don't have the "planned" time in them then I say we go with "scheduled_start_time" and "observation_start_time" (and their "end" counterparts). You had mentioned @ameraner that you could at least sleep OK with that decision.
We still need to think about "filename" versus something else (@mraspaud any other ideas come up in the last couple days?). Another related piece of metadata we could consider including is the "repeat cycle" as a timedelta for the duration of the scheduled time, but I suppose that's redundant.
By orbital `slot` they meant the equatorial crossing time for the polar orbiters, or the sub-satellite longitude for GEO satellites? Never heard either referred to as `slot` before.
I tend to use `slot` to mean the longitude box that a GEO sat is assigned to. For the scanning I use `timeslot`.
@ameraner said this:
> The repeat cycle times are not there either, we always have to calculate them manually by rounding the observation times.

Off topic but I don't think that's totally correct. At least for the native L1.5 data there's the `TrueRepeatCycleStart` and `PlannedRepeatCycleEnd` attributes. For a random timeslot I just looked at these are listed as `datetime.datetime(2022, 2, 10, 8, 45, 9, 856897)` and `datetime.datetime(2022, 2, 10, 9, 0, 9, 696503)` respectively.
Regarding @djhoese's comments:
> Some people pointed out that the difference between the "planned" time and the observation time is so small that it won't have an effect on anything data analysis wise

That's true for the NOAA sats but might not be true for other sats. INSAT-3DR, for example, often has a minute or two gap between planned and actual times. Point being, if we have both planned and actual listed in the file, we should also include them in the attributes from the reader whenever possible.
On the substance of the discussion: I like `scheduled_{type}_time` but am a bit uncomfortable with `observation_{type}_time`. This implies that the sensor ceases observation at the `observation_end_time`, which isn't true for SEVIRI or AGRI. But it's a minor issue that's only relevant for a small subset of people, so if there's general agreement on `observation` then I won't make a fuss :)
> This implies that the sensor ceases observation at the observation_end_time,

Doesn't it? None of the image data in question was recorded/observed after `observation_end_time`.
I've started playing around with this idea with AHI HSD to see how it affected performance. Looks like using scheduled time for `start_time`, which affects the `sunz_corrected` modifier, kept memory about the same when generating a `true_color`, but it cut 50 seconds off the overall execution time.
Interestingly I also tried setting `generate=False` in the `Scene.load` call with these changes and it did as I expected and only generated the `coszen` once (one resampled resolution) and dropped ~2GB from the memory usage but overall execution time was the same.
This discussion will probably happen on slack (as it has been so far today), but the current suggestions are:

- `nominal_start/end_time`: Example, Human-friendly 15 minute SEVIRI interval
- `observation_start/end_time`: The actual time range that data was observed/recorded

Additionally @mraspaud started the discussion on putting these time ranges into a sub-dictionary similar to `orbital_parameters`, something like:
`data_arr.attrs['time_parameters']['observation_start_time']`
Anyone have a better name than `time_parameters`?
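As an illustration of the suggested layout, the attributes of one SEVIRI image might look like the plain dictionary below. The values are the example times from earlier in this thread on an arbitrary date, and the choice of which range feeds the generic `start_time`/`end_time` is an assumption of this sketch.

```python
from datetime import datetime

nominal_start = datetime(2022, 1, 31, 12, 0, 0)
nominal_end = datetime(2022, 1, 31, 12, 15, 0)

# Stand-in for data_arr.attrs; a real reader would fill this in
attrs = {
    # Generic times, here assumed to follow the nominal range
    "start_time": nominal_start,
    "end_time": nominal_end,
    "time_parameters": {
        "nominal_start_time": nominal_start,
        "nominal_end_time": nominal_end,
        "observation_start_time": datetime(2022, 1, 31, 12, 0, 13),
        "observation_end_time": datetime(2022, 1, 31, 12, 12, 43),
    },
}

time_params = attrs["time_parameters"]
scan_duration = time_params["observation_end_time"] - time_params["observation_start_time"]
print(scan_duration)
```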
`time_parameters` sub-dictionary. As I mentioned as my last comment in #2031, if anyone has an issue with this naming or design please speak sooner rather than later.
Feature Request
Is your feature request related to a problem? Please describe.
See #2010, #1461, #1384, and #473 for related discussions. In Satpy, we have standardized that readers should provide a `start_time` and optionally an `end_time` to define the time range of the data being loaded. However, this is often not the only piece of time information we have. For example, geostationary satellites which have a nominal schedule for their observations will have the time the data was supposed to be recorded and the time it actually was recorded. This has become an issue in things like AHI HRIT/HSD performance where the `start_time` was set to the observation time of the data and differed between each band. This results in things like solar and sensor angles being calculated separately for each band even though they represent the same "scene" of space/time. While it may be more accurate to use the observation time, better performance can be achieved if the scheduled/nominal time is used.
Describe the solution you'd like
I propose two things be added/changed to Satpy:
1. `start_time` and `end_time` as general definitions for the time range of the data; they will actually be one of the following time fields. Additionally I think we should have a `scheduled_start_time`, `scheduled_end_time`, `observation_start_time`, and `observation_end_time`. I believe we already have some non-defined standard for scan times? That would be another good parameter to allow. We could also have a `forecast_time` or a `model_time` for model data to distinguish when the processing was run/started and what time it is forecasting for (I don't deal with this data much so tell me when I'm being wrong). I think as part of this we should say that `start_time` should be equal to `scheduled_start_time` when possible for consistency across readers and better performance.
2. `satpy.config` parameter for telling angle generation what time field to use. A field like `angle_time_reference` which can be set to the metadata key name for the time to use. It would default to `start_time`, but could be set to `observation_start_time` or `scan_start_time`. Thinking about this more, if the angle generation was updated to handle a range of times (like interpolated between start and end) then maybe this key should be either `observation`, `scan`, or `scheduled`, but always default to `start_time`/`end_time`.

Describe any changes to existing user workflow
Anyone using the reader metadata of `start_time` and `end_time` may have slight differences (hopefully only small) in their calculations. Otherwise, the workflow should be unchanged except for those users who care about the accuracy of the time.
Additional context
Keyword arguments to the readers would be an option like in #1384, but I'm realizing now that this prevents all the information being provided to the user which is worse than choosing what information to provide.
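The second proposal could behave roughly as sketched here. A plain dictionary stands in for the configuration object, `angle_time_reference` is the key name proposed in this request and is not assumed to exist in Satpy, and `time_for_angles` is a hypothetical helper.

```python
from datetime import datetime

# Stand-in for a configuration object holding the proposed setting
config = {"angle_time_reference": "observation_start_time"}

# Stand-in for a dataset's metadata
attrs = {
    "start_time": datetime(2022, 1, 31, 12, 0, 0),
    "observation_start_time": datetime(2022, 1, 31, 12, 0, 20, 22000),
}


def time_for_angles(attrs, config):
    """Pick the time used for angle generation based on the config key."""
    key = config.get("angle_time_reference", "start_time")
    # Fall back to the generic start_time if the requested field is missing
    return attrs.get(key, attrs["start_time"])


precise = time_for_angles(attrs, config)
default = time_for_angles(attrs, {})
print(precise, default)
```

Defaulting to `start_time` keeps the time shared across bands, so angles only need to be computed once per scene unless the user opts into the more precise field.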