djhoese commented 2 years ago

Feature Request

Is your feature request related to a problem? Please describe. See #2010, #1461, #1384, and #473 for related discussions. In Satpy, we have standardized that readers should provide a start_time and optionally an end_time to define the time range of the data being loaded. However, this is often not the only piece of time information we have. For example, geostationary satellites that follow a nominal observation schedule have both the time the data was supposed to be recorded and the time it actually was recorded. This has become a performance issue in readers like AHI HRIT/HSD, where the start_time was set to the observation time of the data and differed between bands. As a result, things like solar and sensor angles are calculated separately for each band even though the bands represent the same "scene" of space/time. While it may be more accurate to use the observation time, better performance can be achieved if the scheduled/nominal time is used.
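To illustrate why per-band times hurt, here is a toy sketch. It assumes a cache keyed on the time value; `cos_zenith_for`, the band names, and the per-band times are all made up for the example and are not Satpy's actual angle code or caching.

```python
from datetime import datetime
from functools import lru_cache

calls = []  # records every time the "expensive" computation really runs


@lru_cache(maxsize=None)
def cos_zenith_for(time):
    """Hypothetical stand-in for an expensive angle computation keyed on time."""
    calls.append(time)
    return 0.5  # placeholder value; a real computation depends on lon/lat and time


# Made-up per-band observation start times versus one shared nominal time
observation_times = {
    "B01": datetime(2022, 1, 31, 12, 0, 20),
    "B02": datetime(2022, 1, 31, 12, 0, 21),
    "B03": datetime(2022, 1, 31, 12, 0, 22),
}
nominal_time = datetime(2022, 1, 31, 12, 0, 0)

# Each band has its own time, so nothing can be shared between bands
for band_time in observation_times.values():
    cos_zenith_for(band_time)
per_band_computations = len(calls)

# All bands share the nominal time, so the result is computed only once
calls.clear()
cos_zenith_for.cache_clear()
for _ in observation_times:
    cos_zenith_for(nominal_time)
shared_computations = len(calls)
```

With three bands, the per-band times trigger three computations while the shared nominal time triggers one.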

Describe the solution you'd like I propose two things be added/changed to Satpy:

A defined set of names for possible times that a reader could produce. We will continue to use start_time and end_time as general definitions for the time range of the data; each will actually be equal to one of the following time fields. Additionally, I think we should have a scheduled_start_time, scheduled_end_time, observation_start_time, and observation_end_time. I believe we already have some non-defined standard for scan times? That would be another good parameter to allow. We could also have a forecast_time or a model_time for model data to distinguish when the processing was run/started from the time it is forecasting for (I don't deal with this data much, so tell me if I'm wrong). I think as part of this we should say that start_time should be equal to observation_start_time when possible for consistency across readers and better performance.
The other thing is a satpy.config parameter for telling angle generation what time field to use. A field like angle_time_reference which can be set to the metadata key name for the time to use. It would default to start_time, but could be set to observation_start_time or scan_start_time. Thinking about this more, if the angle generation was updated to handle a range of times (like interpolated between start and end) then maybe this key should be either observation, scan, or scheduled, but always default to start_time/end_time.
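A rough sketch of how the second point could behave, assuming the proposed `angle_time_reference` key holds the name of a metadata field. A plain dict stands in for `satpy.config` here, and the `angle_reference_time` helper and the metadata values are hypothetical.

```python
from datetime import datetime

# Hypothetical reader metadata using the proposed field names
attrs = {
    "start_time": datetime(2022, 1, 31, 12, 0, 0),
    "end_time": datetime(2022, 1, 31, 12, 10, 0),
    "scheduled_start_time": datetime(2022, 1, 31, 12, 0, 0),
    "observation_start_time": datetime(2022, 1, 31, 12, 0, 20, 22000),
}

# Plain dict standing in for the proposed config setting (not the real satpy.config)
config = {"angle_time_reference": "observation_start_time"}


def angle_reference_time(attrs, config):
    """Return the time that angle generation should use.

    Defaults to ``start_time`` when the setting is absent or when the
    reader did not provide the requested field.
    """
    key = config.get("angle_time_reference", "start_time")
    return attrs.get(key, attrs["start_time"])


selected = angle_reference_time(attrs, config)
default = angle_reference_time(attrs, {})
missing = angle_reference_time(attrs, {"angle_time_reference": "scan_start_time"})
```

Falling back to `start_time` for a missing field is a design choice of this sketch; raising an error would be an equally valid option.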

Describe any changes to existing user workflow Anyone using the reader metadata of start_time and end_time may have slight differences (hopefully only small) in their calculations. Otherwise, the workflow should be unchanged except for those users who care about the accuracy of the time.

Additional context Keyword arguments to the readers would be an option like in #1384, but I'm realizing now that this prevents all the information being provided to the user which is worse than choosing what information to provide.

simonrp84 commented 2 years ago

How about:
- file_start_time and file_end_time for those datasets with filenames containing these times. For Himawari we'd have 2022-01-31 12:00:00 as file_start_time and no file_end_time, for example.
- scheduled_start_time and scheduled_end_time for datasets that have these defined (usually, but not always, the same as the previous times I suspect). For Himawari these would be 2022-01-31 12:00:00 and 2022-01-31 12:10:00 for the start and end times respectively.
- actual_start_time and actual_end_time for the times that scanning actually began and ended. For Himawari these would be something like 2022-01-31 12:00:20.022 and 2022-01-31 12:09:54.303.
- We could then have other options such as forecast_time as you suggest.
I like this idea. My suggestion would be to default to the file_start_time or scheduled_start_time rather than actual_start_time as this is more likely to be consistent across all datasets in a composite. If we're making changes here I'd also suggest, as you hint at, that we use an average of the start and end times for calculating the solar angles. Using the start time alone will produce some non-negligible differences for the long scanning sensors (full orbits, or the older GEOs with 30 min scans).
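The averaging of start and end times could look like the sketch below, using the Himawari example times from the list above. The `mid_time` helper is a hypothetical name, not an existing Satpy function.

```python
from datetime import datetime


def mid_time(start, end):
    """Return the midpoint between two datetimes."""
    return start + (end - start) / 2


# Example actual scan times for Himawari, taken from the list above
actual_start_time = datetime(2022, 1, 31, 12, 0, 20, 22000)
actual_end_time = datetime(2022, 1, 31, 12, 9, 54, 303000)

middle = mid_time(actual_start_time, actual_end_time)
print(middle)  # 2022-01-31 12:05:07.162500
```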

djhoese commented 2 years ago

I like the idea of a file time, but we may have to be clearer and say filename_start_time. Otherwise people will think it is the time that is in the file. I think we've generally fallen into the practice of end times defaulting to the start time when end is not available. I don't really like "actual" too much, but I also don't know what is common for data producers to use in their files. I've always thought "observation" sounded more accurate and descriptive. Lastly, I think in a lot of readers we could assume that filename_start_time is scheduled_start_time, but then again readers don't have to define all of these things.
I would definitely have it default to the scheduled time. We could document a preference for it, but I think that would be up to the individual satpy components that use the data (rayleigh could do something different than sunz correction).

Regarding averaging, I should point out that PR #473 does averaging in the combine_metadata function for any datetime fields. This is used for multi-segment data formats like AHI HSD. I know you're talking about averaging start/end to get a nominal observation time for a scene but still wanted to mention it.
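For reference, averaging a set of datetime fields across segments can be sketched as below. This is only an illustration of the idea and not the code from the PR; `average_datetimes` and the segment times are made up.

```python
from datetime import datetime, timedelta


def average_datetimes(times):
    """Return the mean of a collection of datetimes.

    Datetimes cannot be summed directly, so the offsets from the earliest
    time are averaged and added back onto it.
    """
    times = list(times)
    reference = min(times)
    total_offset = sum((t - reference for t in times), timedelta(0))
    return reference + total_offset / len(times)


# Made-up start times for three segments of one image
segment_start_times = [
    datetime(2022, 1, 31, 12, 0, 20),
    datetime(2022, 1, 31, 12, 1, 20),
    datetime(2022, 1, 31, 12, 2, 20),
]
mean_start = average_datetimes(segment_start_times)
```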

sjoro commented 2 years ago

in order to cover the potential use cases i give my thumbs up for various suggested start and end times. i'm ok with the suggested naming of filename_start_time, observation_start_time (which would be redundant to start_time, right?), and scheduled_start_time. also, observation_start_time sounds more descriptive/accurate than actual_start_time.

djhoese commented 2 years ago

start_time is the generic "the user needs any time to refer to this data, it doesn't matter what one" value. It could be any of the above mentioned times and at any accuracy BUT it should be "scheduled" time when available. Or put another way, start_time should be whatever time is most consistent between like-datasets from the same sensor/processing suite.

sfinkens commented 2 years ago

Good initiative, thanks for starting this! I like the proposed triplet (scheduled_start_time, filename_start_time, observation_start_time) and a generic start_time

mraspaud commented 2 years ago

Just a note here: if we start using cloud data, filename might become obsolete... Overall I agree with the principle. I would just like to make sure that when a coarser time is used for eg angle computation that the user is clearly notified of it, with an explanation/reference/link on how to change that setting.

ameraner commented 2 years ago

So in the (scheduled_start/end_time, filename_start/end_time, observation_start/end_time) model, how are we covering the "repeat cycle" start and end times? E.g. for SEVIRI 12:00:00-12:15:00 against 12:00:13-12:12:43 of the scanning times. From the discussion above, I understand that those would be the scheduled times?

If yes, maybe scheduled is not the right word, as it could be interpreted as "the time of the scheduled/planned acquisition, that differs from the real acquisition because of some (small) instrument/scheduling delays". For example in the SEVIRI header, we actually have "planned_forward_scan_end" times defined, that differ only by some milliseconds from the true observation end times. Would these times then be the scheduled_end_time? In that case, we don't have a time representing e.g. 12:15 anymore. In other words, the mismatch between e.g. SEVIRI's and FCI's scanning end times and repeat cycles end times are not due to "scheduling", but are caused by the instruments needing some time to retrace the scanner back to the start position (plus calibration operations).

So, in my opinion, we need to have a repeat_cycle/slot time (maybe slot works best for both GEO and LEO), either replacing or alongside scheduled, to make it unequivocal.

djhoese commented 2 years ago

@mraspaud Good point. I'm not sure if there is a good name otherwise. Like uri_start_time still assumes there is something in a URL or Unique Identifier that represents the time. I suppose in the case of a URL API there wouldn't be any equivalent to a filename time. I guess a time in the query, but I'd almost argue if that was somehow included in the result that it should be its own thing (ex. query_start_time).

@ameraner You'll have to forgive me, but I'm not super familiar with the difference in the terms you're using. In the SEVIRI or FCI case, if we simplify it and say the instrument should start scanning a new image every 15 minutes, I would say the scheduled_start_time would be 12:03:00 and scheduled_end_time be 12:18:00 (example times to show that it doesn't have to be at the start of the hour). However, I can see an argument for the scheduled_end_time being set to 12:16:00 if the actual physical scanning takes less than 15 minutes to complete. Or is this the discussion that we're having? The "slot" of time that the instrument has to make the observation versus the time range it actually takes to record that observation? Which one is "repeat_cycle"/"repeat_slot"?

ameraner commented 2 years ago

Or is this the discussion that we're having? The "slot" of time that the instrument has to make the observation versus the time range it actually takes to record that observation?

Exactly. Those two times can differ by several minutes. In the SEVIRI world, for example, most users would consider one SEVIRI image to be "valid" for 15 minutes (the time period that passes between the repetitions of the acquisition, hence commonly called "repeat cycle", or "slot" in my argumentation), even though the actual scanning takes only 12 minutes. So for 3 minutes there is an acquisition gap, where the instrument is not recording anything, while preparing to start the next acquisition.

Any visualisation tool showing SEVIRI images would keep one SEVIRI image on screen for 15 minutes, disregarding the 3-minute acquisition gap. This is what Sauli was referring to on Slack, talking about SIFT:

The newly developed TimelineManager in SIFT would need to know which FCI/SEVIRI data to show in the background of frequently updating LI data (every 30s or a minute). If we use the actual end_time there will be a gap between 12:13 and 12:15 as there is no "valid" data for this time period and no background data would be displayed.

--> hence the need for Satpy to provide these gapless slot start and end times.

On the other hand, any user e.g. wanting to compare precisely SEVIRI against another instrument, would only consider the actual scanning period - hence the need to provide the observation times.

djhoese commented 2 years ago

Now I'm confused about what we're confused about. In this SEVIRI discussion there are 2 time ranges, right? We have the "scheduled" or "repeat cycle" time, which for SEVIRI would be 15 minutes and would include "nominal" times, where by nominal I mean the timing the instrument was trying to meet (within these 15 minutes). We also have the "observation" time, where the data actually represents the Earth between the first/start time and the second/end time.

So @ameraner you originally said:

From the discussion above, I understand that those would be the scheduled times?

If yes, maybe scheduled is not the right word, as it could be interpreted as "the time of the scheduled/planned acquisition, that differs from the real acquisition because of some (small) instrument/scheduling delays".

My answer is yes, that's exactly what scheduled time is. The observation time is this acquisition time.

Or...are you saying there is the scheduled time (the "pretty" human-readable time 12:03:00), the repeat cycle time (12:03:15.444 to 12:18:15.444), and the observation time (12:03:33.677 to 12:15:24.323)? Where repeat cycle time is the overall 15 minute time slot of the observation but observation time is the actual time range that data was being recorded (shifted from slot time because of additional hardware movement or calibration).

ameraner commented 2 years ago

I feel like we're slowly converging on the concepts, with still some nomenclature misunderstandings.

Maybe with a full example what I mean becomes more clear: In the SEVIRI case, we end up with 3 times

12:00:00.000-12:15:00.000 -> "repeat cycle" time or "slot" time: the time period allocated for this acquisition, marks the total validity period of the data
12:00:13.000-12:12:43.102 -> "scheduled" time, as synonym for planned time, the foreseen acquisition/scanning period in the satellite planning by the data provider
12:00:13.000-12:12:43.145 -> "observation" time, the actual recorded scanning time. Differs slightly from the scheduled time due to an instrument delay

the next image could have

12:15:00.000-12:30:00.000 ->"repeat cycle" time or "slot" time: note: no gap between this start time and the end time of the previous image
12:15:12.100-12:27:41.223 -> "scheduled" time
12:15:12.103-12:27:41.225 -> "observation" time
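A small sketch of what these two images imply for a tool that needs continuous coverage. The date is arbitrary since only times of day are given above, and the dict layout and `gap_between` helper are made up for illustration.

```python
from datetime import datetime, timedelta

# The two consecutive images from the example above (arbitrary date)
first = {
    "slot": (datetime(2022, 1, 31, 12, 0, 0), datetime(2022, 1, 31, 12, 15, 0)),
    "observation": (
        datetime(2022, 1, 31, 12, 0, 13),
        datetime(2022, 1, 31, 12, 12, 43, 145000),
    ),
}
second = {
    "slot": (datetime(2022, 1, 31, 12, 15, 0), datetime(2022, 1, 31, 12, 30, 0)),
    "observation": (
        datetime(2022, 1, 31, 12, 15, 12, 103000),
        datetime(2022, 1, 31, 12, 27, 41, 225000),
    ),
}


def gap_between(previous, following, kind) -> timedelta:
    """Time between the end of one image and the start of the next."""
    return following[kind][0] - previous[kind][1]


slot_gap = gap_between(first, second, "slot")
observation_gap = gap_between(first, second, "observation")
print(slot_gap)         # 0:00:00
print(observation_gap)  # 0:02:28.958000
```

The slot times tile the timeline without gaps, while the observation times leave roughly two and a half minutes uncovered between images.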

So the only pretty time is the slot time, as it describes the pre-defined time frame for each acquisition cycle. The observation time is the one of scientific interest. The scheduled time, defined as the time of the planned acquisition by the data provider, is probably not of much use, as it's only related to the data provider operations.

In the first comments, I think, what I call here "slot" time was being referred to as "scheduled". My whole point is that the word "scheduled" can be misunderstood to refer to the (quite useless) planning time as described above. Using slot instead signalises that we are talking about a larger, predefined, abstract time period that contains the observation.

Maybe I'm overthinking this as I'm too influenced by the specific nomenclature and timing information of SEVIRI. If this still doesn't make sense I'll be able to sleep ok also with "scheduled" 😄

djhoese commented 2 years ago

@ameraner Makes sense. I agree that the "planned time" is not something most of us probably deal with and was not something I had considered. I'm going to ask around among the GOES-R folks and see what other terminology is used for these kinds of things. Maybe we can land on something that is clearer than "scheduled" time.

One last question, I assume that all SEVIRI file formats provide all 3 time ranges you've talked about?

ameraner commented 2 years ago

The observation start/end times and the scheduled end time are explicitly in the files. The planned start time is not given (because by the time the file is created, the "true" observation start time is known). The repeat cycle times are not there either, we always have to calculate them manually by rounding the observation times.
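The manual rounding could be sketched as follows. It assumes the observation start is rounded down to the repeat cycle and that repeat cycles are aligned to midnight; `nominal_time_range` is a hypothetical helper and the 15-minute default matches the SEVIRI example in this thread.

```python
from datetime import datetime, timedelta


def nominal_time_range(observation_start, repeat_cycle=timedelta(minutes=15)):
    """Round an observation start time down to its repeat cycle.

    Assumes repeat cycles are aligned to midnight of the same day.
    Returns the (start, end) of the repeat cycle containing the observation.
    """
    midnight = observation_start.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = observation_start - midnight
    # timedelta // timedelta gives the number of whole cycles since midnight
    nominal_start = midnight + (elapsed // repeat_cycle) * repeat_cycle
    return nominal_start, nominal_start + repeat_cycle


start, end = nominal_time_range(datetime(2022, 1, 31, 12, 15, 12, 103000))
print(start, end)  # 2022-01-31 12:15:00 2022-01-31 12:30:00
```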

pnuu commented 2 years ago

The repeat cycle start_time is in the HRIT filenames, though.

djhoese commented 2 years ago

So I brought this up with some NOAA folks and some other people at SSEC and the general feeling was that "slot" by itself is a confusing word since people tend to think of orbital slots of satellites. I got generally positive feedback for "scheduled" for the human-friendly repeat cycle time. Some people pointed out that the difference between the "planned" time and the observation time is so small that it won't have an effect on anything data analysis wise (that's my phrasing/understanding). It was also brought up that the difference between a scheduled time and an observation time, even for angle generation, isn't going to make a huge difference in generating pretty pictures.

If the SEVIRI files don't have the "planned" time in them then I say we go with "scheduled_start_time" and "observation_start_time" (and their "end" counterparts). You had mentioned @ameraner that you could at least sleep OK with that decision.

We still need to think about "filename" versus something else (@mraspaud any other ideas come up in the last couple days?). Another related piece of metadata we could consider including is the "repeat cycle" as a timedelta for the duration of the scheduled time, but I suppose that's redundant.

pnuu commented 2 years ago

By orbital slot they meant the equatorial crossing time for the polar orbiters, or the sub-satellite longitude for GEO satellites? I've never heard either referred to as a slot before.

simonrp84 commented 2 years ago

I tend to use slot to mean the longitude box that a GEO sat is assigned to. For the scanning I use timeslot.

@ameraner said this:

The repeat cycle times are not there either, we always have to calculate them manually by rounding the observation times.

Off topic but I don't think that's totally correct. At least for the native L1.5 data there's the TrueRepeatCycleStart and PlannedRepeatCycleEnd attributes. For a random timeslot I just looked at these are listed as datetime.datetime(2022, 2, 10, 8, 45, 9, 856897) and datetime.datetime(2022, 2, 10, 9, 0, 9, 696503) respectively.
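A quick bit of arithmetic on those two values. The variable names are made up and only mirror the header attribute names; the comparison against 08:45:00 assumes that is the "pretty" time for this timeslot.

```python
from datetime import datetime, timedelta

# Values quoted above for one native L1.5 timeslot
true_repeat_cycle_start = datetime(2022, 2, 10, 8, 45, 9, 856897)
planned_repeat_cycle_end = datetime(2022, 2, 10, 9, 0, 9, 696503)

# Length of the repeat cycle according to the header
duration = planned_repeat_cycle_end - true_repeat_cycle_start
# Offset of the header start time from the quarter-hour boundary
offset_from_quarter_hour = true_repeat_cycle_start - datetime(2022, 2, 10, 8, 45)

print(duration)                  # 0:14:59.839606
print(offset_from_quarter_hour)  # 0:00:09.856897
```

So the header values span just under 15 minutes but sit about 10 seconds after the quarter hour, which means they would still need rounding to give the human-friendly times.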

Regarding @djhoese's comments:

Some people pointed out that the difference between the "planned" time and the observation time is so small that it won't have an effect on anything data analysis wise

That's true for the NOAA sats but might not be true for other sats. INSAT-3DR, for example, often has a minute or two gap between planned and actual times. Point being, if we have both planned and actual listed in the file, we should also include them in the attributes from the reader whenever possible.

On the substance of the discussion: I like scheduled_{type}_time but am a bit uncomfortable with observation_{type}_time. This implies that the sensor ceases observation at the observation_end_time, which isn't true for SEVIRI or AGRI. But it's a minor issue that's only relevant for a small subset of people, so if there's general agreement on observation then I won't make a fuss :)

djhoese commented 2 years ago

This implies that the sensor ceases observation at the observation_end_time,

Doesn't it? None of the image data in question was recorded/observed after observation_end_time.

djhoese commented 2 years ago

I've started playing around with this idea with AHI HSD to see how it affected performance. It looks like using the scheduled time for start_time, which affects the sunz_corrected modifier, kept memory about the same when generating a true_color, but it cut 50 seconds off the overall execution time.

djhoese commented 2 years ago

Interestingly I also tried setting generate=False in the Scene.load call with these changes and it did as I expected and only generated the coszen once (one resampled resolution) and dropped ~2GB from the memory usage but overall execution time was the same.

djhoese commented 2 years ago

This discussion will probably happen on slack (as it has been so far today), but the current suggestions are:

nominal_start/end_time: Example, Human-friendly 15 minute SEVIRI interval
planned/scheduled time: Ignore this for now until someone actually wants to use it.
observation_start/end_time: The actual time range that data was observed/recorded

Additionally @mraspaud started the discussion on putting these time ranges into a sub-dictionary similar to orbital_parameters, something like:

data_arr.attrs['time_parameters']['observation_start_time']

Anyone have a better name than time_parameters?

djhoese commented 2 years ago

#2031 has been merged. It uses the nominal and observation names and puts them in a `time_parameters` sub-dictionary. As I mentioned as my last comment in #2031, if anyone has an issue with this naming or design please speak sooner rather than later.
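As a rough sketch of what reading this layout might look like: a plain dict stands in for `data_arr.attrs`, the times reuse the SEVIRI example from earlier in the thread with an arbitrary date, and `get_time` is a hypothetical helper rather than part of Satpy.

```python
from datetime import datetime

# Plain dict standing in for data_arr.attrs
attrs = {
    "start_time": datetime(2022, 1, 31, 12, 0, 0),
    "end_time": datetime(2022, 1, 31, 12, 15, 0),
    "time_parameters": {
        "nominal_start_time": datetime(2022, 1, 31, 12, 0, 0),
        "nominal_end_time": datetime(2022, 1, 31, 12, 15, 0),
        "observation_start_time": datetime(2022, 1, 31, 12, 0, 13),
        "observation_end_time": datetime(2022, 1, 31, 12, 12, 43, 145000),
    },
}


def get_time(attrs, name, fallback="start_time"):
    """Look up a time in the sub-dictionary, falling back to a generic field.

    Readers are not required to fill in every time, so a missing
    ``time_parameters`` dict or a missing key falls back to ``attrs[fallback]``.
    """
    time_parameters = attrs.get("time_parameters", {})
    if name in time_parameters:
        return time_parameters[name]
    return attrs[fallback]


observation_start = get_time(attrs, "observation_start_time")
generic_only = get_time({"start_time": datetime(2022, 1, 31, 12, 0, 0)}, "nominal_start_time")
```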

pytroll / satpy

Define time metadata options and usage #2012
