Add a target_width parameter to keras.utils.timeseries_dataset_from_array

To3No7 commented 1 year ago

Feature request:

It would be nice if keras.utils.timeseries_dataset_from_array had a target_width parameter (a.k.a. label_width) that allowed the target to be a sequence longer than the single timestep it currently assumes. Compare the WindowGenerator class in https://www.tensorflow.org/tutorials/structured_data/time_series which has a label_width.

This would simplify the code for the case where we want to generate all intermediate timesteps when predicting with a target shift, like predicting the weather at hour 48 from the sequence between hour 0 and 24. That is, I want the target to be the full sequence 25-48 instead of just 48.
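
To make the requested windowing concrete, here is a minimal plain-Python sketch of which rows would go into each input and label window. The helper name `window_indices` and its parameters are hypothetical and only mirror the names used in this issue; this is not how Keras implements anything. Rows are zero-based, so "hours 1 to 24" become rows 0 to 23 and "hours 25 to 48" become rows 24 to 47. The sketch also assumes label_width <= shift.

```python
def window_indices(n_rows, input_width=24, label_width=24, shift=24):
    """Return (input_range, label_range) pairs of zero-based row ranges.

    Hypothetical helper for illustration only. The label window ends
    `shift` rows after the input window ends and covers the last
    `label_width` rows before that point.
    """
    if label_width > shift:
        raise ValueError("this sketch assumes label_width <= shift")
    total = input_width + shift  # rows spanned by one (input, label) sample
    pairs = []
    for start in range(n_rows - total + 1):
        input_range = range(start, start + input_width)
        label_end = start + total
        label_range = range(label_end - label_width, label_end)
        pairs.append((input_range, label_range))
    return pairs


# One sample fits exactly into 48 rows: inputs are rows 0..23, labels rows 24..47.
first_input, first_label = window_indices(48)[0]
print(list(first_input)[:3], list(first_label)[:3])
```

With label_width=1 the same call yields only the last row (row 47) as the label, which is the single-step case that a single call to timeseries_dataset_from_array already covers.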

As it is now, I have to make two calls to timeseries_dataset_from_array, and therefore I lose the shuffle functionality, like this:

import tensorflow as tf

def datasetgen(dataframe, input_width=24, label_width=1, shift=1, batch_size=128,
               label_columns=None, start_index=None, end_index=None, shuffle=False):
  """
  Generate timeseries dataset from the given dataframe containing a data sequence.

  Parameters:
  - dataframe: The source time sequence dataframe.
  - input_width: Number of time steps in each input sequence.
  - label_width: Number of time steps in each label sequence.
  - shift: How many steps to shift the end of the input to get the label.
  - batch_size: Size of batches to generate.
  - label_columns: List of column names to extract as labels. If None, all columns are used.
  - start_index: Start index from the dataframe to consider data. Default is the start of the dataframe.
  - end_index: End index from the dataframe to consider data. Default is the end of the dataframe.
  - shuffle: Whether to shuffle the generated batches. Note: shuffling won't work in the current implementation.

  Returns:
  - A TensorFlow Dataset containing input and label sequences.
  """

  # If end index or start index is not given, assign them to the end or start of the dataframe respectively.
  if end_index is None:
      end_index = len(dataframe) - 1
  if start_index is None:
      start_index = 0

  # Generate an input timeseries dataset from the dataframe using keras.utils.timeseries_dataset_from_array
  input_ds = tf.keras.utils.timeseries_dataset_from_array(
                  dataframe, targets=None, sequence_length=input_width,
                  sequence_stride=1, sampling_rate=1, batch_size=batch_size, shuffle=shuffle,
                  start_index=start_index, end_index=(end_index-(input_width+shift-1)))

  # Fetch the indices of the label columns from the dataframe.
  label_columns_indices = get_label_columns_indices(dataframe,label_columns) # get the selected columns
  targetsdf = dataframe[list(label_columns_indices)]

  # Generate a timeseries dataset of label sequences from the dataframe.
  # Here we assume that label_width should be less than or equal to shift.
  target_ds = tf.keras.utils.timeseries_dataset_from_array(
                  targetsdf, targets=None, sequence_length=label_width,
                  sequence_stride=1, sampling_rate=1, batch_size=batch_size, shuffle=shuffle,
                  start_index=(start_index+(input_width+shift-label_width)),
                  end_index=end_index-input_width+1)

  # Combine input and target datasets to form a single dataset.
  # Passing the datasets as one tuple also works on older TensorFlow versions.
  train_ds = tf.data.Dataset.zip((input_ds, target_ds))

  return train_ds
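
The reason shuffle cannot safely be used in the workaround above appears to be that each of the two calls would shuffle its own windows independently, so the zipped pairs would no longer correspond. One way around this is to pair inputs with labels first and shuffle afterwards. The sketch below shows the idea in plain Python rather than tf.data; `paired_batches` is a hypothetical helper, not part of Keras or TensorFlow.

```python
import random


def paired_batches(inputs, labels, batch_size, seed=None):
    """Pair each input window with its label, shuffle the pairs, then batch.

    Hypothetical helper for illustration only.
    """
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same length")
    pairs = list(zip(inputs, labels))    # pair first, so alignment is fixed
    random.Random(seed).shuffle(pairs)   # one shuffle moves both halves together
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


# Toy data: the label of input x is x + 100, so alignment is easy to check.
inputs = list(range(10))
labels = [x + 100 for x in inputs]
for batch in paired_batches(inputs, labels, batch_size=4, seed=0):
    assert all(label == x + 100 for x, label in batch)
```

With tf.data, the analogous step would be to leave shuffle=False in both calls and call shuffle on the zipped dataset. Since the zipped elements are already batches at that point, this would shuffle whole batches rather than individual windows.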

tilakrayal commented 1 year ago

@To3No7, Thank you for the issue. Could you please provide a specific use case for the above feature, which might help us analyse the issue? Thank you!

To3No7 commented 1 year ago

One part of a deep learning course I teach is time series forecasting with RNNs. We work with the Jena weather data, partly following the TensorFlow example https://www.tensorflow.org/tutorials/structured_data/time_series.

Part of this exercise is to:

Make two different models that should both predict what the temperature will be 24 hours ahead based on the last 24 hours' readings. That is, if your input sequence e.g. runs from hour 1 through hour 24, then you should predict what the temperature is at hour 48. Do this using these two methods:
- Part 2.1. Direct prediction of the value you are looking for (single-step with a new target time-step). That is, use 24 hours of input values to predict only one output value, 24 hours ahead (hour 48).
- Part 2.2. Make a prediction of all intermediate values as well. That is, again use 24 hours of input values, but predict all values from hour 25 through hour 48. Your output will now be a 24-value long vector at each prediction (Single shot prediction).
Note that for these two models we are only interested in the quality of the prediction at hour 48, so you need to find a way to measure the performance for this particular hour in order to compare the models against model 2.1 and among themselves. Compare these two models and analyze the difference in results, e.g. which gives the best prediction?
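
One possible way to measure performance at hour 48 only is to score just the final timestep of each predicted sequence. The sketch below uses mean absolute error, which is an assumption (the exercise text does not name a metric), and `mae_at_last_step` is a hypothetical helper. It takes plain lists of sequences, so it applies equally to the length-1 outputs of model 2.1 and the length-24 outputs of model 2.2.

```python
def mae_at_last_step(y_true, y_pred):
    """Mean absolute error computed on the final timestep of each sequence.

    Hypothetical helper for illustration only. `y_true` and `y_pred` are
    equally long lists of non-empty sequences of numbers.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    errors = [abs(t[-1] - p[-1]) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Multi-step predictions: only the last value of each sequence is scored.
y_true = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y_pred = [[0.0, 0.0, 2.0], [0.0, 0.0, 8.0]]
print(mae_at_last_step(y_true, y_pred))
```

Because only the last element of each sequence is used, both models end up being compared on the same quantity.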

Thus, to do Part 2.2 the students need a target_width of 24 instead of 1.

Some years back I suggested that the students use the windowing function from the TensorFlow tutorial, but this spring I suggested they use the Keras function timeseries_dataset_from_array instead, as that would be a cleaner solution. However, as timeseries_dataset_from_array doesn't have a target_width, almost no student was able to implement a correct datasetgen for Part 2.2, and I had to quickly hack together a solution for them to use (as seen above).

Is this clear enough?

I might be able to provide solution code, but this has to be done privately for obvious reasons.

Zekrom-7780 commented 1 year ago

@sachinprasadhs @qlzh727 @tilakrayal I'll pick this one up

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.

keras-team / tf-keras

Add a target_width parameter to keras.utils.timeseries_dataset_from_array #7