keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.
Apache License 2.0

Add a target_width parameter to keras.utils.timeseries_dataset_from_array #7

Open To3No7 opened 1 year ago

To3No7 commented 1 year ago

Feature request:

It would be nice if keras.utils.timeseries_dataset_from_array had a target_width parameter (a.k.a. label_width) that allowed the target to be a sequence longer than the single timestep it currently assumes. Compare the WindowGenerator class in https://www.tensorflow.org/tutorials/structured_data/time_series, which has a label_width.

This would simplify the code for the case where we want to generate all intermediate timesteps when predicting with a target shift, such as predicting the weather at hour 48 from the sequence between hour 0 and 24. That is, I want the target to be the full sequence 25-48 instead of just 48.
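To make the requested window layout concrete, here is a minimal sketch in plain Python. The helper name `window_indices` is hypothetical, and the index arithmetic (total window = input_width + shift, labels taken from the end of that window) is an assumption borrowed from the WindowGenerator convention in the tutorial linked above. Positions are zero-based, so the hour ranges in the text map to 0-23 for the input and 24-47 for the labels.

```python
def window_indices(input_width, label_width, shift, start=0):
    """Return (input_indices, label_indices) for one window beginning at `start`.

    Hypothetical helper for illustration only; it is not part of Keras.
    """
    total_window_size = input_width + shift
    # Labels occupy the last `label_width` positions of the total window.
    label_start = total_window_size - label_width
    input_indices = list(range(start, start + input_width))
    label_indices = list(range(start + label_start, start + total_window_size))
    return input_indices, label_indices


# Requested behaviour: 24 input steps, 24 label steps, shifted by 24.
inputs, labels = window_indices(input_width=24, label_width=24, shift=24)
print(inputs[0], inputs[-1])   # 0 23
print(labels[0], labels[-1])   # 24 47

# Single-step target: only the final timestep is a label.
_, last_only = window_indices(input_width=24, label_width=1, shift=24)
print(last_only)               # [47]
```

With label_width=24 the label covers every intermediate step of the forecast horizon, while label_width=1 yields only the last one.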

As it is now, I have to make two calls to timeseries_dataset_from_array, and I therefore lose the shuffle functionality, like this:

import tensorflow as tf


def datasetgen(dataframe, input_width=24, label_width=1, shift=1, batch_size=128,
               label_columns=None, start_index=None, end_index=None, shuffle=False):
    """
    Generate a timeseries dataset from the given dataframe containing a data sequence.

    Parameters:
    - dataframe: The source time sequence dataframe.
    - input_width: Number of time steps in each input sequence.
    - label_width: Number of time steps in each label sequence.
    - shift: How many steps to shift the end of the input to get the label.
    - batch_size: Size of batches to generate.
    - label_columns: List of column names to extract as labels. If None, all columns are used.
    - start_index: Start index from the dataframe to consider data. Default is the start of the dataframe.
    - end_index: End index from the dataframe to consider data. Default is the end of the dataframe.
    - shuffle: Whether to shuffle the generated batches. Note: shuffling won't work in the
      current implementation, because the input and label datasets are built by two
      separate calls and would be shuffled independently.

    Returns:
    - A TensorFlow Dataset containing input and label sequences.
    """

    # If end index or start index is not given, assign them to the end or start
    # of the dataframe respectively.
    if end_index is None:
        end_index = len(dataframe) - 1
    if start_index is None:
        start_index = 0

    # Generate an input timeseries dataset from the dataframe using
    # keras.utils.timeseries_dataset_from_array.
    input_ds = tf.keras.utils.timeseries_dataset_from_array(
        dataframe, targets=None, sequence_length=input_width,
        sequence_stride=1, sampling_rate=1, batch_size=batch_size, shuffle=shuffle,
        start_index=start_index, end_index=(end_index - (input_width + shift - 1)))

    # Fetch the indices of the label columns from the dataframe.
    # get_label_columns_indices is a helper that is not shown in this post.
    label_columns_indices = get_label_columns_indices(dataframe, label_columns)
    targetsdf = dataframe[list(label_columns_indices)]

    # Generate a timeseries dataset of label sequences from the dataframe.
    # Here we assume that label_width is less than or equal to shift.
    target_ds = tf.keras.utils.timeseries_dataset_from_array(
        targetsdf, targets=None, sequence_length=label_width,
        sequence_stride=1, sampling_rate=1, batch_size=batch_size, shuffle=shuffle,
        start_index=(start_index + (input_width + shift - label_width)),
        end_index=(end_index - input_width + 1))

    # Combine input and target datasets to form a single dataset.
    # The datasets are passed as one tuple, the form Dataset.zip accepts.
    train_ds = tf.data.Dataset.zip((input_ds, target_ds))

    return train_ds
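
The index arithmetic in datasetgen can be checked without TensorFlow. The sketch below is only an emulation: it assumes that timeseries_dataset_from_array takes stride-1 windows from data[start_index:end_index], which is my reading of the documented behaviour, and the helpers `sliding_windows` and `paired_windows` are hypothetical names introduced for illustration. It uses a toy series in which each value equals its own position, so the printed windows are also the row indices.

```python
def sliding_windows(data, sequence_length, start_index, end_index):
    """Emulate stride-1 windows over data[start_index:end_index] (assumed semantics)."""
    segment = data[start_index:end_index]
    count = len(segment) - sequence_length + 1
    return [segment[i:i + sequence_length] for i in range(max(count, 0))]


def paired_windows(data, input_width, label_width, shift):
    """Apply the same start/end index arithmetic as datasetgen to a plain list."""
    start_index, end_index = 0, len(data) - 1
    inputs = sliding_windows(
        data, input_width, start_index, end_index - (input_width + shift - 1))
    labels = sliding_windows(
        data, label_width,
        start_index + (input_width + shift - label_width),
        end_index - input_width + 1)
    return list(zip(inputs, labels))


series = list(range(20))  # toy data: each value equals its own position
pairs = paired_windows(series, input_width=4, label_width=2, shift=3)

print(len(pairs))   # 10
print(pairs[0])     # ([0, 1, 2, 3], [5, 6])
print(pairs[-1])    # ([9, 10, 11, 12], [14, 15])
```

In this toy run both calls yield the same number of windows, and every label window ends exactly `shift` steps after its input window, so zipping the two unshuffled datasets pairs them correctly. The last pair stops at position 15 of 0..19, which suggests, under the same assumption, that a few trailing rows are left unused.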

tilakrayal commented 1 year ago

@To3No7, thank you for raising the issue. Could you please provide a specific use case for the above feature? That would help us analyse the issue. Thank you!

To3No7 commented 1 year ago

I teach a course in deep learning, and one part of it is time series forecasting with RNNs. We work with the Jena weather data, following in part the TensorFlow example https://www.tensorflow.org/tutorials/structured_data/time_series.

Part of this exercise is to:

Thus, to do Part 2.2, the students need a target_width of 24 instead of 1.

Some years back I suggested that the students use the windowing function from the TensorFlow tutorial, but this spring I suggested they use the Keras function timeseries_dataset_from_array instead, as that would be a cleaner solution. However, because timeseries_dataset_from_array has no target_width, almost no student was able to implement a correct datasetgen for Part 2.2, and I had to quickly hack together a solution for them to use (as seen above).

Is this clear enough?

I might be able to provide solution code, but this would have to be done privately for obvious reasons.

Zekrom-7780 commented 1 year ago

@sachinprasadhs @qlzh727 @tilakrayal I'll pick this one up

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.