Closed ntakouris closed 2 years ago
Update: It seems that TF 2.3.0 added `tf.keras.preprocessing.timeseries_dataset_from_array`, with some other handy features like sampling rates, etc.
This is still no help for input pipelines split across multiple TFRecord files. For a big time series stored in multiple TFRecord files you'd need to merge `partN[window:]` with `partN+1[:window]` at each shard boundary to avoid losing data, if you are using the preprocessing pipeline I provided below (which makes the `window + 1` index the label). The other problems stated still exist.
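For reference, a minimal sketch of that utility on toy data (in newer TF versions it lives under `tf.keras.utils`; the `window`-based input/target split is my own illustration of pairing each window with the next-step label):

```python
import numpy as np
import tensorflow as tf

series = np.arange(10, dtype=np.float32)  # toy univariate series: 0..9
window = 3

# Each input is a length-3 window; each target is the value right after it.
ds = tf.keras.utils.timeseries_dataset_from_array(
    data=series[:-1],          # inputs never need the final value
    targets=series[window:],   # target i aligns with the window starting at i
    sequence_length=window,
    batch_size=2,
)

x, y = next(iter(ds))
# x[0] == [0., 1., 2.], y[0] == 3.
```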
Since this Issue is entitled "Improving time series forecasting and similar tasks", I'd like to mention a problem we encountered while adapting the TFX pipeline for a time-series problem: as described here, there are no guarantees that the tf.Examples generated and used in the TFX pipeline maintain a specific order. This is quite a fundamental problem, as we need the data ordered (by timestamp or similar) so that the windows generated from it are meaningful.
We're currently looking into feeding already-windowed data into ExampleGen as a workaround, as suggested, but of course it would be great if this issue could be solved within TFX.
I wrote a blog post the other day on how to efficiently use tf.data.Dataset to preserve data integrity across multiple TFRecord files. I won't go into details here, but you can read more at:
“Advanced Tensorflow Data Input Pipelines: Handling Time Series” https://link.medium.com/Y5SbNJF4vbb
But a requirement for that is to split inputs into multiple TFRecord files as an ordered series, e.g. record 1 covers rows 0 to N, record 2 covers rows N+1 to M, and so on. I'm not sure whether the TFX ExampleGen or preprocessing stage preserves order, and I have not had time to test it (with multiple workers on Beam) at this stage.
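To illustrate the ordering requirement, here's a small self-contained sketch (file names hypothetical): `tf.data.TFRecordDataset` reads the listed files sequentially and in order, as long as you don't enable parallel reads (`num_parallel_reads`) or non-deterministic interleaving.

```python
import os
import tempfile
import tensorflow as tf

tmp = tempfile.mkdtemp()
filenames = []
# Write two ordered shards: part 0 holds rows 0-2, part 1 holds rows 3-5.
for part, rows in enumerate([[0, 1, 2], [3, 4, 5]]):
    path = os.path.join(tmp, f"series-part-{part}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for row in rows:
            writer.write(str(row).encode())
    filenames.append(path)

# Sequential reading preserves the global row order across shards.
ds = tf.data.TFRecordDataset(filenames)
records = [int(record.numpy()) for record in ds]
# records == [0, 1, 2, 3, 4, 5]
```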
Also, features like having the Evaluator use an input function are still missing.
Performance note: I'm not sure whether partitioning such a dataset across multiple files can improve performance (so that loading it from rotational drives on a distributed file system is on par with pushing everything onto a super fast NVMe drive). Currently I just use one big ~20 GB TFRecord file, growing by about 0.5 GB per month.
These things could get much easier with some utility functions from the TFX ecosystem.
As far as I understand, combining this Issue with SequenceExample should do the trick without the need to split into multiple TFRecord files. Am I right?
I am still looking for a working example that feeds a trainer with sliding windows (via datasets). I haven't yet figured out how to use SequenceExample with datasets. Does anyone have a working example (a time-series one, not NLP)?
I disagree with `SequenceExample`. I do not want to window my data beforehand, as this is easily doable with an input function if the files are read in order (even if the data points are sharded into small, ordered files).
Suppose you fix `ExampleGen` to work deterministically with huge amounts of data and to read everything in order (even with a single thread). There are still some parts missing:

- `Evaluator` does not use an input function. Neither does `InfraValidator`, nor `BulkInferrer`. Not materializing in the preprocessing stage makes sense for other use cases as well, including time-series tasks.
- The `schema` protobuf becomes problematic. You can keep it as-is without `infer_feature_shape`, but the client then has to take extra steps for inference through TF Serving. I have not figured out a way to fix that, other than parsing the existing schema with a custom component and publishing it as an artifact again, injecting the window size.
- Given that you do have a `tf.data.Dataset` pipeline for the windowing stage, it is not possible (or extremely hard; I did not bother checking whether a workaround exists) to include the window size in the hyperparameter search space. The window is usually defined as a constant and just used in the `input_fn` (the `input_fn` cannot change between hypermodel builds).
Workarounds in the model signature can support this kind of variable window. Here is an example from a recurrent model I've built; essentially, the window size is exposed on the serving signature as metadata: input_fn_utils.py, model_building_utils.py, model_sample.py (there are some unnecessary data-point drops in the tf.data.Dataset part because I have not updated it per the link I posted above, but otherwise you get the idea of how to do tf.data.Dataset windowing properly and how to change the default serving signature).
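A hedged sketch of the "window size as serving metadata" idea (the constant and function name are hypothetical): attach an extra argument-free `tf.function` whose only job is to report the window size, and export it as an additional signature alongside `serving_default`.

```python
import tensorflow as tf

WINDOW_SIZE = 8  # hypothetical; would come from your pipeline config

# A "fake" tf.function that exposes the expected window size to clients,
# so they can query it before building an inference request.
@tf.function(input_signature=[])
def window_size_fn():
    return {"window_size": tf.constant(WINDOW_SIZE, dtype=tf.int32)}

# When exporting, something like:
# tf.saved_model.save(model, export_dir, signatures={
#     "serving_default": serve_fn,
#     "window_size": window_size_fn,
# })
```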
Final note: if all of these are fixed, it becomes much easier to keep only an append-only stack of input TFRecord files and to avoid copying the entire TFRecord dataset on multiple runs (assuming you re-adapt the tft preprocessing layers without materialization).
I share @ntakouris's concerns, and I am having many problems using TFX for time series, especially in the context of dimensionality. I would appreciate more detail in the documentation, or a complete example.
So stale that I moved to PyTorch.
I've been creating an e2e pipeline using TFX for the past month. I feel that some things could be vastly improved for such tasks. Please read on, and if something does not make sense, @ me to explain more :); this is a matter open for discussion.

The task is simple: given `input_window_size` timesteps of features, compute the feature values at the next timestep. Preprocessing will always be an easy 3-liner with Transform (just z-score scale everything).
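In a `preprocessing_fn` that would be `tft.scale_to_z_score` applied to every feature; written out here in plain TF so the math is visible (a sketch, not the author's actual code):

```python
import tensorflow as tf

def z_score(x):
    # Column-wise standardization: subtract the mean and divide by the
    # standard deviation, which is what tft.scale_to_z_score computes
    # over the full dataset during the Transform analysis pass.
    mean = tf.reduce_mean(x, axis=0)
    std = tf.math.reduce_std(x, axis=0)
    return (x - mean) / std
```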
From this point on, these problems require attention:
- Everything is shaped `(None,)`, both in and out. This makes parsing a bit harder on the serving side and introduces some schema problems.
- The `_input_fn` depends on the window parameters.

Lack of a label column
Given a dataset D (T timesteps, N features), window it into W-sized frames, using the W+i-th element as the label. This is a bit of a hassle to do with just `tfds`. Here's a sample code that does it:

This works, but it creates problems for serving and hyperparameter search, restricting you to a constant window per job run, because the `tuner_fn` does not call your `run_fn` directly (since the latter also saves the model); it just builds and evaluates the model in its own context. Therefore, your `input_fn` can't depend on hyperparameters. This could be fixed by having a separate function that saves the model for serving, and moving model building and input building into a different function shared by the Tuner and Trainer.
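A minimal sketch of the W-framed windowing described above with a plain `tf.data` pipeline (the toy series and W value are made up; this is not the snippet originally linked):

```python
import tensorflow as tf

W = 3  # hypothetical input window size
series = tf.reshape(tf.range(20, dtype=tf.float32), [10, 2])  # T=10, N=2

ds = tf.data.Dataset.from_tensor_slices(series)
# Slide a window of W+1 rows over the series; the first W rows are the
# input frame, and the (W+1)-th row is the label.
ds = ds.window(W + 1, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda w: w.batch(W + 1))
ds = ds.map(lambda w: (w[:-1], w[-1]))

x, y = next(iter(ds))
# x has shape (W, N) == (3, 2); y has shape (N,) == (2,)
```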
Serving
I've not managed to invoke the model built this way with `tf.Example`, but I made a new `_get_serve_raw_fn` that's used to parse raw JSON time-series data. There are a couple of problems.

First, the preprocessing and the `tfds` loading: the model just receives tensors (via the unmarshalling function on the `tfds`), so you've got to make some sort of manual mapping from key names to transformed tensors (e.g. `feature` to `feature_xf`), either by adding another marshalling reconstruction layer (bringing back the ditched dict keys) on the dataset-loading side along with named model inputs, or by changing the serving function to map the `tft_layer` outputs to your model.

The second problem is related to the dimensions of `tft_transform`: you need to put inputs in the form of `(None,)` to broadcast the z-score operations across all the window's feature columns (using `None` in the signature spec is not optimal; constraints could be typed into the input tensor spec), and then reshape (by messing with the batch dimension) via the Keras backend in order to support the normal model invocation. Here are my serving function builder and my signature:

More on type-safety and signatures: the shape can't be exposed on the `tf.Example` data either, but this is compensated by the schema on the protocol buffers (though there's the issue of dimensions with the `(None,)` of the tf transform again). Although, this could be fixed by generating a specific schema with the model's size after training, before saving the model for serving (still a workaround).

Sidenote: if you hit a specific problem, you can work around it by using a fake
tf.function
to expose some input or output metadata for use at serving time, if you've got a model client that can easily produce features of arbitrary window size (thus relying on the model to dictate the required window size).

Other micro-issues
- `FixedLenFeature(shape=None)`: this is kind of included in the windowing and dimension problems mentioned above.
- The hyperparameters artifact passed to the `trainer_fn` needs two lines of custom unpacking (I couldn't find this in the examples):
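The snippet that followed wasn't preserved; as a stand-in, here's a hedged sketch of that unpacking, assuming the Tuner hands the trainer a KerasTuner-style config dict (a `{'space': ..., 'values': ...}` mapping, as `HyperParameters.get_config()` produces; the helper name is made up):

```python
def unpack_hyperparameters(hyperparameters_config):
    # The Trainer receives the tuner's choices as a plain config dict;
    # the chosen values live under the 'values' key (assumption based on
    # KerasTuner's HyperParameters.get_config() layout).
    if not hyperparameters_config:
        return {}
    return dict(hyperparameters_config.get("values", {}))

# Inside run_fn/trainer_fn, something like:
# hparams = unpack_hyperparameters(fn_args.hyperparameters)
# learning_rate = hparams.get("learning_rate")
```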