pytorch / torcharrow

High performance model preprocessing library on PyTorch
https://pytorch.org/torcharrow/beta/index.html
BSD 3-Clause "New" or "Revised" License

[RFC] Rethinking of ML Preproc in PyTorch Ecosystem #515

Open wenleix opened 1 year ago

wenleix commented 1 year ago

TL;DR: Is nn.Module all you need for last-mile preproc?

TorchArrow set out to rethink data preparation pipelines for AI. After iterating on real product workload launches, we believe now is the right time to rethink the future strategy and direction.

The ML data world can be categorized into two parts: (1) Dataset Preparation (which is more offline; much of it is also known as feature engineering) and (2) Last-mile Preproc (which interacts and iterates more closely with model authoring). The boundary can sometimes be blurry: during new model iteration, more work is treated as "Last-mile Preproc", but it may later stabilize and graduate into Dataset Preparation.

The Dataset Preparation part is a natural fit for DataFrame (and can potentially be unified with feature engineering). For last-mile preproc, however, nn.Module together with a Dict[Tensor]- or TensorDict-flavored structure seems to be the more natural interface for ML engineers.
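To make the last-mile side concrete, here is a minimal sketch (not from the RFC) of preproc expressed as an nn.Module over a dict of tensors. The feature names and transforms are hypothetical; the point is only that the preproc composes, serializes, and packages like any other module.

```python
# Hypothetical last-mile preproc as an nn.Module over Dict[str, Tensor].
from typing import Dict

import torch
import torch.nn as nn


class LastMilePreproc(nn.Module):
    """Clamp, log-scale, and bucketize a few dense features before the model."""

    def __init__(self, num_buckets: int = 16):
        super().__init__()
        # Bucket boundaries registered as a buffer so they serialize with the module.
        self.register_buffer("boundaries", torch.linspace(0.0, 10.0, num_buckets - 1))

    def forward(self, features: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        out = dict(features)
        # Dense feature: clamp then log1p to tame heavy tails.
        out["watch_time"] = torch.log1p(features["watch_time"].clamp(min=0.0))
        # Turn a continuous score into a bucket id, e.g. for an embedding table lookup.
        out["score_bucket"] = torch.bucketize(features["score"], self.boundaries)
        return out


# Usage: the module composes with the model and iterates in the same authoring loop.
preproc = LastMilePreproc()
batch = {"watch_time": torch.rand(4) * 100, "score": torch.rand(4) * 10}
transformed = preproc(batch)
```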

One potential approach (request for comments) is to use DataFrame for dataset preparation and nn.Module for last-mile preproc authoring. We can then implement a unified executor that supports both -- executing on both the Velox runtime and the PyTorch runtime (e.g. a packaged and serialized nn.Module) for preproc materialization. The Apache Arrow memory layout allows smooth interop between the Data and AI worlds.
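As a rough illustration of the split (again an assumption, not a TorchArrow/Velox API), the dataset-preparation stage could materialize an Arrow table, and the last-mile stage could consume it as a dict of tensors via Arrow's columnar layout, reusing the module sketched above. The column names and the `arrow_to_tensors` helper are illustrative only.

```python
# Hypothetical two-stage flow: DataFrame/Velox output handed to nn.Module preproc via Arrow.
from typing import Dict

import pyarrow as pa
import torch

# --- Stage 1: dataset preparation ------------------------------------------------------
# In practice this would be the materialized output of a DataFrame/Velox plan; a plain
# Arrow table stands in for it here.
prepared = pa.table({
    "watch_time": pa.array([12.0, 300.0, 45.5, 0.0]),
    "score": pa.array([1.2, 9.7, 4.4, 6.1]),
})


# --- Stage 2: last-mile preproc ---------------------------------------------------------
def arrow_to_tensors(table: pa.Table) -> Dict[str, torch.Tensor]:
    # Numeric Arrow columns convert to numpy (often without copying), then to tensors.
    # Cast to float32 to match the dtypes the module above expects.
    return {
        name: torch.from_numpy(table.column(name).to_numpy()).float()
        for name in table.column_names
    }


batch = arrow_to_tensors(prepared)
transformed = LastMilePreproc()(batch)  # the module sketched earlier
```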

See the following docs for details and discussions.

https://docs.google.com/document/d/1RHQDCAqLCAt9EkbtaUrd5ETjrbe_7HoV-9Mfxlohq4c/