ibis-project / ibis-ml

IbisML is a library for building scalable ML pipelines using Ibis.
https://ibis-project.github.io/ibis-ml/
Apache License 2.0

feat: support more parts of end-to-end ML workflow #19

Closed · deepyaman closed this 2 weeks ago

deepyaman commented 4 months ago

Objectives

TL;DR

Start at the "Alternatives considered" section.

Constraints

Mapping the landscape

Data processing for ML is a broad area. We need a strategy to differentiate our value and narrow our focus to where we can provide immediate value.

Breaking down an end-to-end ML pipeline

Stephen Oladele’s neptune.ai blog article provides a high-level depiction of a standard ML pipeline.

[image: high-level diagram of a standard ML pipeline] Source: https://neptune.ai/blog/building-end-to-end-ml-pipeline

The article also describes each step of the pipeline. Based on the previously-established constraints, we will limit ourselves to the data preparation and model training components.

The data preparation (data preprocessing and feature engineering) and model training parts can be further subdivided into a number of processes:

[!NOTE]
The above list of processes is adapted from the linked article. I've updated some of the definitions based on my experience and understanding.

Feature comparison (WIP)

|                             | Tecton | Scikit-learn | BigQuery ML | NVTabular | Dask-ML | Ray |
|-----------------------------|--------|--------------|-------------|-----------|---------|-----|
| Feature creation            | Yes    | No           | No          | Partial   | No      |     |
| Feature publishing          | Yes    | No           | Partial     | No        | No      |     |
| Training dataset generation | Yes    | No           | Yes         | No        | No      |     |
| Data segregation            | No     | Yes          | Partial     | No        | Yes     |     |
| Cross validation            | No     | Yes          | No          | No        | Yes     |     |
| Hyperparameter tuning       | No     | Yes          | Yes         | No        | Yes     |     |
| Feature preprocessing       | No     | Yes          | Yes         | Yes       | Yes     |     |
| Feature selection           | No     | Yes          | No          | No        | No      |     |
| Model training              | No     | Yes          | Yes         | No        | Yes     |     |
| Feature serving             | Yes    | No           | Partial     | No        | No      |     |
<details>
<summary>Details</summary>

### Tecton

* **Feature creation:** Yes
  * This is one of Tecton’s core value propositions. They support Spark and Rift (a proprietary Python-based compute engine) for feature definition. Rift allows a broader range of Python transformations (i.e. not just SQL-like operations, and avoiding UDFs).
* **Feature publishing:** Yes
  * The other half of Tecton’s core capabilities.
* **Training dataset generation:** Yes
  * In Tecton, this involves first retrieving published features: https://docs.tecton.ai/docs/reading-feature-data/reading-feature-data-for-training/constructing-training-data
* **Data segregation:** No
* **Cross validation:** No
* **Hyperparameter tuning:** No
* **Feature preprocessing:** No
  * Together with model development, this is delegated to another library (e.g. scikit-learn).
* **Feature selection:** No
* **Model training:** No
* **Feature serving:** Yes
  * https://docs.tecton.ai/docs/reading-feature-data/reading-feature-data-for-inference

### Scikit-learn

* **Feature creation:** No
* **Feature publishing:** No
* **Training dataset generation:** No
* **Data segregation:** Yes
* **Cross validation:** Yes
* **Hyperparameter tuning:** Yes
* **Feature preprocessing:** Yes
* **Feature selection:** Yes
* **Model training:** Yes
* **Feature serving:** No

### BigQuery ML

* **Feature creation:** No
  * Just write SQL in BigQuery itself. 🙂
* **Feature publishing:** Partial
  * [Vertex AI Feature Store](https://cloud.google.com/vertex-ai/docs/featurestore/latest/overview)
* **Training dataset generation:** Yes
  * Either pull a BigQuery table or fetch data from Vertex AI Feature Store, depending on whether features are published.
* **Data segregation:** Partial
  * Pass `DATA_SPLIT_*` parameters to your `CREATE MODEL` statement to control how train-test splitting is done. You can’t extract the split dataset.
* **Cross validation:** No (automated?)
* **Hyperparameter tuning:** Yes
  * Pass `HPARAM_*` parameters to your `CREATE MODEL` statement.
* **Feature preprocessing:** Yes
  * E.g. https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-imputer
* **Feature selection:** No
  * Does have https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-importance
* **Model training:** Yes
* **Feature serving:** Partial
  * See feature publishing

### NVTabular

* **Feature creation:** Partial
  * [LambdaOp](https://nvidia-merlin.github.io/NVTabular/main/api/ops/lambdaop.html) and [JoinExternal](https://nvidia-merlin.github.io/NVTabular/v0.6.1/api/ops/joinexternal.html) enable very simple row-level feature engineering
* **Feature publishing:** No
* **Training dataset generation:** No
* **Data segregation:** No
* **Cross validation:** No
* **Hyperparameter tuning:** No
* **Feature preprocessing:** Yes
* **Feature selection:** No
* **Model training:** No
* **Feature serving:** No

### Dask-ML

* **Feature creation:** No
* **Feature publishing:** No
* **Training dataset generation:** No
* **Data segregation:** Yes
* **Cross validation:** Yes
* **Hyperparameter tuning:** Yes
* **Feature preprocessing:** Yes
* **Feature selection:** No
* **Model training:** Yes
* **Feature serving:** No

### Ray

* **Feature creation:**
* **Feature publishing:**
* **Training dataset generation:**
* **Data segregation:**
* **Cross validation:**
* **Hyperparameter tuning:**
* **Feature preprocessing:**
* **Feature selection:**
* **Model training:**
* **Feature serving:**

</details>
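As a concrete reference point for the scikit-learn column above, data segregation, cross validation, hyperparameter tuning, feature preprocessing, feature selection, and model training all live in a single library. A minimal sketch (synthetic data; the specific estimators are just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Data segregation: hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),    # feature preprocessing
    ("select", SelectKBest(k=5)),   # feature selection
    ("clf", LogisticRegression()),  # model training
])

# Cross validation + hyperparameter tuning in one object.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0]}, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```

Feature creation, publishing, and serving are the notable gaps, which is exactly the half of the table that feature platforms like Tecton occupy.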

Ibis-ML product hypotheses

Scope

Alternatives considered

End-to-end IMO also means that you should be able to go beyond just preprocessing the data. There are a few different approaches here:

  1. Ibis-ML supports fitting data preprocessing steps (during the training process) and applying pre-trained Ibis-ML preprocessing steps (during inference).
    • Pros: Ibis-ML is used during both the training and inference process
    • Cons: Ibis-ML only supports data preprocessing, and even then only the subset of steps that can be fit in database (excluding some very widely-used steps, like PCA, that sit in the middle of the data-preprocessing pipeline)
  2. Ibis-ML supports constructing transformers from a wider range of pre-trained preprocessors and models (from other libraries, like scikit-learn), and applying them across backends (during inference).
    • Pros: Ibis-ML gives users the ability to apply a much wider range of steps in the ML process at inference time, including preprocessing steps that can be applied as linear transformations (e.g. PCA) and even linear models (e.g. `SGDRegressor`, `GLMClassifier`). You can even showcase the end-to-end capabilities using just Ibis (from raw data to model outputs, all on your database, across streaming and batch, powered by Ibis)
    • Cons: Ibis-ML doesn't support training the preprocessors on multiple backends; the expectation is that you use a dedicated library/existing local tools for training
  3. A combination of 1 & 2, where Ibis-ML supports a wider range of preprocessing steps and models, but only a subset support a fit method (those that don't must instead be constructed via `.from_sklearn()` or something similar).
    • Pros: Support the wider range of operations, and also fitting everything on the database in simple cases.
    • Cons: ~~Confusing? If I can train some of my steps using Ibis-ML, but for the rest I have to go to a different library, it doesn't feel very unified.~~ @jcrist makes a good point that it's not so confusing, because of the separation of transformers and steps.
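The split in option 3 can be sketched in plain Python. This is a hypothetical shape, not the actual Ibis-ML API; names like `StandardScalerStep`, `SklearnTransform`, and `from_sklearn` are illustrative only:

```python
class StandardScalerStep:
    """A step that CAN be fit natively: mean and standard deviation are
    simple aggregations a database backend can compute."""

    def fit(self, column):
        n = len(column)
        self.mean_ = sum(column) / n
        self.std_ = (sum((x - self.mean_) ** 2 for x in column) / n) ** 0.5
        return self

    def transform(self, column):
        return [(x - self.mean_) / self.std_ for x in column]


class SklearnTransform:
    """A transform wrapping an already-fitted estimator (e.g. a PCA fit
    locally with scikit-learn) for inference only -- it deliberately has
    no fit method of its own."""

    @classmethod
    def from_sklearn(cls, est):
        obj = cls()
        obj.est = est
        return obj

    def transform(self, rows):
        return self.est.transform(rows)
```

The user-facing distinction is then just "steps are fittable, transforms are pre-trained," which is the separation @jcrist points to.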

Proposal

I propose to go with option #3 of the alternatives considered. In practice, this will mean:

This also means that the following will be out of scope (at least, for now):

Deliverables

Guiding principles

Demo workflows

  1. Fit preprocessing on DuckDB (local experience, during experimentation)
    1. Experiment with different features
  2. Fit finalized preprocessing on larger dataset (e.g. from BigQuery)
  3. Perform inference on larger dataset

We are currently targeting the NVTabular demo on RecSys2020 Challenge as a demo workflow.

We need variants for all of:

With less priority:

High-level deliverables

P0 deliverables must be included in the Q1 release. The remainder are prioritized opportunistically/for future development, but priorities may shift (e.g. due to user feedback).

Questions for validation

Changelog

2024-03-19

Based on discussion around the Ibis-ML use cases and vision with stakeholders, some of the priorities have shifted:

jitingxu1 commented 4 months ago

Thanks @deepyaman for putting these together.

Allowing users to complete all data-related tasks before model training, without switching to other tools, would be highly beneficial. Given that users need to understand the data thoroughly before selecting suitable features and preprocessing strategies, integrating EDA (univariate analysis, correlation analysis, and feature importances) into the feature engineering phase becomes imperative. This approach ensures that users have a comprehensive understanding of the data, empowering them to make informed decisions during the feature selection and preprocessing stages.

Does not address exploratory data analysis (EDA) or model training-related procedures

deepyaman commented 4 months ago

> Thanks @deepyaman for putting these together.
>
> Allowing users to complete all data-related tasks before model training, without switching to other tools, would be highly beneficial. Given that users need to understand the data thoroughly before selecting suitable features and preprocessing strategies, integrating EDA (univariate analysis, correlation analysis, and feature importances) into the feature engineering phase becomes imperative. This approach ensures that users have a comprehensive understanding of the data, empowering them to make informed decisions during the feature selection and preprocessing stages.
>
> Does not address exploratory data analysis (EDA) or model training-related procedures

I agree that it could be valuable to handle more where Ibis is well-suited (e.g. some EDA). Your open issue on the ibis repo is very relevant. W.r.t. model training, that ultimately would need to be handled by other libraries, but we should make sure that the handoffs are smooth and efficient.
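For a sense of scale, the EDA being discussed is mostly aggregations and pairwise statistics, which is the kind of work that pushes down well to a database. A minimal sketch of the idea, using pandas purely for illustration (the column names and data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [40_000, 52_000, 71_000, 88_000, 95_000],
    "churned": [0, 0, 1, 1, 1],
})

# Univariate analysis: per-column summary statistics.
univariate = df.describe()

# Correlation analysis: pairwise Pearson correlations.
corr = df.corr(numeric_only=True)
print(corr.loc["age", "income"])
```

Feature importances are the exception: they require a fitted model, so they sit on the model-training side of the handoff rather than in pure data processing.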

Feature engineering is a much bigger topic; I could see Ibis-ML expanding in that direction, to include some auto-FE (a la Featuretools), but it's not clear whether that's a priority. It's also a bit separate from the initial focus.

deepyaman commented 4 months ago

For consideration from @jcrist just now: Consider something like `transform_sklearn(est, table) -> table` over `from_sklearn(est) -> some_new_type` to avoid naming/designing the `some_new_type` object.

@deepyaman: The `some_new_type` could just be a transform (or step post-refactor?); check which option will be easier.
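The two API shapes under discussion can be contrasted in a few lines. This is a hypothetical sketch of the shapes only; `transform_sklearn`, `from_sklearn`, and `Transform` are stand-in names, not the actual Ibis-ML API:

```python
def transform_sklearn(est, table):
    """Function form: apply a fitted sklearn estimator to a table and
    return a table directly -- no intermediate wrapper type to name."""
    return est.transform(table)


class Transform:
    """Object form: the `some_new_type` being debated. Wrapping the
    estimator lets it compose with other transforms, at the cost of
    having to name and design this type."""

    def __init__(self, est):
        self.est = est

    def transform(self, table):
        return self.est.transform(table)


def from_sklearn(est):
    return Transform(est)
```

Both forms produce the same result; the trade-off is purely whether the wrapper object needs to exist as a first-class, composable thing.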

deepyaman commented 2 weeks ago

IbisML 0.1.0 is released and covers most of this.