Thanks @deepyaman for putting them together.
Allowing users to complete all data-related tasks before model training, without switching to other tools, would be highly beneficial. Because users need to understand the data thoroughly before selecting suitable features and preprocessing strategies, integrating EDA (univariate analysis, correlation analysis, and feature importances) into the feature engineering phase is important. This ensures users have a comprehensive understanding of the data and can make informed decisions during feature selection and preprocessing.
Does not address exploratory data analysis (EDA) or model training-related procedures
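For illustration, here is a minimal sketch of what the univariate and correlation pieces of that EDA could look like directly on an Ibis table, without leaving the backend; the in-memory table and column names below are placeholders:

```python
import ibis

# Placeholder training table; in practice this could come from any Ibis backend.
t = ibis.memtable(
    {
        "price": [10.0, 12.5, None, 9.0, 14.0],
        "quantity": [1, 3, 2, 1, 4],
        "category": ["a", "b", "a", "a", "c"],
    }
)

# Univariate summary of a numeric column.
summary = t.aggregate(
    mean=t.price.mean(),
    std=t.price.std(),
    nulls=t.price.isnull().sum(),
).to_pandas()

# Frequency table for a categorical column.
counts = t.category.value_counts().to_pandas()

# Pairwise correlation between two numeric columns.
corr = t.aggregate(corr=t.price.corr(t.quantity)).to_pandas()
```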
I agree that it could be valuable to handle more of what Ibis is well-suited for (e.g. some EDA). Your open issue on the ibis repo is very relevant. W.r.t. model training, that ultimately needs to be handled by other libraries, but we should make sure the handoffs are smooth and efficient.
Feature engineering is a much bigger topic; I could see Ibis-ML expanding in that direction, to include some auto-FE (a la Featuretools), but it's not clear whether that's a priority. It's also a bit separate from the initial focus.
For consideration from @jcrist just now: consider something like `transform_sklearn(est, table) -> table` over `from_sklearn(est) -> some_new_type`, to avoid naming/designing the `some_new_type` object.

@deepyaman: The `some_new_type` could just be a transform (or step post-refactor?); check which option will be easier.
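A rough sketch of what the first option could look like for a single fitted `StandardScaler`, translating its learned parameters into Ibis expressions; `transform_sklearn` here is the hypothetical helper from the comment above, not an existing API:

```python
import ibis
from sklearn.preprocessing import StandardScaler


def transform_sklearn(est: StandardScaler, table: ibis.Table) -> ibis.Table:
    # Hypothetical option-A helper: apply a *fitted* StandardScaler to an Ibis
    # table by translating its learned parameters into Ibis expressions, so no
    # UDFs are involved and the result stays an ibis.Table.
    return table.mutate(
        **{
            str(col): (table[col] - float(mean)) / float(scale)
            for col, mean, scale in zip(est.feature_names_in_, est.mean_, est.scale_)
        }
    )


t = ibis.memtable({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
est = StandardScaler().fit(t.to_pandas())

scaled = transform_sklearn(est, t)  # table in, table out
print(scaled.to_pandas())
```

Under the second option, `from_sklearn(est)` would return an object wrapping the same translation, at the cost of naming and designing that wrapper type.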
IbisML 0.1.0 is released and covers most of this.
## Objectives

### TL;DR
Start at the "Alternatives considered" section.
## Constraints

## Mapping the landscape
Data processing for ML is a broad area. We need a strategy to differentiate our value and narrow our focus to where we can provide immediate value.
### Breaking down an end-to-end ML pipeline
Stephen Oladele’s neptune.ai blog article provides a high-level depiction of a standard ML pipeline.
The article also describes each step of the pipeline. Based on the previously established constraints, we will limit ourselves to the data preparation and model training components.
The data preparation (data preprocessing and feature engineering) and model training parts can be further subdivided into a number of processes:
### Feature comparison (WIP)

<details>
<summary>Details</summary>
### Tecton

* **Feature creation:** Yes
  * This is one of Tecton’s core value propositions. They support Spark and Rift (proprietary Python-based compute engine) for feature definition. Rift allows a broader range of Python transformations (i.e. not just SQL-like operations, and avoiding UDFs).
* **Feature publishing:** Yes
  * The other half of Tecton’s core capabilities.
* **Training dataset generation:** Yes
  * In Tecton, this involves first retrieving published features: https://docs.tecton.ai/docs/reading-feature-data/reading-feature-data-for-training/constructing-training-data
* **Data segregation:** No
* **Cross validation:** No
* **Hyperparameter tuning:** No
* **Feature preprocessing:** No
  * Together with model development, this is delegated to another library (e.g. scikit-learn).
* **Feature selection:** No
* **Model training:** No
* **Feature serving:** Yes
  * https://docs.tecton.ai/docs/reading-feature-data/reading-feature-data-for-inference

### Scikit-learn

* **Feature creation:** No
* **Feature publishing:** No
* **Training dataset generation:** No
* **Data segregation:** Yes
* **Cross validation:** Yes
* **Hyperparameter tuning:** Yes
* **Feature preprocessing:** Yes
* **Feature selection:** Yes
* **Model training:** Yes
* **Feature serving:** No

### BigQuery ML

* **Feature creation:** No
  * Just write SQL in BigQuery itself. 🙂
* **Feature publishing:** Partial
  * [Vertex AI Feature Store](https://cloud.google.com/vertex-ai/docs/featurestore/latest/overview)
* **Training dataset generation:** Yes
  * Either pull a BigQuery table or fetch data from Vertex AI Feature Store, depending on if features are published.
* **Data segregation:** Partial
  * Pass `DATA_SPLIT_*` parameters to your `CREATE MODEL` statement to control how train-test splitting is done. You can’t extract the split dataset.
* **Cross validation:** No (automated?)
* **Hyperparameter tuning:** Yes
  * Pass `HPARAM_*` parameters to your `CREATE MODEL` statement.
* **Feature preprocessing:** Yes
  * E.g. https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-imputer
* **Feature selection:** No
  * Does have https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-importance
* **Model training:** Yes
* **Feature serving:** Partial
  * See feature publishing

### NVTabular

* **Feature creation:** Partial
  * [LambdaOp](https://nvidia-merlin.github.io/NVTabular/main/api/ops/lambdaop.html) and [JoinExternal](https://nvidia-merlin.github.io/NVTabular/v0.6.1/api/ops/joinexternal.html) enable very simple row-level feature engineering
* **Feature publishing:** No
* **Training dataset generation:** No
* **Data segregation:** No
* **Cross validation:** No
* **Hyperparameter tuning:** No
* **Feature preprocessing:** Yes
* **Feature selection:** No
* **Model training:** No
* **Feature serving:** No

### Dask-ML

* **Feature creation:** No
* **Feature publishing:** No
* **Training dataset generation:** No
* **Data segregation:** Yes
* **Cross validation:** Yes
* **Hyperparameter tuning:** Yes
* **Feature preprocessing:** Yes
* **Feature selection:** No
* **Model training:** Yes
* **Feature serving:** No

### Ray

* **Feature creation:**
* **Feature publishing:**
* **Training dataset generation:**
* **Data segregation:**
* **Cross validation:**
* **Hyperparameter tuning:**
* **Feature preprocessing:**
* **Feature selection:**
* **Model training:**
* **Feature serving:**

</details>

## Ibis-ML product hypotheses
### Scope
…`ibis.Table` as training data, we don't need to care for now where it's coming from or how the process upstream was handled IMO."
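A minimal sketch of that boundary: downstream code only ever sees an `ibis.Table`, regardless of which backend produced it or how the data was prepared upstream (table names and contents below are placeholders):

```python
import ibis


def n_rows(training_data: ibis.Table) -> int:
    # Downstream code only sees an ibis.Table; it doesn't care which backend
    # produced it or how it was prepared upstream.
    return int(training_data.count().execute())


# The same function works whether the table lives in memory...
t_memory = ibis.memtable({"user_id": [1, 2, 3], "clicks": [4, 5, 6]})
print(n_rows(t_memory))

# ...or in a SQL backend such as DuckDB (placeholder table name/contents).
con = ibis.duckdb.connect()
t_duckdb = con.create_table("train", t_memory.to_pandas())
print(n_rows(t_duckdb))
```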
## Alternatives considered

End-to-end IMO also means that you should be able to go beyond just preprocessing the data. There are a few different approaches here:

…`.from_sklearn()` or something).

## Proposal
I propose to go with option #3 of the alternatives considered. In practice, this will mean:
* `from_sklearn` (and, in the future, potentially other libraries)

This also means that the following will be out of scope (at least, for now):
## Deliverables

### Guiding principles

### Demo workflows
We are currently targeting the NVTabular demo on the RecSys2020 Challenge as a demo workflow.
We need variants for all of:
With less priority:
### High-level deliverables
P0 deliverables must be included in the Q1 release. The remainder are prioritized opportunistically/for future development, but priorities may shift (e.g. due to user feedback).
* `to_dmatrix`/`to_dask_dmatrix` are already implemented
* `tidymodels`
* `from_sklearn`
* `from_sklearn` (i.e. those with predict functions that don't require UDFs)
* `from_sklearn` (e.g. PCA, or some more frequently used step)
* `from_sklearn` (e.g. SGDRegressor)
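As a point of reference for the `to_dmatrix`/`to_dask_dmatrix` deliverables above, here is a rough sketch of the manual handoff they would streamline, written with only stock Ibis, pandas, and XGBoost calls; the actual helpers may differ, and the table contents/column names are placeholders:

```python
import ibis
import xgboost as xgb

# Preprocessed training data as an Ibis table (placeholder columns).
train = ibis.memtable(
    {"f0": [0.1, 0.2, 0.3, 0.4], "f1": [1.0, 0.0, 1.0, 0.0], "label": [0, 1, 0, 1]}
)

# Materialize locally and hand off to XGBoost. A to_dmatrix-style helper
# presumably streamlines exactly this step; the manual version is shown here.
df = train.to_pandas()
dtrain = xgb.DMatrix(df[["f0", "f1"]], label=df["label"])

booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=5)
```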
## Questions for validation
## Changelog

### 2024-03-19
Based on discussion with stakeholders around the Ibis-ML use cases and vision, some of the priorities have shifted:
* `from_sklearn` is no longer a priority, moving from P0 to P3.
* `sklearn.preprocessing` is a higher priority. We break down the relative priority of implementing steps in a separate issue.
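For context, here is a rough sketch of what two such `sklearn.preprocessing`-equivalent steps (standardization and one-hot encoding) reduce to as plain Ibis expressions; this is illustrative only, not IbisML's actual step API, and the table/column names are placeholders:

```python
import ibis

# Placeholder table with one numeric and one categorical column.
t = ibis.memtable({"x": [1.0, 2.0, 3.0, 4.0], "color": ["red", "blue", "red", "green"]})

# Standardization (a la sklearn.preprocessing.StandardScaler), expressed as a
# window-style expression over the whole table.
t = t.mutate(x_scaled=(t.x - t.x.mean()) / t.x.std())

# One-hot encoding (a la sklearn.preprocessing.OneHotEncoder): learn the
# categories once, then emit one indicator column per category.
categories = t.color.value_counts().to_pandas()["color"].tolist()
t = t.mutate(**{f"color_{c}": (t.color == c).cast("int8") for c in categories})

print(t.to_pandas())
```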