ODSC West - CFP Closes 05/20/24 #29

ncclementi commented 3 months ago

https://odsc.com/california/call-for-speakers-west/

Talk Session Formats

Proposals will be considered for the following types of presentations:

Format for Technical Sessions

- Talk (30 minutes)
- Hands-on Workshop (2 hours)
- Tutorial (60 minutes, hands-off)
- Lightning Talks (10 minutes)

Format for Business Sessions

- Talk (30 minutes)
- Case Studies (30 minutes)
- Hands-on Workshop (60 minutes)
- Startup Talk (30 minutes)
jitingxu1 commented 1 month ago

Update 1, based on comments:

Title: IbisML: Efficiently Streamlining and Unifying ML Data Processing from Development to Production

Description

Machine learning projects require transforming raw data into prepared samples using a combination of feature engineering pipelines and online last-mile processing, integrated with model training workflows. Data scientists and engineers collaborate to prototype, develop, scale, and deploy both batch and streaming jobs. These processes present several challenges:

To address these challenges, IbisML harnesses the power of Ibis, offering a library designed to streamline and unify data preprocessing and feature engineering workflows across diverse environments and data scales. Its unified codebase eliminates the need for rewriting logic during transitions from local development to large-scale distributed production and from batch to streaming with the following key features:

The talk will explore the gaps in existing projects, emphasizing how IbisML tackles these challenges and enables seamless transitions between development and deployment, in both offline and online scenarios. At the end of the talk, we'll use IbisML to build machine learning models end to end: starting from data engineering, moving through last-mile preprocessing with IbisML recipes across various backends, and feeding the prepared data into downstream model training libraries and frameworks such as scikit-learn, XGBoost, and PyTorch.
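A minimal sketch of the idea above, assuming IbisML's `ibis_ml` Recipe API and Ibis's DuckDB and PySpark backends; the dataset path, column names, and step/selector names are illustrative assumptions, not part of the proposal:

```python
# Sketch: the same transformation logic targets a local backend during
# development and a distributed backend in production; only the connection
# changes. The dataset path and column names below are placeholders.
import ibis
import ibis_ml as ml

# Local development: DuckDB over a sampled file.
con = ibis.duckdb.connect()
raw = con.read_csv("training_sample.csv")

# Feature engineering as a lazy Ibis expression, executed by the backend.
features = raw.mutate(amount_log=raw.amount.log())

# Last-mile preprocessing as an IbisML recipe (step names are assumptions).
recipe = ml.Recipe(
    ml.ImputeMean(ml.numeric()),     # fill missing numeric values
    ml.ScaleStandard(ml.numeric()),  # standardize numeric columns
)

# Production would swap only the connection, e.g.:
# con = ibis.pyspark.connect(spark_session)
```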

Notes

deepyaman commented 1 month ago

@jitingxu1 Some quick notes:

I don't 100% know whether we need to talk about streaming in this talk, but I feel like there may be differing views.

At a higher level (probably more important): what will the talk cover? I think this is a description of the project and what it can do, but not of the talk. I think there needs to be a clearer discussion of what the gap with current projects is.

jitingxu1 commented 1 month ago

> What does multilingual frameworks mean? This point isn't clear to me personally.

For example, using Spark for batch features, Flink for streaming features, and pandas, scikit-learn, or PyTorch for last-mile preprocessing.

> I don't 100% know whether we need to talk about streaming in this talk, but I feel like there may be differing views.

Highlighting a strength of IbisML: it has streaming support. This capability might distinguish it from other options.

zhenzhongxu commented 1 month ago

Great write-up! I wonder if it makes sense to talk about the benefits of moving away from sampling toward using more holistic data, and how that just works with IbisML. I've heard of a few cases where large organizations want to train models on the full data instead of relying on sampling.

jitingxu1 commented 1 month ago

Here is the submitted version. Thanks @ncclementi @deepyaman @chip for the review.

Title: Building ML pipelines that run anywhere with IbisML

Abstract

From inception to production, the ML lifecycle requires a lengthy process involving multiple people, programming languages, and computational frameworks. In a traditional workflow, data scientists develop models and experiment with different features locally, using tools like pandas and scikit-learn on a small, often subsampled, dataset. However, as the need arises to scale up to larger datasets and production environments, engineers face the challenge of rewriting and testing these processes in distributed computing systems like Apache Spark or Dask. While frameworks like these have their own ML libraries (of various flavors and maturities) and technically allow the user to run on single machines and clusters, scaling ML pipelines is costly, resource-intensive, and inefficient.

IbisML is an open-source Python library designed for building and running scalable ML pipelines from experiment to production. It's built on top of Ibis, an open-source library that provides a familiar dataframe API to build up expressions that can be executed on a wide array of backends. Users can rely on tools like DuckDB and Polars for efficient local computation, then scale to distributed engines such as Spark, BigQuery, and Snowflake. With IbisML, users can preprocess data at scale across development and deployment, compose transformations with other scikit-learn estimators, and seamlessly integrate with scikit-learn, XGBoost, and PyTorch models without rewriting code.

In this talk, we will introduce IbisML and the utilities it provides to streamline ML pipeline development. We will demonstrate its functionalities on a simple, real-world problem, including the ability to train and fit estimators on different backends. Finally, we will showcase how you can efficiently hand off to the modeling framework of your choice.
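As a rough illustration of the hand-off the abstract describes (not part of the submission), an IbisML recipe can be composed with a scikit-learn estimator in an ordinary `Pipeline`; the table, column, and step names here are assumptions:

```python
# Sketch of composing an IbisML recipe with a scikit-learn model.
# The parquet path, "label" column, and ibis_ml step names are placeholders.
import ibis
import ibis_ml as ml
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

con = ibis.duckdb.connect()               # local backend for development
data = con.read_parquet("train.parquet")  # placeholder dataset

X = data.drop("label")                    # features as an Ibis table
y = data.label                            # target column

pipe = Pipeline([
    ("prep", ml.Recipe(
        ml.ImputeMean(ml.numeric()),
        ml.ScaleStandard(ml.numeric()),
    )),
    ("model", LogisticRegression()),
])

# Preprocessing executes on the Ibis backend; the prepared data is handed
# to scikit-learn (an XGBoost or PyTorch model would slot in similarly).
pipe.fit(X, y)
```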