junhuihuang / webpages


Stanford MLSys Seminar Series | #3

Closed junhuihuang closed 1 year ago

junhuihuang commented 2 years ago

I will progressively summarize talks I find illuminating from the Stanford MLSys Seminar Series here.

Talk Link: https://www.youtube.com/watch?v=DB7oOZ5hyrE

  1. Model Parallelism: Used when the model is too large to fit on a single GPU. The model itself is split across multiple GPUs or machines.
  2. Data Parallelism: Used when the dataset is too large to process on a single GPU. The data is split across multiple machines, and each machine holds a full copy of the model.

Horovod is a system for data parallelism.

In the parameter server approach, workers share their gradients with a parameter server, which updates the weights synchronously. The updated weights are then sent back to the workers for the next round of gradient descent.
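A minimal sketch of one synchronous parameter server round, assuming the server simply averages the workers' gradients before taking a plain SGD step. The function names and the toy numbers are illustrative and not from the talk.

```python
from typing import List


def average_gradients(worker_grads: List[List[float]]) -> List[float]:
    """Average the gradient vectors reported by all workers, parameter by parameter."""
    num_workers = len(worker_grads)
    return [sum(per_param) / num_workers for per_param in zip(*worker_grads)]


def parameter_server_step(weights: List[float],
                          worker_grads: List[List[float]],
                          lr: float) -> List[float]:
    """One synchronous round: wait for every worker, average, then update the weights."""
    avg = average_gradients(worker_grads)
    return [w - lr * g for w, g in zip(weights, avg)]


weights = [1.0, -2.0]
worker_grads = [[0.5, 1.0], [1.5, 3.0]]  # one gradient vector per worker
# The returned weights are what the server would send back to every worker.
weights = parameter_server_step(weights, worker_grads, lr=0.1)
```

Because the server waits for every worker before updating, the slowest worker sets the pace of each round.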

Pros

Cons

Horovod

Data parallel framework for distributed deep learning.

API and Architecture

GPU Pinning

State Synchronization

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

Learning Rate Scaling

Distributed training can be accomplished with a few minor changes to an existing Keras script.
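The changes named in the headings above (GPU pinning, state synchronization, learning rate scaling) can be sketched roughly as follows. This is a sketch under assumptions rather than the script from the talk: `build_model`, `dataset`, the base learning rate and the loss are hypothetical placeholders, and the function needs `tensorflow` and `horovod` installed to run. Scaling the learning rate linearly with the number of workers is the commonly used rule and is assumed here.

```python
def scaled_learning_rate(base_lr: float, num_workers: int) -> float:
    """Linear scaling: the effective batch grows with the worker count, so the rate does too."""
    return base_lr * num_workers


def train_with_horovod(build_model, dataset, base_lr=0.01, epochs=1):
    """Sketch of the usual Horovod additions to a Keras training script."""
    # Imported lazily so the helper above can be used without Horovod installed.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # GPU pinning: each process only sees the GPU that matches its local rank.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Build the model only after pinning, so it lands on the pinned GPU.
    model = build_model()

    # Learning rate scaling, then wrap the optimizer so gradients are averaged across workers.
    opt = tf.keras.optimizers.SGD(learning_rate=scaled_learning_rate(base_lr, hvd.size()))
    opt = hvd.DistributedOptimizer(opt)
    model.compile(optimizer=opt, loss="binary_crossentropy")

    # State synchronization: rank 0 broadcasts its initial variables to all workers.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(dataset, epochs=epochs, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)
    return model
```

A script like this is typically launched with one process per GPU, for example through `horovodrun`.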

Horovod Architecture

Horovod on Spark


Figure 1: A Typical Deep Learning pipeline

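import tensorflow as tf
from tensorflow import keras
from pyspark.ml import Pipeline
import horovod.spark.keras as hvd

# Small binary classifier; Sequential takes its layers as a list.
model = keras.models.Sequential([
    keras.layers.Dense(8, input_dim=2),
    keras.layers.Activation('tanh'),
    keras.layers.Dense(1),
    keras.layers.Activation('sigmoid'),
])

optimizer = keras.optimizers.SGD(learning_rate=0.1)
loss = 'binary_crossentropy'

# Arguments are passed by keyword. A real run also needs settings such as
# store, feature_cols and label_cols, which are omitted here.
keras_estimator = hvd.KerasEstimator(model=model, optimizer=optimizer, loss=loss)

# train_df and test_df are Spark DataFrames; the other pipeline stages are elided.
pipeline = Pipeline(stages=[..., keras_estimator, ...])
trained_pipeline = pipeline.fit(train_df)
pred_df = trained_pipeline.transform(test_df)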

Figure 2: Deep Learning in Spark 3.0 Cluster

Deep Learning in Spark with Horovod and Petastorm

Horovod on Ray

Elastic Horovod

Future of Horovod

Hybrid Parallelism

Distributed Hyperparameter Search

Unified End to End Infrastructure

Advice when using Horovod

Talk Link: https://www.youtube.com/watch?v=Y4fcSwsNqoE

Chip Floor Planning Problem

Chip Floor Planning with RL

Objective Function

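$$J(\theta, G) = \frac{1}{K} \sum_{g \sim G} \mathbb{E}_{g,\, p \sim \pi_\theta}\left[ R_{p,g} \right]$$

where:

$G$: Set of training graphs
$K$: Size of training set
$\pi_\theta$: RL policy parameterized by $\theta$
$R_{p,g}$: Reward corresponding to placement $p$ on graph $g$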

Results

Reward Model Architecture

Edge property focused Graph CNNs were used.

Edge Based Convolutions

  1. Get node embeddings by passing node properties through a fully connected layer.
  2. Get edge features by concatenating the embeddings of the two nodes that the edge connects.
  3. Get edge embeddings by passing the edge features through a fully connected layer.
  4. Propagate: get a new representation of each node by taking the mean of the embeddings of the edges it participates in.
  5. Repeat steps 2 to 4. Seven iterations give good results, so each node is influenced by its 7-hop neighborhood.
  6. Take the mean over all edge embeddings in the graph to get a representation of the entire graph.
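The edge-based convolution steps can be sketched in plain Python as below. This is an illustration under assumptions, not the paper's implementation: the fully connected layers are plain linear maps without activations, the layer weights are passed in rather than learned, and the graph representation uses the edge embeddings from the last iteration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]
Layer = Tuple[Matrix, Vector]  # (weights with one row per output unit, bias)


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer without activation."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def edge_graph_embedding(node_feats: List[Vector],
                         edges: List[Tuple[int, int]],
                         node_layer: Layer,
                         edge_layer: Layer,
                         iterations: int = 7):
    if iterations < 1 or not edges:
        raise ValueError("need at least one iteration and one edge")
    # Node embeddings from node properties.
    nodes = [dense(x, *node_layer) for x in node_feats]
    for _ in range(iterations):
        # Edge feature = concatenated endpoint embeddings, then a dense layer.
        edge_emb = [dense(nodes[u] + nodes[v], *edge_layer) for u, v in edges]
        # Propagate: each node becomes the mean of its incident edge embeddings.
        updated = []
        for i, old in enumerate(nodes):
            incident = [e for e, (u, v) in zip(edge_emb, edges) if i in (u, v)]
            updated.append(mean(incident) if incident else old)
        nodes = updated
    # Graph representation: mean over all edge embeddings.
    return nodes, mean(edge_emb)


# Toy path graph 0 - 1 - 2 with one-dimensional node properties.
node_layer = ([[1.0]], [0.0])       # identity
edge_layer = ([[0.5, 0.5]], [0.0])  # averages the two endpoint embeddings
nodes, graph = edge_graph_embedding([[0.0], [2.0], [4.0]], [(0, 1), (1, 2)],
                                    node_layer, edge_layer, iterations=1)
```

For the loop to repeat, the edge layer has to map a concatenated pair of node embeddings back to the node embedding size.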

Policy/Value Model Architecture

Other Points

Paper: https://arxiv.org/pdf/2004.10746.pdf

Workflow and SetUp

Goal is to train an ML model across multiple remote devices.

Heterogeneity Considerations

Federated Averaging

Train a model locally on each device and share it with a central server. The server averages the models and sends the result back to the devices for additional local training.
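A toy sketch of this loop, assuming a one-parameter model `y = w * x` trained with squared loss and an average weighted by each device's number of samples. The local trainer, the data and the hyperparameters are all made up for illustration.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs held on one device


def local_train(w: float, data: Dataset, lr: float, epochs: int) -> float:
    """Plain SGD on the device's own data, starting from the global weight."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2.0 * (w * x - y) * x
    return w


def federated_averaging(w_global: float, devices: List[Dataset], rounds: int,
                        lr: float = 0.05, epochs: int = 5) -> float:
    total = sum(len(d) for d in devices)
    for _ in range(rounds):
        # Each device trains locally, starting from the current global model.
        local_models = [local_train(w_global, d, lr, epochs) for d in devices]
        # The server averages the local models, weighted by sample count.
        w_global = sum(len(d) * w for d, w in zip(devices, local_models)) / total
    return w_global


devices = [[(1.0, 3.0), (2.0, 6.0)], [(1.0, 3.0)]]  # all data follows y = 3x
w = federated_averaging(0.0, devices, rounds=20)
```

Only model weights travel between the devices and the server in this sketch; the raw data stays on each device.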

FedProx

Each device $k$ approximately minimizes a local objective of the form

$$\min_w \; h_k(w; w^t) = F_k(w) + \frac{\mu}{2} \lVert w - w^t \rVert^2$$

The first term, $F_k(w)$, is the loss function on the local weights on device $k$.

The second term is the proximal term. It limits the impact of heterogeneous local updates by keeping each local update close to the global weights $w^t$.

This approach also incorporates partial work completed by stragglers.

This approach converges despite non-IID data, local updating, and partial participation.

Key result: FedProx with the proximal term improves test accuracy by 22% on average.
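A small sketch of a FedProx-style local update, assuming the local objective is the device loss plus a proximal penalty `(mu / 2) * (w - w_global)**2` on a one-parameter model `y = w * x`. The model, data, step count and `mu` values are illustrative only.

```python
from typing import List, Tuple


def fedprox_local_update(w_global: float, data: List[Tuple[float, float]],
                         mu: float, lr: float = 0.05, steps: int = 200) -> float:
    """Gradient descent on: mean squared error + (mu / 2) * (w - w_global)**2."""
    w = w_global
    for _ in range(steps):
        loss_grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        prox_grad = mu * (w - w_global)  # pulls w back toward the global weight
        w -= lr * (loss_grad + prox_grad)
    return w


data = [(1.0, 4.0)]  # this device alone would prefer w = 4
w_no_prox = fedprox_local_update(0.0, data, mu=0.0)
w_prox = fedprox_local_update(0.0, data, mu=2.0)
```

With `mu = 0` the device moves all the way to its own optimum, while a positive `mu` keeps it closer to the global weight. A straggler can run fewer `steps` and still return a usable partial update.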

Fairness

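A modified objective (q-FFL) is given below.

$$\min_w \; f_q(w) = \sum_{k=1}^{m} \frac{p_k}{q+1} F_k^{q+1}(w)$$

Here $F_k$ is the loss on device $k$, $p_k$ is the weight of device $k$, and $q$ controls how strongly devices with high loss are emphasized.

If $q = 0$, you get the traditional risk minimization objective.

If $q \to \infty$, you get minimax fairness.

Increasing $q$ reduces the variance of the accuracy distribution across devices and increases fairness. This approach can cut the variance in half while maintaining the overall average accuracy.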
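To see the effect of the fairness parameter, written `q` here, the sketch below assumes the q-FFL objective has the form `sum_k p_k / (q + 1) * F_k ** (q + 1)`, where `F_k` is the loss on device `k` and `p_k` its weight. Differentiating that form weights each device's gradient in proportion to `p_k * F_k ** q`, which is what the second function computes. The loss values are made up.

```python
from typing import List


def qffl_objective(losses: List[float], probs: List[float], q: float) -> float:
    """Assumed q-FFL objective: sum_k p_k / (q + 1) * F_k ** (q + 1)."""
    return sum(p * loss ** (q + 1) / (q + 1) for p, loss in zip(probs, losses))


def qffl_device_weights(losses: List[float], probs: List[float], q: float) -> List[float]:
    """Normalized weight of each device's gradient, proportional to p_k * F_k ** q."""
    raw = [p * loss ** q for p, loss in zip(probs, losses)]
    total = sum(raw)
    return [r / total for r in raw]


losses = [1.0, 2.0]  # the second device is doing worse
probs = [0.5, 0.5]
weights_q0 = qffl_device_weights(losses, probs, q=0)
weights_q1 = qffl_device_weights(losses, probs, q=1)
```

At `q = 0` both devices count equally. At `q = 1` the device with the higher loss receives two thirds of the weight, which is how larger `q` pushes the accuracy distribution toward uniformity.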

Personalization

Benchmark dataset for federated learning: LEAF

Current State

ML Data Challenges

  1. Building feature pipelines

    • Feature engineering is time consuming
    • Requires different technologies for different production requirements (distributed compute, stream processing, low-latency transformation)
    • Reliable computation and backfilling of features requires a large investment
  2. Consistent data access

    • Redevelopment of pipelines leads to inconsistencies in data
    • Training-serving skew degrades model performance
    • Models need a point-in-time correct view of data to avoid label leakage (especially for time series)
  3. Duplication of effort

    • Siloed development
    • No means of collaborating on or sharing feature pipelines
    • Lack of governance and standardization
  4. Ensuring data quality

    • Is the model receiving the right data and still operating correctly?
    • Are features fresh?
    • Has there been drift in the data over time?
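The point-in-time requirement mentioned above can be illustrated with a small lookup: for each label event, use only the latest feature value recorded at or before the event time. This is a sketch with made-up data and hypothetical function names, not the API of any particular feature store.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def point_in_time_value(feature_history: List[Tuple[int, float]],
                        event_time: int) -> Optional[float]:
    """Latest feature value at or before event_time; history is sorted by timestamp."""
    times = [t for t, _ in feature_history]
    idx = bisect_right(times, event_time)
    return feature_history[idx - 1][1] if idx else None


def build_training_rows(label_events: List[Tuple[int, int]],
                        feature_history: List[Tuple[int, float]]):
    """Join each (event_time, label) with the feature value that was known at that time."""
    return [(t, point_in_time_value(feature_history, t), label)
            for t, label in label_events]


feature_history = [(1, 10.0), (5, 20.0), (9, 30.0)]  # (timestamp, value)
label_events = [(0, 0), (5, 1), (8, 0)]              # (event_time, label)
rows = build_training_rows(label_events, feature_history)
```

The event at time 8 gets the value written at time 5, not the one written at time 9, so no information from the future leaks into the training row.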

Solution with Feature Stores

  1. Easy pipeline creation

    • Write feature definitions in SQL
    • Register in feature store specifying online or offline computation
    • Feature is computed and populated at required schedule
  2. Consistent data access

    • Common serving API to access data for both training and serving
  3. Cataloging and Discovery

    • Can browse a library of features, see how many teams use each one, read its documentation, etc.
  4. Data quality monitoring

    • Can produce statistics of data over time
    • Integrates with packages like Great Expectations
    • Supports features as code, including version control and CI/CD integration

Deployment Patterns

  1. Offline feature serving

    • Suitable for use cases like Pricing, Risk, Churn prediction where jobs are run periodically or in an ad-hoc fashion
  2. Online feature serving

    • Suitable for low latency use cases like recommendations and personalization
  3. Online feature computation

    • Real time / on demand feature transformations are supported
    • Model is triggered when a transaction event occurs
    • The metadata within the transaction is often required to derive features synchronously rather than fetch something that has been pre-computed.
    • The model serving layer sends the transaction metadata to the feature store. The feature store combines pre-computed streaming data, batch data, the transaction metadata, and vendor API responses, produces new features, and returns them to the model serving layer.
    • Vendor APIs are called from the feature store so that all of this data can be logged.
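The online feature computation flow can be sketched as a single function on the feature store side. Everything here is hypothetical: the field names, the `avg_amount_30d` feature, the vendor response shape and the in-memory log stand in for whatever a real feature store would provide.

```python
from typing import Callable, Dict, List


def compute_online_features(txn: Dict,
                            precomputed: Dict[str, Dict[str, float]],
                            call_vendor_api: Callable[[Dict], Dict],
                            log: List[Dict]) -> Dict:
    """Combine pre-computed features, transaction metadata and a vendor response."""
    # The vendor call happens inside the feature store so the response can be logged.
    vendor = call_vendor_api(txn)
    log.append({"txn_id": txn["id"], "vendor_response": vendor})

    # Start from the batch or streaming features already stored for this user.
    features = dict(precomputed.get(txn["user_id"], {}))

    # Derive a new feature synchronously from the transaction metadata.
    avg = features.get("avg_amount_30d")
    features["amount_vs_avg"] = txn["amount"] / avg if avg else None
    features["vendor_risk_score"] = vendor["risk_score"]
    return features


log: List[Dict] = []
txn = {"id": "t1", "user_id": "u1", "amount": 50.0}
precomputed = {"u1": {"avg_amount_30d": 25.0}}
features = compute_online_features(txn, precomputed, lambda t: {"risk_score": 0.2}, log)
```

The model serving layer would pass `txn` in and receive `features` back, while `log` keeps a record of what the vendor returned for that transaction.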

Feature Stores in A Modern Data Stack

  1. Greater abstraction, meaning we don’t need separate modules for an online store, an offline store, and data processing
    • dbt allows you to write ELT-type queries
  2. ML engineers create a basic model and data scientists optimize it further. This contrasts with the traditional approach, in which data scientists develop models locally and ML engineers rewrite and deploy them.

Other Points

Active Learning

Core Set Selection

Active Search

Similarity Search for Efficient Active Learning and Search (SEALS)

Selection Criteria for Samples
