sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License

ElasticFlow implementation tasks #16

Closed: zou000 closed this issue 5 years ago

zou000 commented 5 years ago

ParameterServer

  1. RPC interface

    • [ ] service definition
    • [ ] SparseTensor support
  2. RPC implementation

    • [ ] bidirectional converters between TF/PyTorch tensors and the Tensor proto
    • [ ] multi-threaded Server
    • [ ] Client
  3. PS implementation

    • Graph gradient update
    • large model support: model partition
    • large lookup table support (e.g. use Redis)
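
The "model partition" and "gradient update" items above could look roughly like the following. This is only a minimal in-process sketch, not ElasticFlow's design: the class names `ParameterShard` and `PartitionedPS`, the CRC32-based placement, and plain SGD are all assumptions made for illustration, and the RPC layer is left out entirely.

```python
import zlib
from typing import Dict, List


class ParameterShard:
    """One hypothetical PS instance holding a subset of the parameters."""

    def __init__(self, learning_rate: float) -> None:
        self.learning_rate = learning_rate
        self.params: Dict[str, List[float]] = {}

    def init_param(self, name: str, values: List[float]) -> None:
        self.params[name] = list(values)

    def apply_gradient(self, name: str, grad: List[float]) -> None:
        param = self.params[name]
        if len(param) != len(grad):
            raise ValueError(f"shape mismatch for {name}")
        # Plain SGD step (an assumption; any optimizer could go here).
        for i, g in enumerate(grad):
            param[i] -= self.learning_rate * g


class PartitionedPS:
    """Routes each named parameter to exactly one shard."""

    def __init__(self, num_shards: int, learning_rate: float = 0.5) -> None:
        self.shards = [ParameterShard(learning_rate) for _ in range(num_shards)]

    def shard_for(self, name: str) -> int:
        # crc32 is stable across processes, unlike the built-in hash().
        return zlib.crc32(name.encode("utf-8")) % len(self.shards)

    def init_param(self, name: str, values: List[float]) -> None:
        self.shards[self.shard_for(name)].init_param(name, values)

    def push_gradients(self, grads: Dict[str, List[float]]) -> None:
        for name, grad in grads.items():
            self.shards[self.shard_for(name)].apply_gradient(name, grad)

    def pull(self, names: List[str]) -> Dict[str, List[float]]:
        # Return copies so callers cannot mutate shard state.
        return {n: list(self.shards[self.shard_for(n)].params[n]) for n in names}


ps = PartitionedPS(num_shards=2)
ps.init_param("dense/w", [1.0, 2.0])
ps.push_gradients({"dense/w": [1.0, -1.0]})
print(ps.pull(["dense/w"]))
```

Partitioning by parameter name keeps each update local to one shard; a single tensor too large for one shard would need row-level partitioning instead, which this sketch does not cover.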

Data API

  1. data fetcher: need a unified way to feed data.
    • ODPS support for TF, PyTorch
      • tf.data.Dataset and torch.utils.data.Dataset
    • Other data source?
  2. data sharding
    • sharding methods
    • shard tracking: re-queue shard when worker dies
      • use etcd?
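
As an illustration of the shard-tracking item above, here is a minimal sketch of re-queuing shards when a worker dies. It keeps all state in process memory, whereas the plan suggests something like etcd; `make_shards`, `ShardTracker`, and the fixed-size range sharding are hypothetical names and choices, not part of the plan.

```python
import collections
from typing import Deque, Dict, List, Optional, Tuple

Shard = Tuple[int, int]  # half-open record range [start, end)


def make_shards(num_records: int, shard_size: int) -> List[Shard]:
    """Split record indices into fixed-size ranges (one possible sharding method)."""
    return [
        (start, min(start + shard_size, num_records))
        for start in range(0, num_records, shard_size)
    ]


class ShardTracker:
    """Tracks which worker holds which shard and re-queues on worker failure."""

    def __init__(self, shards: List[Shard]) -> None:
        self._todo: Deque[Shard] = collections.deque(shards)
        self._doing: Dict[str, List[Shard]] = {}

    def get(self, worker_id: str) -> Optional[Shard]:
        """Hand the next pending shard to a worker, or None if none is pending."""
        if not self._todo:
            return None
        shard = self._todo.popleft()
        self._doing.setdefault(worker_id, []).append(shard)
        return shard

    def report_done(self, worker_id: str, shard: Shard) -> None:
        self._doing[worker_id].remove(shard)

    def worker_died(self, worker_id: str) -> None:
        # Put every shard the dead worker was holding back on the queue.
        for shard in self._doing.pop(worker_id, []):
            self._todo.append(shard)

    def finished(self) -> bool:
        return not self._todo and not any(self._doing.values())


tracker = ShardTracker(make_shards(num_records=10, shard_size=4))
first = tracker.get("worker-0")
tracker.worker_died("worker-0")  # the shard held by worker-0 is re-queued
print(first, tracker.finished())
```

How the master detects that a worker died (heartbeats, Kubernetes pod events, lease expiry) is outside this sketch.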

Worker with PS client/data fetcher/etcd client

  1. PyTorch worker
  2. Estimator Worker
    • The current POC crashes when the iteration count is high; this needs debugging first
  3. Other TF workers

Master

Job launcher

  1. based on user code, decide which PS and workers to launch
  2. local launcher (for testing)
  3. kubernetes launcher
    • on-premises
    • AWS
    • GKE

Elasticity, fault tolerance support

Kubernetes controller

  1. resource tracking
  2. scaling

wangkuiyi commented 5 years ago

Thanks to @zou000 for this plan!

A few cents from me:

  1. For offline training, ElasticFlow needs to be able to read files from (distributed) filesystems. The file format must support partitioning in order to enable distributed training. The motivation is explained in https://github.com/wangkuiyi/recordio/blob/master/README.md.