
sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework

https://elasticdl.org

MIT License


ElasticFlow implementation tasks #16

Closed zou000 closed 5 years ago

zou000 commented 5 years ago

ParameterServer

RPC interface
- [ ] service definition
- [ ] SparseTensor support
RPC implementation
- [ ] TF/PyTorch tensor to/from Tensor proto bidirectional converters
- [ ] multi-threaded Server
- [ ] Client
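
A minimal sketch of the bidirectional conversion, using only the standard library. `TensorMessage` is a hypothetical stand-in for the Tensor proto (the real service definition is still an open task above), and the tensor is represented as a flat list of floats plus a shape. With TF or PyTorch, the same round trip would presumably go through NumPy arrays rather than Python lists.

```python
import struct
from dataclasses import dataclass
from math import prod
from typing import List, Sequence


@dataclass
class TensorMessage:
    """Hypothetical stand-in for a Tensor proto: a shape plus raw float32 bytes."""
    shape: List[int]
    content: bytes  # little-endian float32 values in row-major order


def to_message(values: Sequence[float], shape: Sequence[int]) -> TensorMessage:
    """Pack a flat list of values into a TensorMessage."""
    if prod(shape) != len(values):
        raise ValueError("shape does not match the number of values")
    return TensorMessage(list(shape), struct.pack(f"<{len(values)}f", *values))


def from_message(msg: TensorMessage) -> List[float]:
    """Unpack a TensorMessage back into a flat list of values."""
    count = prod(msg.shape)
    return list(struct.unpack(f"<{count}f", msg.content))
```

Keeping the payload as raw bytes avoids one proto field per element, which matters for large dense tensors. SparseTensor support would need an extra indices field, which this sketch leaves out.
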
PS implementation
- Graph gradient update
- large model support: model partition
- large lookup table support (e.g. use redis)
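
One possible shape for the gradient update combined with model partitioning, as a rough sketch: parameters are assigned to PS shards by a stable hash of their name, and each shard applies plain SGD to the parameters it owns. The class names, the CRC32-based placement, and the SGD rule are all assumptions for illustration, not a decided design.

```python
import zlib
from typing import Dict, List, Sequence


def shard_for(name: str, num_shards: int) -> int:
    """Map a parameter name to a shard index with a hash that is stable across processes."""
    return zlib.crc32(name.encode("utf-8")) % num_shards


class ParameterShard:
    """Holds one partition of the model and applies SGD updates to it."""

    def __init__(self, lr: float) -> None:
        self.lr = lr
        self.params: Dict[str, List[float]] = {}

    def apply_gradient(self, name: str, grad: Sequence[float]) -> None:
        current = self.params[name]
        if len(grad) != len(current):
            raise ValueError("gradient size does not match parameter size")
        self.params[name] = [w - self.lr * g for w, g in zip(current, grad)]


class PartitionedPS:
    """Routes each parameter to the shard that owns it."""

    def __init__(self, num_shards: int, lr: float = 0.1) -> None:
        self.shards = [ParameterShard(lr) for _ in range(num_shards)]

    def _owner(self, name: str) -> ParameterShard:
        return self.shards[shard_for(name, len(self.shards))]

    def init(self, name: str, values: Sequence[float]) -> None:
        self._owner(name).params[name] = list(values)

    def push_gradient(self, name: str, grad: Sequence[float]) -> None:
        self._owner(name).apply_gradient(name, grad)

    def pull(self, name: str) -> List[float]:
        return list(self._owner(name).params[name])
```

Partitioning by whole parameter keeps the routing trivial, but a single very large tensor or lookup table would still land on one shard, so that case would need a finer-grained split.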

Data API

data fetcher: need a unified way to feed data.
- ODPS support for TF, PyTorch
  - tf.data.Dataset and torch.utils.data.Dataset
- Other data sources?
data sharding
- sharding methods
- shard tracking: re-queue shard when worker dies
  - use etcd?
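
The shard tracking item above can be sketched as an in-memory queue: shards are record ranges, each worker holds at most one shard, and a shard held by a dead worker goes back on the queue. `ShardTracker` and its method names are hypothetical. If etcd were used, the same state would presumably live there so that it survives a restart, but that part is not shown.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

Shard = Tuple[int, int]  # half-open record range [start, end)


def make_shards(num_records: int, shard_size: int) -> List[Shard]:
    """Split num_records into consecutive ranges of at most shard_size records."""
    return [(start, min(start + shard_size, num_records))
            for start in range(0, num_records, shard_size)]


class ShardTracker:
    """Tracks which shards are pending and which worker holds which shard."""

    def __init__(self, shards: List[Shard]) -> None:
        self._todo: Deque[Shard] = deque(shards)
        self._doing: Dict[str, Shard] = {}

    def get(self, worker_id: str) -> Optional[Shard]:
        """Hand the next pending shard to a worker, or None if nothing is pending."""
        if worker_id in self._doing:
            raise RuntimeError(f"{worker_id} already holds a shard")
        if not self._todo:
            return None
        shard = self._todo.popleft()
        self._doing[worker_id] = shard
        return shard

    def report_done(self, worker_id: str) -> None:
        """Mark the worker's current shard as finished."""
        self._doing.pop(worker_id)

    def report_dead(self, worker_id: str) -> None:
        """Re-queue the shard held by a worker that died, if it held one."""
        shard = self._doing.pop(worker_id, None)
        if shard is not None:
            self._todo.append(shard)

    def finished(self) -> bool:
        return not self._todo and not self._doing
```

How a dead worker is detected (heartbeats, Kubernetes pod events, an etcd lease) is left open here; the tracker only needs someone to call `report_dead`.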

Worker with PS client/data fetcher/etcd client

PyTorch worker
Estimator Worker
- The current POC crashes when the iteration count is high; this needs to be debugged first
Other TF workers

Master

divide work, wait for workers to finish

Job launcher

based on user code, decide PS and worker to launch
local launcher (for testing)
kubernetes launcher
- on-premises
- AWS
- GKE

Elasticity, fault tolerance support

periodic checkpoint/model saving (to file? redis?)
initial model loading (from file? redis? PS?)
PS scaling
- partitioned model support
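
For the file-based option, periodic checkpointing and initial model loading could look roughly like the sketch below. The JSON layout, the `ckpt-<step>.json` naming, and the function names are assumptions made for illustration. Redis or the PS as a checkpoint store would need a different backend behind the same two operations.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional


def save_checkpoint(directory: str, step: int, params: Dict[str, List[float]]) -> str:
    """Write a checkpoint to a temp file, then rename it so readers never see a partial file."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"ckpt-{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "params": params}, f)
    os.replace(tmp_path, final_path)
    return final_path


def maybe_checkpoint(directory: str, step: int, params: Dict[str, List[float]],
                     every: int) -> Optional[str]:
    """Save a checkpoint only on every `every`-th step."""
    if step % every == 0:
        return save_checkpoint(directory, step, params)
    return None


def load_latest_checkpoint(directory: str) -> Optional[dict]:
    """Return the checkpoint with the highest step, or None if there is none."""
    if not os.path.isdir(directory):
        return None
    names = sorted(n for n in os.listdir(directory)
                   if n.startswith("ckpt-") and n.endswith(".json"))
    if not names:
        return None
    with open(os.path.join(directory, names[-1])) as f:
        return json.load(f)
```

The zero-padded step in the file name makes lexicographic order match step order, so finding the latest checkpoint needs no parsing. With a partitioned model, each PS shard would presumably write its own file per step.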

Kubernetes controller

resource tracking
scaling

wangkuiyi commented 5 years ago

Thanks to @zou000 for this plan!

My two cents:

For offline training, ElasticFlow needs to be able to read from files on (distributed) filesystems. The file format must support partitioning in order to enable distributed training. The motivation is explained in more detail in https://github.com/wangkuiyi/recordio/blob/master/README.md.
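
To illustrate what "supports partitioning" could mean in practice, here is a toy chunked record format: records are length-prefixed and grouped into chunks, and the writer returns the byte offset of every chunk so that each worker can seek directly to its own chunks. This is a simplified sketch under assumed names, not the actual RecordIO on-disk layout.

```python
import io
import struct
from typing import BinaryIO, List


def write_chunks(f: BinaryIO, records: List[bytes], records_per_chunk: int) -> List[int]:
    """Write records in chunks and return the byte offset at which each chunk starts."""
    offsets = []
    for i in range(0, len(records), records_per_chunk):
        chunk = records[i:i + records_per_chunk]
        offsets.append(f.tell())
        f.write(struct.pack("<I", len(chunk)))    # chunk header: record count
        for record in chunk:
            f.write(struct.pack("<I", len(record)))  # record header: byte length
            f.write(record)
    return offsets


def read_chunk(f: BinaryIO, offset: int) -> List[bytes]:
    """Seek to a chunk and read all of its records."""
    f.seek(offset)
    (count,) = struct.unpack("<I", f.read(4))
    records = []
    for _ in range(count):
        (size,) = struct.unpack("<I", f.read(4))
        records.append(f.read(size))
    return records


def chunks_for_worker(offsets: List[int], worker_index: int, num_workers: int) -> List[int]:
    """Assign chunks to workers round-robin by chunk position."""
    return offsets[worker_index::num_workers]
```

Because a worker only needs the chunk offsets, it never has to scan the records that belong to other workers, which is the property a plain line-oriented file lacks.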