For offline training, ElasticFlow needs to read training data from files on (possibly distributed) filesystems. The file format must support partitioning, so that the data can be divided among workers for distributed training. The motivation is described in more detail at https://github.com/wangkuiyi/recordio/blob/master/README.md.
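To illustrate why a partitionable format matters, here is a minimal sketch of dividing a file's chunks among workers. It assumes the file is made of independently readable chunks addressed by index. The function name partition_chunks and the contiguous-range assignment are hypothetical choices made for illustration, not part of ElasticFlow or RecordIO.

```python
from typing import List, Tuple


def partition_chunks(num_chunks: int, num_workers: int) -> List[Tuple[int, int]]:
    """Split chunk indices [0, num_chunks) into contiguous half-open ranges.

    Hypothetical helper: returns one (start, end) range per worker. Range
    sizes differ by at most one chunk; a worker may get an empty range
    when there are fewer chunks than workers.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if num_chunks < 0:
        raise ValueError("num_chunks must be non-negative")

    base, extra = divmod(num_chunks, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers each take one additional chunk.
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for worker, (start, end) in enumerate(partition_chunks(10, 3)):
        print(f"worker {worker}: chunks [{start}, {end})")
```

With an elastic job the number of workers can change, so a real system would more likely hand out chunks dynamically (for example from the master) than fix the ranges up front. The static split above only shows that the format has to allow reading any chunk without scanning the whole file.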
ParameterServer
RPC interface
RPC implementation
PS implementation
Data API
tf.data.Dataset and torch.utils.data.Dataset
Worker with PS client/data fetcher/etcd client
Master
Job launcher
Elasticity, fault tolerance support
Kubernetes controller