Simply speaking, TonY's implementation principle is to obtain the resources needed for training from Hadoop YARN, such as machine nodes, memory, and CPU/GPU, and then launch a TonY task executor on each allocated container. Distributed TensorFlow training requires the TF_CONFIG environment variable to be set for each role (refer to https://www.tensorflow.org/guide/distributed_training), so once the TonY task executors are up, TonY provides the TF_CONFIG for each TensorFlow role.
Back to this issue: why does it run like independent tasks? I think your TensorFlow training code does not support distributed training, so I recommend using the tf.estimator or Keras API with a distribution strategy, as in the sketch below.
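For example, here is a minimal sketch (assuming TF 2.x and a Keras model roughly like the one in the basic text classification tutorial; not the exact code from this issue) of how the model could pick up TonY's TF_CONFIG via `tf.distribute.MultiWorkerMirroredStrategy`:

```python
import tensorflow as tf

# MultiWorkerMirroredStrategy reads the TF_CONFIG environment variable that
# TonY sets in each task executor, so no manual cluster spec is needed here.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Build and compile the model inside the strategy scope so that variables
# are created as mirrored variables across the workers.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(10000, 16),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# train_ds / val_ds are the tf.data pipelines built from the HDFS data.
# With the strategy in place, the workers train one model cooperatively
# instead of each running an independent copy of the job.
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```

Without a strategy like this, each executor just runs the plain single-node script, which matches the behavior described in the worker logs below.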
Thank you. I've learned.
version:
I am using the TensorFlow example called Basic text classification. I downloaded the example data, unzipped it, and uploaded it to HDFS.
In the training code, I use tensorflow_io to read the data from HDFS; it looks like this:
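For illustration, a minimal sketch (the HDFS path below is a placeholder, not the actual path used in this job) of reading the aclImdb text files from HDFS with tensorflow_io:

```python
import tensorflow as tf
import tensorflow_io as tfio  # noqa: F401  # importing registers the hdfs:// filesystem scheme

# Placeholder HDFS path; replace with the real namenode address and directory.
DATA_DIR = "hdfs://namenode:8020/user/me/aclImdb/train"

def load_labeled_files(pattern, label):
    """Read every text file matching `pattern` and pair it with `label`."""
    files = tf.data.Dataset.list_files(pattern)
    texts = files.map(tf.io.read_file, num_parallel_calls=tf.data.AUTOTUNE)
    return texts.map(lambda text: (text, label))

train_ds = (
    load_labeled_files(DATA_DIR + "/pos/*.txt", 1)
    .concatenate(load_labeled_files(DATA_DIR + "/neg/*.txt", 0))
    .shuffle(25000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```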
Then I created the TonY configuration file:
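For reference, a sketch of what the tony.xml could contain (the values are illustrative assumptions, not the exact file, with three workers to match the containers below):

```xml
<configuration>
  <!-- Number of worker tasks YARN should allocate (illustrative value). -->
  <property>
    <name>tony.worker.instances</name>
    <value>3</value>
  </property>
  <!-- Memory per worker container (illustrative value). -->
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
</configuration>
```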
Finally, I ran the job:
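Roughly like this (the jar path, script name, and flag names are assumptions based on the TonY README, not the exact command used here):

```sh
java -cp "`hadoop classpath --glob`:/path/to/tony-cli-x.y.z-all.jar" \
  com.linkedin.tony.cli.ClusterSubmitter \
  --python_venv=/path/to/venv.zip \
  --src_dir=/path/to/src \
  --executes=text_classification.py \
  --conf_file=/path/to/tony.xml \
  --python_binary_path=venv/bin/python
```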
The job ran successfully.
I checked the logs of the three worker containers (container_e75_1623855961871_1205_01_000002, container_e75_1623855961871_1205_01_000003, container_e75_1623855961871_1205_01_000004) and noticed that all three workers do the same work: each reads the file from HDFS, trains on the data, and produces different evaluation results. It does not look like distributed training as a Hadoop application; it is more like the same single-node job running separately in each worker container.
container_e75_1623855961871_1205_01_000002 log:
container_e75_1623855961871_1205_01_000003 log:
container_e75_1623855961871_1205_01_000004 log:
Am I misunderstanding something, or is there a configuration error causing this? How can I make it do distributed training? Thank you.