tony-framework / TonY

TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
https://tony-project.ai

Confusion about running a TensorFlow job with TonY on Apache Hadoop #619

Closed: LanstonWu closed this issue 2 years ago

LanstonWu commented 2 years ago

Versions:

TensorFlow=2.7.0
Hadoop=3.1.0
Python=3.8
TonY=0.4.10

I'm using the TensorFlow example called "Basic text classification". I downloaded the example data, unzipped it, and uploaded it to HDFS:

$ hdfs dfs -ls hdfs://testcs/tmp/tensorflow/imdb/

Found 5 items
-rw-r-----   3 hive hdfs    1641221 2021-11-16 15:57 hdfs://testcs/tmp/tensorflow/imdb/imdb_word_index.json
-rw-r-----   3 hive hdfs   13865830 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/x_test.npy
-rw-r-----   3 hive hdfs   14394714 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/x_train.npy
-rw-r-----   3 hive hdfs     200080 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/y_test.npy
-rw-r-----   3 hive hdfs     200080 2021-11-16 15:26 hdfs://testcs/tmp/tensorflow/imdb/y_train.npy

In the training code, I use tensorflow_io to read the data from HDFS. It looks like this:

import numpy as np
import tensorflow as tf
import tensorflow_io  # registers the hdfs:// filesystem scheme with TensorFlow

# Load the training arrays directly from HDFS
with tf.io.gfile.GFile('hdfs://testcs/tmp/tensorflow/imdb/x_train.npy', mode='rb') as r:
    x_train = np.load(r, allow_pickle=True)

with tf.io.gfile.GFile('hdfs://testcs/tmp/tensorflow/imdb/y_train.npy', mode='rb') as r:
    labels_train = np.load(r, allow_pickle=True)

Then I created the TonY configuration file:

vi tony-test.xml

<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>3</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.application.security.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>tony.yarn.queue</name>
    <value>root.test01</value>
  </property>
</configuration>

Finally, I ran the job:

java -cp `hadoop classpath`:./tensorflow/tony-cli-0.4.10-uber.jar:job01/*:job01 com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=venv.zip \
--src_dir=job01/src \
--executes=models/movie_comm02.py \
--task_params="--data_dir hdfs://testcs/tmp/tensorflow/input --output_dir hdfs://testcs/tmp/tensorflow/output " \
--conf_file=tony-test.xml \
--python_binary_path=venv/bin/python

The job executed successfully:

21/11/17 11:46:26 INFO tony.TonyClient: TonY heartbeat interval [1000]
21/11/17 11:46:26 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
21/11/17 11:46:26 INFO tony.TonyClient: Starting client..
21/11/17 11:46:26 INFO client.RMProxy: Connecting to ResourceManager at test-hadoop01/192.168.1.10:8050
21/11/17 11:46:26 INFO client.AHSProxy: Connecting to Application History server at test-hadoop02/192.168.1.11:10200
21/11/17 11:46:26 INFO conf.Configuration: found resource resource-types.xml at file:/opt/hadoop/resource-types.xml
21/11/17 11:46:28 INFO tony.TonyClient: Running with secure cluster mode. Fetching delegation tokens..
21/11/17 11:46:28 INFO tony.TonyClient: Fetching RM delegation token..
21/11/17 11:46:28 INFO tony.TonyClient: RM delegation token fetched.
21/11/17 11:46:28 INFO tony.TonyClient: Fetching HDFS delegation tokens for default, history and other namenodes...
21/11/17 11:46:28 INFO hdfs.DFSClient: Created token for hive: HDFS_DELEGATION_TOKEN owner=hive/testcs@testcsKDC, renewer=yarn, realUser=, issueDate=1637120788205, maxDate=1637725588205, sequenceNumber=9640, masterKeyId=631 on ha-hdfs:testcs
21/11/17 11:46:28 INFO security.TokenCache: Got delegation token for hdfs://testcs; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:testcs, Ident: (token for hive: HDFS_DELEGATION_TOKEN owner=hive/testcs@testcsKDC, renewer=yarn, realUser=, issueDate=1637120788205, maxDate=1637725588205, sequenceNumber=9640, masterKeyId=631)
21/11/17 11:46:28 INFO tony.TonyClient: Fetched HDFS delegation token.
21/11/17 11:46:28 INFO tony.TonyClient: Successfully fetched tokens.
21/11/17 11:46:28 INFO tony.TonyClient: Completed setting up Application Master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.ApplicationMaster 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
21/11/17 11:46:28 INFO tony.TonyClient: Submitting YARN application
21/11/17 11:46:28 INFO impl.TimelineClientImpl: Timeline service address: null
21/11/17 11:46:28 INFO impl.YarnClientImpl: Submitted application application_1623855961871_1205
21/11/17 11:46:28 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://test-hadoop01:8088/proxy/application_1623855961871_1205/
21/11/17 11:46:28 INFO tony.TonyClient: ResourceManager web address for application: http://test-hadoop01:8088/cluster/app/application_1623855961871_1205
21/11/17 11:46:34 INFO tony.TonyClient: Driver (application master) log url: http://test-hadoop01:8042/node/containerlogs/container_e75_1623855961871_1205_01_000001/hive
21/11/17 11:46:34 INFO tony.TonyClient: AM host: test-hadoop01
21/11/17 11:46:34 INFO tony.TonyClient: AM RPC port: 33102
21/11/17 11:46:38 INFO tony.TonyClient: ------  Application (re)starts, status of ALL tasks ------
21/11/17 11:46:38 INFO tony.TonyClient: RUNNING, worker, 0, http://test-hadoop02:8042/node/containerlogs/container_e75_1623855961871_1205_01_000002/hive
21/11/17 11:46:38 INFO tony.TonyClient: RUNNING, worker, 1, http://test-hadoop02:8042/node/containerlogs/container_e75_1623855961871_1205_01_000003/hive
21/11/17 11:46:38 INFO tony.TonyClient: RUNNING, worker, 2, http://test-hadoop02:8042/node/containerlogs/container_e75_1623855961871_1205_01_000004/hive

21/11/17 11:48:00 INFO tony.TonyClient: ------ Task Status Updated ------
21/11/17 11:48:00 INFO tony.TonyClient: SUCCEEDED: worker [1, 2] 
21/11/17 11:48:00 INFO tony.TonyClient: RUNNING: worker [0] 
21/11/17 11:48:01 INFO tony.TonyClient: ------ Task Status Updated ------
21/11/17 11:48:01 INFO tony.TonyClient: SUCCEEDED: worker [0, 1, 2] 
21/11/17 11:48:05 INFO tony.TonyClient: -----  Application finished, status of ALL tasks -----
21/11/17 11:48:05 INFO tony.TonyClient: SUCCEEDED, worker, 0, http://test-hadoop02:8042/node/containerlogs/container_e75_1623855961871_1205_01_000002/hive
21/11/17 11:48:05 INFO tony.TonyClient: SUCCEEDED, worker, 1, http://test-hadoop02:8042/node/containerlogs/container_e75_1623855961871_1205_01_000003/hive
21/11/17 11:48:05 INFO tony.TonyClient: SUCCEEDED, worker, 2, http://test-hadoop02:8042/node/containerlogs/container_e75_1623855961871_1205_01_000004/hive
21/11/17 11:48:05 INFO tony.TonyClient: Application 1205 finished with YarnState=FINISHED, DSFinalStatus=SUCCEEDED, breaking monitoring loop.
21/11/17 11:48:05 INFO tony.TonyClient: Link for application_1623855961871_1205's events/metrics: https://localhost:19886/jobs/application_1623855961871_1205
21/11/17 11:48:05 INFO tony.TonyClient: Sending message to AM to stop.
21/11/17 11:48:05 INFO tony.TonyClient: Application completed successfully
21/11/17 11:48:05 INFO impl.YarnClientImpl: Killed application application_1623855961871_1205

I checked the logs of the three worker containers (container_e75_1623855961871_1205_01_000002, container_e75_1623855961871_1205_01_000003, container_e75_1623855961871_1205_01_000004) and noticed that all three workers do the same work: each one reads the files from HDFS, trains on the full dataset, and produces a slightly different evaluation result. This doesn't look like distributed training as a Hadoop application; it looks like three independent single-process runs in separate worker containers.

container_e75_1623855961871_1205_01_000002 log:

2021-11-17 11:46:41 INFO  Utils:765 - Unpacking src directory..
2021-11-17 11:46:41 INFO  Utils:179 - Unzipping tony_src_application_1623855961871_1205.zip to destination ./
2021-11-17 11:46:41 INFO  Utils:770 - Unpacking Python virtual environment.. 
2021-11-17 11:46:41 INFO  Utils:179 - Unzipping venv.zip to destination venv
2021-11-17 11:46:55 INFO  TaskExecutor:159 - Setting up application RPC client, connecting to: test-hadoop01:33102
2021-11-17 11:46:56 INFO  TaskExecutor:162 - Setting up metrics RPC client, connecting to: test-hadoop01:37956
2021-11-17 11:46:56 INFO  TaskMonitor:65 - Task pid is: 3420
2021-11-17 11:46:56 INFO  TaskMonitor:87 - Number of GPUs requested for worker: 0
2021-11-17 11:46:56 INFO  TaskExecutor:91 - Reserved rpcPort: 38513
2021-11-17 11:46:56 INFO  TaskExecutor:308 - TensorBoard address : test-hadoop02:42684
2021-11-17 11:46:56 INFO  Utils:139 - pollTillNonNull function finished within 60 seconds
2021-11-17 11:46:56 INFO  TaskExecutor:311 - Register TensorBoard response: SUCCEEDED
2021-11-17 11:46:56 INFO  TaskExecutor:98 - Reserved tbPort: 42684
2021-11-17 11:46:56 INFO  TaskExecutor:288 - ContainerId is: container_e75_1623855961871_1205_01_000002 HostName is: test-hadoop02
2021-11-17 11:46:56 INFO  TaskExecutor:294 - Connecting to test-hadoop01:33102 to register worker spec: worker 0 test-hadoop02:38513
2021-11-17 11:46:59 INFO  Utils:139 - pollTillNonNull function finished within 0 seconds
2021-11-17 11:46:59 INFO  TaskExecutor:183 - Successfully registered and got cluster spec: {"worker":["test-hadoop02:38513","test-hadoop02:42963","test-hadoop02:42187"]}
2021-11-17 11:46:59 INFO  TaskExecutor:206 - Releasing reserved RPC port before launching tensorflow process.
2021-11-17 11:46:59 INFO  TaskExecutor:211 - Releasing reserved TB port before launching tensorflow process.
2021-11-17 11:46:59 INFO  TFRuntime:46 - Setting up TensorFlow job...
....
782/782 - 1s - loss: 0.3304 - accuracy: 0.8724 - 553ms/epoch - 707us/step
evaluate results:[0.3303918242454529, 0.872439980506897]

container_e75_1623855961871_1205_01_000003 log:

2021-11-17 11:46:41 INFO  Utils:765 - Unpacking src directory..
2021-11-17 11:46:41 INFO  Utils:179 - Unzipping tony_src_application_1623855961871_1205.zip to destination ./
2021-11-17 11:46:41 INFO  Utils:770 - Unpacking Python virtual environment.. 
2021-11-17 11:46:41 INFO  Utils:179 - Unzipping venv.zip to destination venv
2021-11-17 11:46:55 INFO  TaskExecutor:159 - Setting up application RPC client, connecting to: test-hadoop01:33102
2021-11-17 11:46:55 INFO  TaskExecutor:162 - Setting up metrics RPC client, connecting to: test-hadoop01:37956
2021-11-17 11:46:55 INFO  TaskMonitor:65 - Task pid is: 3419
2021-11-17 11:46:55 INFO  TaskMonitor:87 - Number of GPUs requested for worker: 0
2021-11-17 11:46:55 INFO  TaskExecutor:91 - Reserved rpcPort: 42963
2021-11-17 11:46:55 INFO  TaskExecutor:288 - ContainerId is: container_e75_1623855961871_1205_01_000003 HostName is: test-hadoop02
2021-11-17 11:46:55 INFO  TaskExecutor:294 - Connecting to test-hadoop01:33102 to register worker spec: worker 1 test-hadoop02:42963
2021-11-17 11:46:58 INFO  Utils:139 - pollTillNonNull function finished within 0 seconds
2021-11-17 11:46:58 INFO  TaskExecutor:183 - Successfully registered and got cluster spec: {"worker":["test-hadoop02:38513","test-hadoop02:42963","test-hadoop02:42187"]}
2021-11-17 11:46:58 INFO  TaskExecutor:206 - Releasing reserved RPC port before launching tensorflow process.
2021-11-17 11:46:58 INFO  TaskExecutor:211 - Releasing reserved TB port before launching tensorflow process.
2021-11-17 11:46:58 INFO  TFRuntime:46 - Setting up TensorFlow job...
...
782/782 - 1s - loss: 0.3281 - accuracy: 0.8722 - 626ms/epoch - 801us/step
evaluate results:[0.3281101584434509, 0.872160017490387]

container_e75_1623855961871_1205_01_000004 log:

2021-11-17 11:46:41 INFO  Utils:765 - Unpacking src directory..
2021-11-17 11:46:41 INFO  Utils:179 - Unzipping tony_src_application_1623855961871_1205.zip to destination ./
2021-11-17 11:46:41 INFO  Utils:770 - Unpacking Python virtual environment.. 
2021-11-17 11:46:41 INFO  Utils:179 - Unzipping venv.zip to destination venv
2021-11-17 11:46:56 INFO  TaskExecutor:159 - Setting up application RPC client, connecting to: test-hadoop01:33102
2021-11-17 11:46:57 INFO  TaskExecutor:162 - Setting up metrics RPC client, connecting to: test-hadoop01:37956
2021-11-17 11:46:57 INFO  TaskMonitor:65 - Task pid is: 3433
2021-11-17 11:46:57 INFO  TaskMonitor:87 - Number of GPUs requested for worker: 0
2021-11-17 11:46:57 INFO  TaskExecutor:91 - Reserved rpcPort: 42187
2021-11-17 11:46:57 INFO  TaskExecutor:288 - ContainerId is: container_e75_1623855961871_1205_01_000004 HostName is: test-hadoop02
2021-11-17 11:46:57 INFO  TaskExecutor:294 - Connecting to test-hadoop01:33102 to register worker spec: worker 2 test-hadoop02:42187
2021-11-17 11:46:57 INFO  Utils:139 - pollTillNonNull function finished within 0 seconds
2021-11-17 11:46:57 INFO  TaskExecutor:183 - Successfully registered and got cluster spec: {"worker":["test-hadoop02:38513","test-hadoop02:42963","test-hadoop02:42187"]}
2021-11-17 11:46:57 INFO  TaskExecutor:206 - Releasing reserved RPC port before launching tensorflow process.
2021-11-17 11:46:57 INFO  TaskExecutor:211 - Releasing reserved TB port before launching tensorflow process.
2021-11-17 11:46:57 INFO  TFRuntime:46 - Setting up TensorFlow job...
...
782/782 - 1s - loss: 0.3355 - accuracy: 0.8717 - 596ms/epoch - 762us/step
evaluate results:[0.3355417251586914, 0.8716800212860107]

Am I misunderstanding something, or is there a configuration error causing this? How can I make it do distributed training? Thank you.

zuston commented 2 years ago

Simply speaking, TonY's implementation principle is to obtain the resources needed for training from Hadoop YARN, such as machine nodes, memory, and CPU/GPU, and then to launch a TonY task executor in each container once those resources are obtained.

Distributed TensorFlow training requires a TF_CONFIG environment variable for each role, describing the cluster and the task that role plays (refer to https://www.tensorflow.org/guide/distributed_training). So once the TonY task executors are up, TonY provides the TF_CONFIG for each TensorFlow role.
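
For illustration, worker 0's TF_CONFIG for the job above should look roughly like the comment in the sketch below (hosts and ports taken from the cluster spec in your worker logs; the snippet itself is just a hypothetical debugging aid you can put at the top of the training script to verify what TonY passed in):

import json
import os

# TonY sets TF_CONFIG in each container before launching the Python process.
# For worker 0 of the job above it should look roughly like:
# {"cluster": {"worker": ["test-hadoop02:38513",
#                         "test-hadoop02:42963",
#                         "test-hadoop02:42187"]},
#  "task": {"type": "worker", "index": 0}}
print(json.loads(os.environ.get('TF_CONFIG', '{}')))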

Back to this issue: why does it run like independent tasks? I think your TensorFlow training code doesn't support distributed training mode (it never uses the cluster spec), so each worker simply trains on its own copy of the data. I recommend using the tf.estimator or Keras API with a distribution strategy.
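
For example, here is a minimal sketch of how the training script could adopt tf.distribute.MultiWorkerMirroredStrategy, which reads the TF_CONFIG that TonY sets and synchronizes training across the three workers. The HDFS paths are the ones from this issue; the model definition is illustrative (based on the public IMDB tutorial), not your actual code:

import numpy as np
import tensorflow as tf
import tensorflow_io  # registers the hdfs:// filesystem scheme

# The strategy reads TF_CONFIG (set by TonY) and coordinates all workers;
# create it before any other TensorFlow operations.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with tf.io.gfile.GFile('hdfs://testcs/tmp/tensorflow/imdb/x_train.npy', mode='rb') as r:
    x_train = np.load(r, allow_pickle=True)
with tf.io.gfile.GFile('hdfs://testcs/tmp/tensorflow/imdb/y_train.npy', mode='rb') as r:
    y_train = np.load(r, allow_pickle=True)

# Pad the variable-length reviews to a fixed length, as in the tutorial
x_train = tf.keras.preprocessing.sequence.pad_sequences(
    x_train, padding='post', maxlen=256)

# Model creation and compilation must happen inside the strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(10000, 16),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# fit() now distributes batches across the workers instead of
# running an independent copy of the training loop on each one
model.fit(x_train, y_train, epochs=10, batch_size=512)

With this in place, each worker still reads the full arrays from HDFS, but gradient updates are synchronized, so the three containers cooperate on one model instead of producing three separate ones.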

LanstonWu commented 2 years ago

Thank you, I've learned a lot.