jasperzhong closed this issue 3 years ago
I still got an error with the following command.
python3 analyze.py --option optimize --platform TENSORFLOW --comm_backend NCCL --nccl_algo RING --pretty --path capture_file_tf/run0 --workspace capture_file_tf --comm_backend BYTEPS
[2021-01-03 15:02:42] [analyze.py:16] INFO - Namespace(ckpt=False, clean=False, comm_backend='BYTEPS', cost_model_tmp_dir='./', debug_traces=False, del_queue=False, delay_ratio=1.1, disable_revise=False, filter=None, force=False, full_trace=False, head=None, heat_window_size=5, logging_level='INFO', mcmc_beta=100, metadata_path=None, nccl_algo='RING', no_mutation=False, optimizer='MCMC', option='optimize', path='capture_file_tf/run0', pcap_file_path=None, platform='TENSORFLOW', pretty=True, profile_duration=None, profile_start_step=None, progress=False, relabel=False, server_log_path=None, show_queue=False, simulate=False, sort=False, step_num=1, sub_option=None, trace_level='info', ucb_gamma=0.1, ucb_type='AVG', ucb_visual=False, update_barrier=False, workspace='capture_file_tf', xlsx=False, zmq_log_path=None)
[2021-01-03 15:02:44] [dataloader.py:19] INFO - Use TENSORFLOW metadata
WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:11: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:15: The name tf.NodeDef is deprecated. Please use tf.compat.v1.NodeDef instead.
WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:23: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
2021-01-03 15:02:45.900784: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-01-03 15:02:45.956879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:02:00.0
2021-01-03 15:02:45.957792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:03:00.0
2021-01-03 15:02:45.958089: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-03 15:02:45.959618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-03 15:02:45.960960: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-03 15:02:45.961303: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-03 15:02:45.963276: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-03 15:02:45.964829: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-03 15:02:45.969345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-03 15:02:45.972714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
Set cost model to use GPU 1
/home/yuchen/repos/byteprofile-analysis/capture_file_tf/run0
Traceback (most recent call last):
  File "analyze.py", line 126, in <module>
    clct = Collector(path_list[0], comm_backend=args_.comm_backend, platform=args.platform)
  File "/home/yuchen/repos/byteprofile-analysis/collect.py", line 70, in __init__
    self.pm = PathManager(root_path)
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 751, in __init__
    self.dir_level = self.get_dir_level(self.path)
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 772, in get_dir_level
    level = recur_look_up(_dir)
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 759, in recur_look_up
    root, dirs, files = list(os.walk(_d))[0]
IndexError: list index out of range
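For context on the traceback above: `list(os.walk(path))[0]` raises `IndexError` when `os.walk` yields nothing, which is its default behaviour when the path is missing, is not a directory, or cannot be read. It may therefore be worth checking that the directory passed to `--path` exists. The following is a minimal sketch of a check that fails with a clearer message. It is not the project's code, and `first_walk_entry` is a hypothetical helper name.

```python
import os
import tempfile


def first_walk_entry(path):
    """Return the first (root, dirs, files) tuple from os.walk(path).

    Raises FileNotFoundError instead of IndexError when os.walk yields nothing.
    """
    # os.walk ignores scandir errors by default (onerror=None), so a missing
    # or unreadable path produces an empty iterator rather than an exception.
    entry = next(os.walk(path), None)
    if entry is None:
        raise FileNotFoundError(
            f"Trace directory is missing or unreadable: {path!r}"
        )
    return entry


with tempfile.TemporaryDirectory() as tmp:
    # An existing directory gives a normal (root, dirs, files) tuple.
    root, dirs, files = first_walk_entry(tmp)

    # A path that does not exist now gives a descriptive error.
    try:
        first_walk_entry(os.path.join(tmp, "run0"))
    except FileNotFoundError as exc:
        print(exc)
```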
I tried this command but still got an error.
python3 analyze.py --option optimize --platform TENSORFLOW --nccl_algo RING --pretty --path bert_bps_jan2/run_0 --workspace bert_bps_jan2 --comm_backend BYTEPS --zmq_log_path bert_bps_jan2/zmq_logs --server_log_path bert_bps_jan2/server_logs --profile_start_step 10 --profile_duration 10
[2021-01-03 15:28:51] [analyze.py:16] INFO - Namespace(ckpt=False, clean=False, comm_backend='BYTEPS', cost_model_tmp_dir='./', debug_traces=False, del_queue=False, delay_ratio=1.1, disable_revise=False, filter=None, force=False, full_trace=False, head=None, heat_window_size=5, logging_level='INFO', mcmc_beta=100, metadata_path=None, nccl_algo='RING', no_mutation=False, optimizer='MCMC', option='optimize', path='bert_bps_jan2/run_0', pcap_file_path=None, platform='TENSORFLOW', pretty=True, profile_duration=10, profile_start_step=10, progress=False, relabel=False, server_log_path='bert_bps_jan2/server_logs', show_queue=False, simulate=False, sort=False, step_num=1, sub_option=None, trace_level='info', ucb_gamma=0.1, ucb_type='AVG', ucb_visual=False, update_barrier=False, workspace='bert_bps_jan2', xlsx=False, zmq_log_path='bert_bps_jan2/zmq_logs')
[2021-01-03 15:28:53] [dataloader.py:19] INFO - Use TENSORFLOW metadata
WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:11: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:15: The name tf.NodeDef is deprecated. Please use tf.compat.v1.NodeDef instead.
WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:23: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
2021-01-03 15:28:54.879739: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-01-03 15:28:54.915234: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:02:00.0
2021-01-03 15:28:54.916219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:03:00.0
2021-01-03 15:28:54.916594: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-03 15:28:54.918498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-03 15:28:54.920137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-03 15:28:54.920516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-03 15:28:54.922646: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-03 15:28:54.924411: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-03 15:28:54.929265: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-03 15:28:54.932690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
[2021-01-03 15:28:54] [graph.py:115] INFO - [BYTEPS] Using profile_start_step = 10.
[2021-01-03 15:28:54] [graph.py:121] INFO - [BYTEPS] Using profile_duration = 10.
[2021-01-03 15:28:54] [trace_utils.py:802] WARNING - Fail to find bps_trace_final.json in path /home/yuchen/repos/byteprofile-analysis/bert_bps_jan2/run_0
[2021-01-03 15:28:54] [collect.py:836] INFO - Inited BytePS graph helper from cache.
[2021-01-03 15:28:54] [parameter.py:21] INFO - Use TENSORFLOW metadata
Set cost model to use GPU 1
Traceback (most recent call last):
  File "analyze.py", line 127, in <module>
    iter_time = clct.init(args.force)
  File "/home/yuchen/repos/byteprofile-analysis/collect.py", line 903, in init
    self.collect_para_dict()
  File "/home/yuchen/repos/byteprofile-analysis/collect.py", line 760, in collect_para_dict
    self.para_dict = ParameterDict(self.pm, self.platform)
  File "/home/yuchen/repos/byteprofile-analysis/parameter.py", line 25, in __init__
    self.metainfo = MetaInfo(metadata_path)
  File "/home/yuchen/repos/byteprofile-analysis/ml_platform/tensorflow/metadata.py", line 29, in __init__
    with open(os.path.join(meta_dir, FileName.TENSOR_NAME.value), 'r') as fp:
FileNotFoundError: [Errno 2] No such file or directory: '/home/yuchen/repos/byteprofile-analysis/bert_bps_jan2/run_0/traces_0/0/gradient_name_list.json'
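The error above says the analyzer expected `gradient_name_list.json` inside `run_0/traces_0/0/` and did not find it. Before launching the analysis, a quick check of the trace directory can confirm whether that file was captured. The sketch below rests on assumptions: `missing_metadata` is a hypothetical helper, and the list of required files contains only the one name that appears in the error message, since the full set the project needs is not stated here.

```python
import os
import tempfile

# Assumption: this file name comes from the error message above; the project
# may require other metadata files that are not listed here.
REQUIRED_FILES = ("gradient_name_list.json",)


def missing_metadata(meta_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files in `meta_dir`."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(meta_dir, name))
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a per-worker directory such as <run dir>/traces_0/0.
    meta_dir = os.path.join(tmp, "traces_0", "0")
    os.makedirs(meta_dir)

    # The file has not been created yet, so it is reported as missing.
    print(missing_metadata(meta_dir))

    # After creating it, nothing is reported.
    open(os.path.join(meta_dir, "gradient_name_list.json"), "w").close()
    print(missing_metadata(meta_dir))
```

Run against the real trace directory, a non-empty result would suggest the metadata was not dumped during profiling, rather than a problem with the analysis command itself.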
I tried to run the script but got this error. Here is my command.
log
Here is my trace directory structure.