joapolarbear / dpro

Analysis for the traces from byteprofile

AssertionError: No explicit directory found under server_logs #8

Closed jasperzhong closed 3 years ago

jasperzhong commented 3 years ago

I tried to run the script but got the error below. Here is my command:

python3 analyze.py --option optimize --platform TENSORFLOW --comm_backend NCCL --nccl_algo RING --pretty --path capture_file_tf --workspace capture_file_tf

Log:

[2021-01-03 14:49:35] [analyze.py:16] INFO - Namespace(ckpt=False, clean=False, comm_backend='NCCL', cost_model_tmp_dir='./', debug_traces=False, del_queue=False, delay_ratio=1.1, disable_revise=False, filter=None, force=False, full_trace=False, head=None, heat_window_size=5, logging_level='INFO', mcmc_beta=100, metadata_path=None, nccl_algo='RING', no_mutation=False, optimizer='MCMC', option='optimize', path='capture_file_tf', pcap_file_path=None, platform='TENSORFLOW', pretty=True, profile_duration=None, profile_start_step=None, progress=False, relabel=False, server_log_path=None, show_queue=False, simulate=False, sort=False, step_num=1, sub_option=None, trace_level='info', ucb_gamma=0.1, ucb_type='AVG', ucb_visual=False, update_barrier=False, workspace='capture_file_tf', xlsx=False, zmq_log_path=None)
[2021-01-03 14:49:37] [dataloader.py:19] INFO - Use TENSORFLOW metadata
WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:11: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:15: The name tf.NodeDef is deprecated. Please use tf.compat.v1.NodeDef instead.

WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:23: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-01-03 14:49:38.839543: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-01-03 14:49:38.875380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:02:00.0
2021-01-03 14:49:38.876548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:03:00.0
2021-01-03 14:49:38.877014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-03 14:49:38.878811: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-03 14:49:38.880434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-03 14:49:38.880841: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-03 14:49:38.882994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-03 14:49:38.884649: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-03 14:49:38.889754: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-03 14:49:38.893663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
Set cost model to use GPU 1
/home/yuchen/repos/byteprofile-analysis/capture_file_tf
Traceback (most recent call last):
  File "analyze.py", line 126, in <module>
    clct = Collector(path_list[0], comm_backend=args_.comm_backend, platform=args.platform)
  File "/home/yuchen/repos/byteprofile-analysis/collect.py", line 70, in __init__
    self.pm = PathManager(root_path)
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 751, in __init__
    self.dir_level = self.get_dir_level(self.path)
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 772, in get_dir_level
    level = recur_look_up(_dir)
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 770, in recur_look_up
    return 1 + recur_look_up(os.path.join(root, target_dir))
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 769, in recur_look_up
    assert target_dir is not None, "No explicit directory found under {}".format(root)
AssertionError: No explicit directory found under /home/yuchen/repos/byteprofile-analysis/capture_file_tf/server_logs

Here is my trace directory structure.

capture_file_tf
├── collect_data.sh
├── comm_traces
│   ├── server_0.pcap
│   ├── server_1.pcap
│   ├── worker_0.pcap
│   └── worker_1.pcap
├── log_option-optimize.txt
├── run_0
│   ├── bps_cache.pickle
│   ├── bps_comm_aligned.json
│   ├── bps_trace_final.json
│   ├── comm_timeline.json
│   ├── ip_to_rank.txt
│   ├── log_option-replay.txt
│   ├── server_timeline.json
│   ├── statistic.txt
│   ├── synthetic.json
│   ├── traces_0
│   │   ├── 0
│   │   │   ├── comm.json
│   │   │   ├── dag.gml
│   │   │   ├── final_graph.json
│   │   │   ├── final_graph.pbtxt
│   │   │   ├── run_meta.json
│   │   │   ├── temp.json
│   │   │   ├── temp_json.tar
│   │   │   ├── tensor_shapes.json
│   │   │   └── variables_meta.json
│   │   └── key_dict.txt
│   ├── traces_1
│   │   └── 0
│   │       ├── comm.json
│   │       ├── dag.gml
│   │       ├── final_graph.json
│   │       ├── graph.json
│   │       ├── run_meta.json
│   │       ├── temp.json
│   │       ├── temp_json.tar
│   │       ├── tensor_shapes.json
│   │       └── variables_meta.json
│   └── trail_dag.gml
└── server_logs
    ├── server_log_0.txt
    └── server_log_1.txt

7 directories, 37 files
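
For anyone reproducing this: the failing frames in the traceback read the first entry of `os.walk` and then assert that a target subdirectory exists. Below is a minimal sketch, assuming a scaled-down copy of the layout above. The helpers `subdirs_of` and `build_layout` are hypothetical and are not dpro code; only the `list(os.walk(...))[0]` expression is taken from the traceback. It shows that `server_logs` holds only files, so there is no subdirectory to descend into.

```python
import os
import tempfile

def subdirs_of(path):
    """Return the immediate subdirectories of `path`, using the same
    list(os.walk(path))[0] expression that appears in the traceback."""
    root, dirs, files = list(os.walk(path))[0]
    return sorted(dirs)

def build_layout(base):
    """Create a scaled-down copy of the capture_file_tf tree (hypothetical helper)."""
    os.makedirs(os.path.join(base, "run_0", "traces_0", "0"))
    os.makedirs(os.path.join(base, "comm_traces"))
    os.makedirs(os.path.join(base, "server_logs"))
    for name in ("server_log_0.txt", "server_log_1.txt"):
        open(os.path.join(base, "server_logs", name), "w").close()

with tempfile.TemporaryDirectory() as base:
    build_layout(base)
    top_level = subdirs_of(base)
    server_logs_subdirs = subdirs_of(os.path.join(base, "server_logs"))
    print(top_level)            # directories directly under the root
    print(server_logs_subdirs)  # directories directly under server_logs
```

How dpro chooses which directory to descend into is not visible in the traceback, so this only illustrates why a lookup that ends up in `server_logs` finds no directory there.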
jasperzhong commented 3 years ago

I still got an error with the following command.

python3 analyze.py --option optimize --platform TENSORFLOW --comm_backend NCCL --nccl_algo RING --pretty --path capture_file_tf/run0 --workspace capture_file_tf --comm_backend BYTEPS
[2021-01-03 15:02:42] [analyze.py:16] INFO - Namespace(ckpt=False, clean=False, comm_backend='BYTEPS', cost_model_tmp_dir='./', debug_traces=False, del_queue=False, delay_ratio=1.1, disable_revise=False, filter=None, force=False, full_trace=False, head=None, heat_window_size=5, logging_level='INFO', mcmc_beta=100, metadata_path=None, nccl_algo='RING', no_mutation=False, optimizer='MCMC', option='optimize', path='capture_file_tf/run0', pcap_file_path=None, platform='TENSORFLOW', pretty=True, profile_duration=None, profile_start_step=None, progress=False, relabel=False, server_log_path=None, show_queue=False, simulate=False, sort=False, step_num=1, sub_option=None, trace_level='info', ucb_gamma=0.1, ucb_type='AVG', ucb_visual=False, update_barrier=False, workspace='capture_file_tf', xlsx=False, zmq_log_path=None)
[2021-01-03 15:02:44] [dataloader.py:19] INFO - Use TENSORFLOW metadata
WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:11: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:15: The name tf.NodeDef is deprecated. Please use tf.compat.v1.NodeDef instead.

WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:23: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-01-03 15:02:45.900784: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-01-03 15:02:45.956879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:02:00.0
2021-01-03 15:02:45.957792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:03:00.0
2021-01-03 15:02:45.958089: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-03 15:02:45.959618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-03 15:02:45.960960: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-03 15:02:45.961303: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-03 15:02:45.963276: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-03 15:02:45.964829: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-03 15:02:45.969345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-03 15:02:45.972714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
Set cost model to use GPU 1
/home/yuchen/repos/byteprofile-analysis/capture_file_tf/run0
Traceback (most recent call last):
  File "analyze.py", line 126, in <module>
    clct = Collector(path_list[0], comm_backend=args_.comm_backend, platform=args.platform)
  File "/home/yuchen/repos/byteprofile-analysis/collect.py", line 70, in __init__
    self.pm = PathManager(root_path)
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 751, in __init__
    self.dir_level = self.get_dir_level(self.path)
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 772, in get_dir_level
    level = recur_look_up(_dir)
  File "/home/yuchen/repos/byteprofile-analysis/trace_utils.py", line 759, in recur_look_up
    root, dirs, files = list(os.walk(_d))[0]
IndexError: list index out of range
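
One thing worth checking here: the command passes `capture_file_tf/run0`, while the directory tree in the first comment lists `run_0`. `os.walk` does not raise on a path that does not exist; it simply yields nothing, so indexing the resulting list with `[0]` fails with exactly this `IndexError`. The sketch below demonstrates that behaviour in a temporary directory. The helper name `first_walk_entry` is hypothetical, and whether the missing underscore is the actual cause in this run is an assumption.

```python
import os
import tempfile

def first_walk_entry(path):
    """Mimic the expression from the traceback: list(os.walk(path))[0]."""
    return list(os.walk(path))[0]

with tempfile.TemporaryDirectory() as base:
    existing = os.path.join(base, "run_0")
    os.makedirs(existing)
    missing = os.path.join(base, "run0")  # no underscore, as in --path above

    # os.walk silently yields nothing for a path that does not exist.
    walk_of_missing = list(os.walk(missing))

    try:
        first_walk_entry(missing)
        raised = False
    except IndexError:
        raised = True

    # The same call succeeds for the directory that is really there.
    root, dirs, files = first_walk_entry(existing)
    existing_ok = (root == existing)
```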
jasperzhong commented 3 years ago

I tried this command but still got an error.

python3 analyze.py --option optimize --platform TENSORFLOW --nccl_algo RING --pretty --path bert_bps_jan2/run_0 --workspace bert_bps_jan2 --comm_backend BYTEPS --zmq_log_path bert_bps_jan2/zmq_logs --server_log_path bert_bps_jan2/server_logs --profile_start_step 10 --profile_duration 10
[2021-01-03 15:28:51] [analyze.py:16] INFO - Namespace(ckpt=False, clean=False, comm_backend='BYTEPS', cost_model_tmp_dir='./', debug_traces=False, del_queue=False, delay_ratio=1.1, disable_revise=False, filter=None, force=False, full_trace=False, head=None, heat_window_size=5, logging_level='INFO', mcmc_beta=100, metadata_path=None, nccl_algo='RING', no_mutation=False, optimizer='MCMC', option='optimize', path='bert_bps_jan2/run_0', pcap_file_path=None, platform='TENSORFLOW', pretty=True, profile_duration=10, profile_start_step=10, progress=False, relabel=False, server_log_path='bert_bps_jan2/server_logs', show_queue=False, simulate=False, sort=False, step_num=1, sub_option=None, trace_level='info', ucb_gamma=0.1, ucb_type='AVG', ucb_visual=False, update_barrier=False, workspace='bert_bps_jan2', xlsx=False, zmq_log_path='bert_bps_jan2/zmq_logs')
[2021-01-03 15:28:53] [dataloader.py:19] INFO - Use TENSORFLOW metadata
WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:11: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:15: The name tf.NodeDef is deprecated. Please use tf.compat.v1.NodeDef instead.

WARNING:tensorflow:From /home/yuchen/repos/byteprofile-analysis/cost_model_xla/gen_dataset_utils.py:23: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-01-03 15:28:54.879739: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-01-03 15:28:54.915234: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:02:00.0
2021-01-03 15:28:54.916219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:03:00.0
2021-01-03 15:28:54.916594: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-03 15:28:54.918498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-03 15:28:54.920137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-03 15:28:54.920516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-03 15:28:54.922646: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-03 15:28:54.924411: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-03 15:28:54.929265: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-03 15:28:54.932690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
[2021-01-03 15:28:54] [graph.py:115] INFO - [BYTEPS] Using profile_start_step = 10.
[2021-01-03 15:28:54] [graph.py:121] INFO - [BYTEPS] Using profile_duration = 10.
[2021-01-03 15:28:54] [trace_utils.py:802] WARNING - Fail to find bps_trace_final.json in path /home/yuchen/repos/byteprofile-analysis/bert_bps_jan2/run_0
[2021-01-03 15:28:54] [collect.py:836] INFO - Inited BytePS graph helper from cache.
[2021-01-03 15:28:54] [parameter.py:21] INFO - Use TENSORFLOW metadata
Set cost model to use GPU 1
Traceback (most recent call last):
  File "analyze.py", line 127, in <module>
    iter_time = clct.init(args.force)
  File "/home/yuchen/repos/byteprofile-analysis/collect.py", line 903, in init
    self.collect_para_dict()
  File "/home/yuchen/repos/byteprofile-analysis/collect.py", line 760, in collect_para_dict
    self.para_dict = ParameterDict(self.pm, self.platform)
  File "/home/yuchen/repos/byteprofile-analysis/parameter.py", line 25, in __init__
    self.metainfo = MetaInfo(metadata_path)
  File "/home/yuchen/repos/byteprofile-analysis/ml_platform/tensorflow/metadata.py", line 29, in __init__
    with open(os.path.join(meta_dir, FileName.TENSOR_NAME.value), 'r') as fp:
FileNotFoundError: [Errno 2] No such file or directory: '/home/yuchen/repos/byteprofile-analysis/bert_bps_jan2/run_0/traces_0/0/gradient_name_list.json'
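
A small pre-flight check can list which metadata files are absent before starting the analysis. This is a sketch under assumptions: `missing_files` is a hypothetical helper, not part of dpro, and the only file name known to be required is `gradient_name_list.json`, taken from the error above. Any other required names would have to be confirmed against the dpro source.

```python
import os
import tempfile

def missing_files(meta_dir, required):
    """Return the names in `required` that do not exist under `meta_dir`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(meta_dir, name))]

# Only this name is known from the traceback; extend the list as needed.
REQUIRED = ["gradient_name_list.json"]

with tempfile.TemporaryDirectory() as meta_dir:
    before = missing_files(meta_dir, REQUIRED)   # file not created yet
    open(os.path.join(meta_dir, REQUIRED[0]), "w").close()
    after = missing_files(meta_dir, REQUIRED)    # file now present
```

For the run above, `meta_dir` would correspond to `bert_bps_jan2/run_0/traces_0/0`, the directory named in the `FileNotFoundError`.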