microsoft / superbenchmark

A validation and profiling tool for AI infrastructure
https://aka.ms/superbench
MIT License

/sys/fs/cgroup/cpuacct/cpuacct missing causing superbench failures #436

Closed amathews-amd closed 1 year ago

amathews-amd commented 1 year ago

Docker container: nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04
GPU 0: NVIDIA A100 80GB PCIe

[2022-11-23T15:02:17.260Z] Running model...

[2022-11-23T15:02:17.260Z] > docker exec dd7780c3a5f9 bash -c "cd superbenchmark && bash run.sh atoa_small_hayabusa.yaml atoa_small_hayabusa_performance.csv"

[2022-11-23T15:18:24.479Z] NVIDIA GPU detected.

[2022-11-23T15:18:24.479Z] sb exec --config-file   atoa_small_ndv4.yaml    2>&1 | tee log.txt

[2022-11-23T15:18:24.479Z] [2022-11-23 15:02:18,200 rocm-framework-a100-1:400][executor.py:224][INFO] Executor is going to execute gpt_models/pytorch-gpt2-small.

[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:18,202 rocm-framework-a100-1:466][monitor.py:100][INFO] Start monitoring.

[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:18,203 rocm-framework-a100-1:466][monitor.py:226][ERROR] Failed to read process cpu ticks information - error message: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.stat'

[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:19,205 rocm-framework-a100-1:466][monitor.py:226][ERROR] Failed to read process cpu ticks information - error message: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.stat'

[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:19,206 rocm-framework-a100-1:466][monitor.py:105][ERROR] Failed to launch the monitor process - error message: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

[2022-11-23T15:18:24.480Z] Process Monitor-1:

[2022-11-23T15:18:24.480Z] Traceback (most recent call last):

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 102, in run

[2022-11-23T15:18:24.480Z]     self.__sample()

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 126, in __sample

[2022-11-23T15:18:24.480Z]     self.__sample_host_metrics(record)

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 152, in __sample_host_metrics

[2022-11-23T15:18:24.480Z]     cpu_usage = (container_ticks_e -

[2022-11-23T15:18:24.480Z] TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

[2022-11-23T15:18:24.480Z] 

[2022-11-23T15:18:24.480Z] During handling of the above exception, another exception occurred:

[2022-11-23T15:18:24.480Z] 

[2022-11-23T15:18:24.480Z] Traceback (most recent call last):

[2022-11-23T15:18:24.480Z]   File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap

[2022-11-23T15:18:24.480Z]     self.run()

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 106, in run

[2022-11-23T15:18:24.480Z]     self.stop()

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 117, in stop

[2022-11-23T15:18:24.480Z]     self.join()

[2022-11-23T15:18:24.480Z]   File "/usr/lib/python3.8/multiprocessing/process.py", line 147, in join

[2022-11-23T15:18:24.480Z]     assert self._parent_pid == os.getpid(), 'can only join a child process'

[2022-11-23T15:18:24.480Z] AssertionError: can only join a child process

https://github.com/microsoft/superbenchmark/blob/6e357fb9d2038dabd4e2c07854c92ca7b0805cee/superbench/monitor/monitor.py#L83
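
The traceback above shows the monitor subtracting two tick samples that were both None, because '/sys/fs/cgroup/cpuacct/cpuacct.stat' could not be read. As a rough illustration of that failure mode and one way to guard against it, here is a minimal sketch. It is not SuperBench's actual code: the names `read_container_ticks` and `ticks_delta` are hypothetical, and it assumes the file uses the usual `user <n>` / `system <n>` line layout.

```python
from typing import Optional

# Path taken from the error message in the log above.
CPUACCT_STAT = '/sys/fs/cgroup/cpuacct/cpuacct.stat'


def read_container_ticks(path: str = CPUACCT_STAT) -> Optional[int]:
    """Return user + system ticks from a cpuacct.stat-style file.

    Returns None when the file is missing or malformed (hypothetical helper).
    """
    try:
        with open(path, 'r') as f:
            ticks = 0
            for line in f:
                fields = line.split()
                # Assumed layout: "user <ticks>" and "system <ticks>".
                if len(fields) == 2 and fields[0] in ('user', 'system'):
                    ticks += int(fields[1])
            return ticks
    except (OSError, ValueError):
        return None


def ticks_delta(start: Optional[int], end: Optional[int]) -> Optional[int]:
    """Return end - start, or None when either sample is unavailable."""
    if start is None or end is None:
        # Skip the subtraction instead of raising TypeError on None - None.
        return None
    return end - start
```

With a guard like this, a missing stat file would yield a skipped CPU sample rather than an unhandled TypeError in the monitor process.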

yukirora commented 1 year ago

Hi, the problem is that "/sys/fs/cgroup/cpuacct/cpuacct.stat" is missing in your current environment, which causes the SuperBench monitor to fail. Could you please temporarily disable the monitor feature by changing the config file "atoa_small_ndv4.yaml" from

# SuperBench Config
version: v0.4
superbench:
  enable: null
  monitor:
    enable: true
    sample_duration: 1
    sample_interval: 10

to

# SuperBench Config
version: v0.4
superbench:
  enable: null
  monitor:
    enable: false
    sample_duration: 1
    sample_interval: 10
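
To decide ahead of a run whether the monitor needs to be disabled this way, a small pre-flight check along these lines might help. This is only a sketch: `monitor_enable_value` is a hypothetical helper, not part of SuperBench, and it assumes the monitor needs nothing beyond read access to the path shown in the error log.

```python
import os

# Path taken from the error message in the issue log.
CPUACCT_STAT = '/sys/fs/cgroup/cpuacct/cpuacct.stat'


def monitor_enable_value(stat_path: str = CPUACCT_STAT) -> str:
    """Suggest a value for superbench.monitor.enable (hypothetical helper).

    Returns 'true' when the stat file exists and is readable, else 'false'.
    """
    readable = os.path.isfile(stat_path) and os.access(stat_path, os.R_OK)
    return 'true' if readable else 'false'


if __name__ == '__main__':
    # Print the suggested setting; the config file itself is left untouched.
    print(f'suggested monitor.enable: {monitor_enable_value()}')
```

Running it inside the same container that executes `sb exec` shows whether the file is visible there, since the result can differ between the host and the container.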