I'll look into it.
This issue should only occur on very fast CPUs, which makes it hard to reproduce on ordinary hardware. However, since it is CPU-side, it can show up on any accelerator + CPU system running PyTorch 2.3 or earlier.
The root cause is in the profiler post-processing in PyTorch, which truncates timestamps when converting from ns to us instead of rounding them.
Consider a case with 3 operators: grandparent_op, parent_op, and child_op. The expected timeline between these 3 ops is grandparent_op_start --> parent_op_start --> child_op_start --> child_op_end --> parent_op_end --> grandparent_op_end. The self time of the grandparent is then grandparent_op_duration = grandparent_op_end - grandparent_op_start - parent_op_duration.
In some cases, PyTorch's post-processing disorders the relationship between these 3 ops: it treats child_op as a child of grandparent_op instead of parent_op, because the computed child_op_end is later than parent_op_end.
https://github.com/pytorch/pytorch/blob/6181e65cd81725efc6bc5d64ef3be607b0aa3ca1/torch/autograd/profiler_util.py#L124-L125
The timeline then becomes grandparent_op_start --> parent_op_start --> child_op_start --> parent_op_end --> child_op_end --> grandparent_op_end. As a result, grandparent_op_duration = grandparent_op_end - grandparent_op_start - parent_op_duration - child_op_duration becomes a negative value, because the child interval is subtracted twice: once inside parent_op_duration and once more as child_op_duration.
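To make the double-counting concrete, here is a small illustrative computation in plain Python; the durations are hypothetical, not taken from a real trace. Once child_op is wrongly re-parented under grandparent_op, its interval is subtracted once inside parent_op_duration and a second time as child_op_duration:

```python
# Illustrative numbers only (hypothetical, in us) to show the double-counting.
grandparent_total_us = 100   # grandparent_op_end - grandparent_op_start
parent_op_duration_us = 98   # already contains the child interval
child_op_duration_us = 95    # subtracted a second time after re-parenting

grandparent_self_us = grandparent_total_us - parent_op_duration_us - child_op_duration_us
print(grandparent_self_us)  # -93: a negative "Self CPU", like the one reported
```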
Why is parent_op_end earlier than child_op_end? Because the end timestamp of an operator is the result of two computations rather than a raw timestamp: op_end = op_start + op_duration, where op_duration = raw_op_end - raw_op_start, and both conversions truncate. https://github.com/pytorch/pytorch/blob/a4ef9cdd2807e7138e29d12ff03b48f60e1a5189/torch/csrc/autograd/profiler_kineto.cpp#L775-L777 https://github.com/pytorch/pytorch/blob/a4ef9cdd2807e7138e29d12ff03b48f60e1a5189/torch/autograd/profiler.py#L470
For example, suppose parent_start_ns = 315877, parent_end_ns = 486764, child_start_ns = 319059, and child_end_ns = 486499.
Then parent_duration_us = (486764 - 315877) / 1000 = 170 and child_duration_us = (486499 - 319059) / 1000 = 167, both truncated.
And parent_start_us = 315877 / 1000 = 315, so parent_end_us = parent_start_us + parent_duration_us = 485.
But child_start_us = 319059 / 1000 = 319, so child_end_us = child_start_us + child_duration_us = 486.
So the computed child_end_us is later than parent_end_us, even though the raw child_end_ns is earlier than the raw parent_end_ns.
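The same arithmetic in plain Python, as a sketch of the truncation behavior (this is not the profiler's actual code path):

```python
# All raw timestamps in ns, from the example above.
parent_start_ns, parent_end_ns = 315877, 486764
child_start_ns, child_end_ns = 319059, 486499

def us_trunc(ns):
    # Integer division truncates, like the old ns -> us conversion did.
    return ns // 1000

# Starts and durations are truncated independently, and the end timestamp
# is reconstructed as start + duration...
parent_end_us = us_trunc(parent_start_ns) + us_trunc(parent_end_ns - parent_start_ns)
child_end_us = us_trunc(child_start_ns) + us_trunc(child_end_ns - child_start_ns)

# ...so the computed ordering inverts the raw ordering.
assert child_end_ns < parent_end_ns   # raw: child ends first
assert child_end_us > parent_end_us   # computed: 486 > 485
```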
Fortunately, in the latest PyTorch master, the profiler post-processing promoted timestamp precision from us to ns, which sidesteps this issue: https://github.com/pytorch/pytorch/pull/123650 It remains a latent problem, though, if the precision is ever changed again and truncation is re-adopted. Therefore, we will submit a PR that checks the parent relationship using raw_op_end instead of the computed end timestamp.
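The idea behind that fix, sketched in plain Python (hypothetical record layout; the actual change is in the PR below): decide nesting from the raw ns timestamps, which preserve ordering, rather than from the reconstructed us end times, which do not:

```python
# Hedged sketch, not the actual PR code: keep the raw ns timestamps on each
# event and decide containment on those, so truncation cannot re-parent ops.
def contains(parent, child):
    return (parent["start_ns"] <= child["start_ns"]
            and child["end_ns"] <= parent["end_ns"])

parent = {"start_ns": 315877, "end_ns": 486764}
child = {"start_ns": 319059, "end_ns": 486499}
print(contains(parent, child))  # True: child correctly nests under parent
```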
Associated fix PR: https://github.com/pytorch/pytorch/pull/129554
🐛 Describe the bug
We found negative numbers in PyTorch profiler output, which makes it hard for users to get a reliable per-operator profile:

Name          Self CPU %  Self CPU       CPU total %  CPU total  CPU time avg  # of Calls
ProfilerStep  -6.49%      -247365.000us  100.00%      3.813s     3.813s        1

Self CPU time total: 3.813s
Versions
Collecting environment information...
PyTorch version: 2.1.0.dev20230518+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: CentOS Stream 8 (x86_64)
GCC version: (GCC) 11.2.1 20210728 (Red Hat 11.2.1-1)
Clang version: 14.0.0 (Red Hat 14.0.0-1.module_el8.7.0+1142+5343df54)
CMake version: version 3.22.1
Libc version: glibc-2.28

Python version: 3.8.16 (default, Mar 2 2023, 03:21:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-365.el8.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.1.0.dev20230518+cpu
[conda] blas 1.0 mkl
[conda] mkl 2023.1.0 h6d00ec8_46342
[conda] mkl-include 2023.1.0 pypi_0 pypi
[conda] mkl-service 2.4.0 py38h5eee18b_1
[conda] mkl-static 2023.1.0 pypi_0 pypi
[conda] mkl_fft 1.3.6 py38h417a72b_1
[conda] mkl_random 1.2.2 py38h417a72b_1
[conda] numpy 1.24.3 py38hf6e8229_1
[conda] numpy-base 1.24.3 py38h060ed82_1
[conda] torch 2.1.0.dev20230518+cpu pypi_0 pypi
cc @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb @dzhulgakov @davidberard98