intel / pti-gpu

Profiling Tools Interfaces for GPU (PTI for GPU) is a set of Getting Started Documentation and Tools Library to start performance analysis on Intel(R) Processor Graphics easily
MIT License

[Unitrace] Tool always aborts with an assertion error in 'UniTracer::Create' when profiling Python scripts #64

Open xunsongh opened 6 months ago

xunsongh commented 6 months ago

I built the unitrace tool on a PVC machine with driver agama-ci-devel-hotfix-821.36, with default settings and without MPI support, and then tried to run it on a simple Python script, but it always aborts with an assertion error in UniTracer::Create.

Here are the commands I used to run the successfully built unitrace tool:

./unitrace -h python ./simple.py
./unitrace --chrome-kernel-logging --chrome-dnn-logging --chrome-ccl-logging python ./simple.py

I also tried other options, but all of them failed with the same assertion error:

python: /home/gta/pti-gpu/tools/unitrace/src/tracer.h:50: static UniTracer* UniTracer::Create(const TraceOptions&): Assertion `status == ZE_RESULT_SUCCESS' failed.
Aborted (core dumped)

My test case is as simple as it could be:

if __name__ == '__main__':
    a = 1

Would you please help check why the unitrace tool crashes on such a simple case, which is not even related to SYCL or L0?

Sarbojit2019 commented 5 months ago

Are you able to run L0 applications successfully? Most probably UniTracer::Create is failing due to an L0 call failure. This happens during the tool's initialization, where it interacts with L0, so it does not really matter what your app is doing :). A few things I would suggest trying:

  1. Check whether any SYCL or L0 app that exercises L0 APIs runs fine in the same environment.
  2. Try to build Unitrace in the environment where you want to run it. In the past I have seen people build the tool in one environment and then run it in a different one, which caused the tool to fail.
  3. Try to find out from the assert which L0 API is failing, and collect the error number.

BTW, any chance you could try a different machine to verify the behavior?
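
One way to act on suggestions 1 and 3 from inside the same Python environment is to call the Level Zero loader directly and print the raw result code. The following is only a sketch: it assumes that the failing call is `zeInit` (the thread does not confirm which L0 API trips the assertion) and that the loader is installed as `libze_loader.so.1`. The helper names `ze_init_status` and `describe` are made up for illustration; the two constants are copied from `ze_api.h`.

```python
import ctypes

ZE_RESULT_SUCCESS = 0       # value of ZE_RESULT_SUCCESS in ze_api.h
ZE_INIT_FLAG_GPU_ONLY = 1   # value of ZE_INIT_FLAG_GPU_ONLY in ze_api.h


def ze_init_status(lib_name="libze_loader.so.1"):
    """Call zeInit through the Level Zero loader and return its ze_result_t.

    Returns None when the loader library itself cannot be loaded.
    The library name is an assumption; adjust it for your installation.
    """
    try:
        loader = ctypes.CDLL(lib_name)
    except OSError:
        return None
    loader.zeInit.argtypes = [ctypes.c_uint32]  # ze_init_flags_t
    loader.zeInit.restype = ctypes.c_int        # ze_result_t
    return loader.zeInit(ZE_INIT_FLAG_GPU_ONLY)


def describe(status):
    """Turn the raw status into a short human-readable message."""
    if status is None:
        return "Level Zero loader not found"
    if status == ZE_RESULT_SUCCESS:
        return "zeInit succeeded"
    return f"zeInit failed with ze_result_t 0x{status & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    print(describe(ze_init_status()))
```

Running this with the same `python` that is passed to unitrace shows whether L0 initialization works in that interpreter's environment at all, and the hexadecimal code can be looked up in the `ze_result_t` enumeration.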

xunsongh commented 5 months ago

Thank you for your guidance. Here are my replies to your suggestions:

  1. I can use the unitrace tool to trace all of those C++ executables; it fails only on this simple Python case.
  2. Of course. I built, ran, and tested many cases within a clean environment set up by conda.
  3. Sorry, I don't have the knowledge to track down the failed L0 API. In gdb's backtrace, the top frames showed as '??' without any useful information.

I had only one PVC machine available, which is where I found this issue, and unfortunately it broke down several days ago.

Sarbojit2019 commented 5 months ago

Regarding your response to item 1, I doubt this is related to the Python app. Judging by the failure point, it looks to be at the very beginning. Let's connect internally to look at the setup and the failure.

zma2 commented 4 months ago

@xunsongh Please check the version of libstdc++.so in your conda env. If it is lower than 6.0.30, you need to upgrade it to at least 6.0.30.
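
A minimal sketch of that check, assuming the conda environment is active (so `CONDA_PREFIX` is set) and that its libstdc++ lives under `$CONDA_PREFIX/lib` with the usual `libstdc++.so.6.0.N` file name. The helper names `parse_version` and `newest_libstdcxx` are hypothetical, introduced only for this example.

```python
import glob
import os
import re

REQUIRED = (6, 0, 30)  # minimum version named in the comment above


def parse_version(path):
    """Extract (6, 0, 30) from a name like 'libstdc++.so.6.0.30'.

    Returns None for names without a full three-part version,
    such as the 'libstdc++.so.6' symlink.
    """
    name = os.path.basename(path)
    match = re.search(r"libstdc\+\+\.so\.(\d+)\.(\d+)\.(\d+)$", name)
    return tuple(int(part) for part in match.groups()) if match else None


def newest_libstdcxx(lib_dir):
    """Return the highest libstdc++ version found in lib_dir, or None."""
    candidates = glob.glob(os.path.join(lib_dir, "libstdc++.so.*"))
    versions = [v for v in map(parse_version, candidates) if v is not None]
    return max(versions) if versions else None


if __name__ == "__main__":
    lib_dir = os.path.join(os.environ.get("CONDA_PREFIX", ""), "lib")
    found = newest_libstdcxx(lib_dir)
    if found is None:
        print(f"no versioned libstdc++ found in {lib_dir!r}")
    elif found < REQUIRED:
        print(f"libstdc++ {found} is older than {REQUIRED}; upgrade needed")
    else:
        print(f"libstdc++ {found} is new enough")
```

Comparing the versions as integer tuples avoids the string-comparison trap where "6.0.9" would sort after "6.0.30".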