Open mumar-intel opened 1 year ago
@mumar-intel Sorry for the delayed response.
There have been several fixes in oneprof recently. Could you please try the collection with the latest oneprof and let us know whether the issue still reproduces? Thank you.
Hi @jfedorov, I also ran into this issue. I updated to the latest commit (9ee0e46cafa145856eaeeefe5f26ec046462300f) and got the error below. Is this expected?
pti-gpu/tools/oneprof/metric_query_cache.h:69: _zet_metric_query_handle_t* MetricQueryCache::GetQuery(ze_context_handle_t): Assertion `status == ZE_RESULT_SUCCESS' failed.
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: spr [Genuine Intel(R) CPU 0000%@]
Registry and code: 13 MB
Command: python test_linear.py
Uptime: 7.938176 s
Aborted (core dumped)
I am using oneprof on an HPC+AI application with a large number of kernels (~30). When I run:
oneprof -q -o test.txt $APP_EXE
it fails with the error:
oneprof/metric_query_collector.h:307: void MetricQueryCollector::ProcessQuery(const ZeQueryInfo&): Assertion `status == ZE_RESULT_SUCCESS' failed
It generates the output files (result.data and test.txt), but test.txt contains only the application's total runtime and no information about the individual kernels.
I have tested it on one tile and on one GPU. The application does not use MPI; it is Python-based code.