Profiling Tools Interfaces for GPU (PTI for GPU) is a set of Getting Started Documentation and Tools Library to start performance analysis on Intel(R) Processor Graphics easily
(llama-17oct) user@BA-ARCH-LAB-SPR-PVC-2T:~/17oct/frameworks.ai.pytorch.gpu-models/LLM/generation$ /home/user/17oct/pti-gpu/tools/oneprof/build/./oneprof -q -o newlog_llama7b_oneprof_q_O_log.txt -p /home/user/17oct/oneprof_temp/ -s 1000 python -u run_generation.py --device xpu --ipex --dtype float16 --input-tokens 32 --max-new-tokens 32 --num-beam 1 --benchmark -m decapoda-research/llama-7b-hf --sub-model-name llama-7b
Namespace(model_id='decapoda-research/llama-7b-hf', sub_model_name='llama-7b', device='xpu', dtype='float16', input_tokens='32', max_new_tokens=32, prompt=None, greedy=False, ipex=True, jit=False, profile=False, benchmark=True, lambada=False, dataset='lambada', accuracy_only=False, num_beam=1, num_iter=10, num_warmup=3, batch_size=1, token_latency=False, print_memory=False, disable_optimize_transformers=False, woq=False, calib_dataset='wikitext2', calib_group_size=-1, calib_output_dir='./', calib_checkpoint_name='quantized_weight.pt', calib_nsamples=128, calib_wbits=4, calib_seed=0, woq_checkpoint_path='')
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:36<00:00, 1.11s/it]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
python: /home/user/17oct/pti-gpu/tools/oneprof/metric_query_cache.h:69: _zet_metric_query_handle_t* MetricQueryCache::GetQuery(ze_context_handle_t): Assertion `status == ZE_RESULT_SUCCESS' failed.
The failure appears to be specific to the "-q" option: the same command seems to run fine with the "-k" option.
Could you please look into this as a priority? It is blocking our analysis of LLM workloads.
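To make the "-q" versus "-k" comparison easier to repeat, here is a minimal sketch of a helper that builds the same oneprof command line with either option and checks the captured output for the assertion message shown in the log above. The function names (`build_command`, `hit_metric_query_assertion`, `run_once`) and the log file naming are hypothetical. The oneprof flags and the workload arguments are copied verbatim from the command in this report and are not otherwise verified.

```python
import shlex
import subprocess

# Assertion text copied from the log in this report.
ASSERTION_MARKER = "Assertion `status == ZE_RESULT_SUCCESS' failed"

# Workload command line copied from this report.
WORKLOAD = (
    "python -u run_generation.py --device xpu --ipex --dtype float16 "
    "--input-tokens 32 --max-new-tokens 32 --num-beam 1 --benchmark "
    "-m decapoda-research/llama-7b-hf --sub-model-name llama-7b"
)


def build_command(oneprof_path, mode_flag, log_file, p_dir, s_value=1000):
    """Build the oneprof argv, varying only the mode flag ("-q" or "-k").

    log_file, p_dir and s_value are passed to -o, -p and -s exactly as in
    the original command.
    """
    return [
        oneprof_path, mode_flag,
        "-o", log_file,
        "-p", p_dir,
        "-s", str(s_value),
        *shlex.split(WORKLOAD),
    ]


def hit_metric_query_assertion(output_text):
    """Return True if the captured output contains the reported assertion."""
    return ASSERTION_MARKER in output_text


def run_once(oneprof_path, mode_flag, p_dir):
    """Run one configuration; return (exit code, whether the assertion fired)."""
    # Hypothetical log name, e.g. "oneprof_q_log.txt" for "-q".
    log_file = "oneprof" + mode_flag.replace("-", "_") + "_log.txt"
    cmd = build_command(oneprof_path, mode_flag, log_file, p_dir)
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, hit_metric_query_assertion(proc.stdout + proc.stderr)
```

Calling `run_once(path_to_oneprof, "-q", p_dir)` and then `run_once(path_to_oneprof, "-k", p_dir)` from the generation directory should show whether the assertion is tied to "-q" alone, assuming the environment matches the one in this report.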