openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: [get_profiling_info]: Why is the output of my openvino.runtime.ProfilingInfo all zeros? #24885

Closed: HPUedCSLearner closed this issue 5 months ago

HPUedCSLearner commented 5 months ago

OpenVINO Version

OpenVINO 2024.1

Operating System

Ubuntu 20.04 (LTS)

Device used for inference

CPU

Framework

None

Model used

LLM: Qwen

Issue description

I want to get per-operator execution times when running LLM inference with OpenVINO on the CPU, similar to the PyTorch profiler, but the values I get are all zero. Am I doing something wrong?

Step-by-step reproduction

Code:

import numpy as np
from pathlib import Path
import openvino_tokenizers  # importing this package registers the tokenizer/detokenizer ops with OpenVINO
from openvino import compile_model, Tensor

model_path = "/juicefs-algorithm/workspace/vision/yongshuai_wang/models/hf-models/openvino-optimum-models/ov_qwen1.5_0.5B"
# model_path = "/juicefs-algorithm/workspace/vision/yongshuai_wang/models/hf-models/openvino-optimum-models/ov_qwen1.5_4B-chat-int8"

# Stage 1
model_dir = Path(model_path)
# Compile the tokenizer, model, and detokenizer using OpenVINO. These files are XML 
# representations of the models optimized for OpenVINO
tokenizer = compile_model(model_dir / "openvino_tokenizer.xml", "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
detokenizer = compile_model(model_dir / "openvino_detokenizer.xml", "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
infer_request = compile_model(model_dir / "openvino_model.xml", "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"}).create_infer_request()

# Stage 2
text_input = ["你好,请问你是谁?"]  # "Hello, may I ask who you are?"
model_input = {name.any_name: Tensor(output) for name, output in tokenizer(text_input).items()}
print(model_input)

# Stage 3
if "position_ids" in (input.any_name for input in infer_request.model_inputs):
    model_input["position_ids"] = np.arange(model_input["input_ids"].shape[1], dtype=np.int64)[np.newaxis, :]

# no beam search, set idx to 0
model_input["beam_idx"] = Tensor(np.array(range(len(text_input)), dtype=np.int32))

# end of sentence token - the model signifies the end of text generation
# for now can be obtained from the original tokenizer `original_tokenizer.eos_token_id`
eos_token = 2
tokens_result = [[]]

# reset KV cache inside the model before inference
infer_request.reset_state()

max_infer = 300
for _ in range(max_infer):
    infer_request.start_async(model_input)
    infer_request.wait()
    # use greedy decoding to get most probable token as the model prediction
    output_token = np.argmax(infer_request.get_output_tensor().data[:, -1, :], axis=-1, keepdims=True)
    tokens_result = np.hstack((tokens_result, output_token))

    if output_token[0][0] == eos_token:
        break

    # Prepare input for new inference
    model_input["input_ids"] = output_token
    model_input["attention_mask"] = np.hstack((model_input["attention_mask"].data, [[1]]))
    model_input["position_ids"] = np.hstack((model_input["position_ids"].data, [[model_input["position_ids"].data.shape[-1]]]))

# Stage 4
# Decode the model output back to string
text_result = detokenizer(tokens_result)["string_output"]
print(f"Prompt:\n{text_input[0]}")
print(f"Generated:\n{text_result[0]}")

# Print performance statistics
print("\nPerformance Counters:")
for perf_counter in infer_request.get_profiling_info():
    print(f"Layer: {perf_counter.node_name}")
    print(f"  Status: {perf_counter.status}")
    # real_time and cpu_time are datetime.timedelta objects; seconds * 1000 -> ms
    print(f"  Real Time (ms): {perf_counter.real_time.total_seconds() * 1000:.8f}")
    print(f"  CPU Time (ms): {perf_counter.cpu_time.total_seconds() * 1000:.8f}")

run.log

Relevant log output

CPU Time (ms): 0.00000000
Layer: __module.model.norm/aten::mean/ReduceMean
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000
Layer: __module.model.norm/aten::add/Add
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000
Layer: __module.model.norm/aten::rsqrt/Sqrt
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000
Layer: __module.model.norm/aten::rsqrt/Divide
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000
Layer: __module.model.norm/aten::mul/Multiply
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000
Layer: __module.model.norm/aten::mul/Multiply_1
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000
Layer: aten::to/Convert
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000
Layer: Result_58346
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000


ilya-lavrenov commented 5 months ago

You need to pass props.enable_profiling(True) to compile_model

Profiling is not enabled by default because on some devices it adds to the total inference time. Hence, it should be enabled explicitly, and only during the development phase.

HPUedCSLearner commented 5 months ago

Thank you very much for your answer, but I tried it in Python and it still doesn't work; maybe I set it incorrectly. Could you give detailed instructions or code showing how to set this property in Python, and how to set it in C++?

akuporos commented 5 months ago

Hello @HPUedCSLearner,

Try to set properties in your Python code like this:

import openvino as ov
from openvino import properties as props

comp_model = ov.compile_model(
    model,
    config={
        props.enable_profiling: True,
        props.hint.performance_mode: props.hint.PerformanceMode.THROUGHPUT,
    },
)

Also, please check this document; it describes how to set properties in both C++ and Python applications.
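
To double-check that the setting took effect, the property can be read back from the compiled model. A minimal sketch, assuming the comp_model from the snippet above ("PERF_COUNT" is the string key behind props.enable_profiling):

# Confirm profiling is enabled on the compiled model.
print(comp_model.get_property("PERF_COUNT"))  # expected: True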

HPUedCSLearner commented 5 months ago

Thanks a lot, I got it.
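
For completeness, in the reproduction above this amounts to adding the profiling property to the compile_model config. A minimal sketch of that one line, assuming the same script (passing a Python bool for "PERF_COUNT" is assumed to be accepted by the config parser):

infer_request = compile_model(
    model_dir / "openvino_model.xml",
    "CPU",
    {"PERFORMANCE_HINT": "THROUGHPUT", "PERF_COUNT": True},
).create_infer_request()

With profiling enabled, the counters are no longer zero: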

 CPU Time (ms): 0.00100000
Layer: Broadcast_73907
  Status: Status.EXECUTED
  Real Time (ms): 0.00200000
  CPU Time (ms): 0.00200000
Layer: __module.model/prim::ListConstruct/Concat_2
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000
Layer: __module.model/aten::zeros/Broadcast
  Status: Status.EXECUTED
  Real Time (ms): 0.00100000
  CPU Time (ms): 0.00100000
Layer: __module.model/aten::cat/Concat
  Status: Status.EXECUTED
  Real Time (ms): 0.00200000
  CPU Time (ms): 0.00200000
Layer: ShapeOf_79614
  Status: Status.NOT_RUN
  Real Time (ms): 0.00000000
  CPU Time (ms): 0.00000000
Layer: __module.model/aten::ones_like/Broadcast
  Status: Status.EXECUTED
  Real Time (ms): 0.00200000
  CPU Time (ms): 0.00200000
Layer: __module.model/aten::add/Add_2
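
With the counters populated, the per-layer list can be summarized to find the most expensive operators. A minimal sketch, assuming the infer_request from the reproduction above; the aggregation itself is illustrative, not part of the OpenVINO API:

from datetime import timedelta

perf = infer_request.get_profiling_info()
executed = [p for p in perf if "EXECUTED" in str(p.status)]

# Total wall-clock time across executed layers for the last inference.
total = sum((p.real_time for p in executed), timedelta())
print(f"{len(executed)} layers executed, total real time: {total.total_seconds() * 1000:.3f} ms")

# The ten most expensive layers by real (wall-clock) time.
for p in sorted(executed, key=lambda x: x.real_time, reverse=True)[:10]:
    print(f"{p.real_time.total_seconds() * 1000:9.3f} ms  {p.node_type:<20} {p.node_name}")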