Running the default example doesn't work:
Namespace(verbose=True, batch_size_for_cuda_graph=1, chat_template='', model='.\\example-models\\phi2-int4-directml')
Loading model...
Model loaded
Tokenizer created
Prompt(s) encoded: ['I like walking my cute dog', 'What is the best restaurant in town?', 'Hello, how are you today?']
Args: Namespace(verbose=True, batch_size_for_cuda_graph=1, chat_template='', model='.\\example-models\\phi2-int4-directml')
Search options: {}
GeneratorParams created
Generating tokens ...
2024-06-21 11:01:47.9577714 [E:onnxruntime:onnxruntime-genai, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(1066)\onnxruntime.dll!00007FFA8D7BA2E1: (caller: 00007FFA8D849109) Exception(2) tid(72f0) 887A0005 GPU
Traceback (most recent call last):
  File "C:\Users\skyline\Projects\onnxruntime-genai\examples\python\model-generate.py", line 76, in <module>
    main(args)
  File "C:\Users\skyline\Projects\onnxruntime-genai\examples\python\model-generate.py", line 46, in main
    output_tokens = model.generate(params)
onnxruntime_genai.onnxruntime_genai.OrtException
Reducing the number of prompts encoded leads to a successful run:
Namespace(verbose=True, batch_size_for_cuda_graph=1, chat_template='', model='.\\example-models\\phi2-int4-directml')
Loading model...
Model loaded
Tokenizer created
Prompt(s) encoded: ['Hello, how are you today?']
Args: Namespace(verbose=True, batch_size_for_cuda_graph=1, chat_template='', model='.\\example-models\\phi2-int4-directml')
Search options: {}
GeneratorParams created
Generating tokens ...
Prompt #0: Hello, how are you today?
Hello, how are you today?
# The output of the program is
# The output of the program is
.......
Tokens: 1375 Time: 15.27 Tokens per second: 90.07
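Since batch size 1 succeeds where the three-prompt batch fails, one possible workaround until the DirectML batching issue is resolved is to feed the prompts through one at a time. A minimal sketch (the chunking helper is hypothetical, not part of model-generate.py; the commented-out lines show roughly where the encode/generate calls from the example would go):

```python
# Hypothetical workaround sketch: split the prompt list into single-prompt
# chunks so each generate call sees batch size 1.
def chunk_prompts(prompts, batch_size=1):
    """Yield successive slices of `prompts` with at most `batch_size` items."""
    for i in range(0, len(prompts), batch_size):
        yield prompts[i:i + batch_size]

# Prompts taken from the failing run above.
prompts = [
    'I like walking my cute dog',
    'What is the best restaurant in town?',
    'Hello, how are you today?',
]

for batch in chunk_prompts(prompts):
    # In model-generate.py this is roughly where each batch would be encoded
    # and generated (calls commented out so the sketch runs stand-alone):
    # input_tokens = tokenizer.encode_batch(batch)
    # params = og.GeneratorParams(model)
    # params.input_ids = input_tokens
    # output_tokens = model.generate(params)
    print(batch)
```

This trades the batched call for three sequential ones, so throughput drops, but it avoids triggering the DmlFusedNode_0_0 failure seen with batch size 3.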