microsoft / semantic-kernel

Integrate cutting-edge LLM technology quickly and easily into your apps
https://aka.ms/semantic-kernel
MIT License

Function inputs and escaping special characters with Unicode #7365

Closed sophialagerkranspandey closed 1 month ago

sophialagerkranspandey commented 1 month ago

Discussed in https://github.com/microsoft/semantic-kernel/discussions/7308

Originally posted by **glorious-beard** July 16, 2024

If I set a kernel argument to content containing special characters (HTML tags, for example) and look at the logger output from the kernel when it invokes the function, I notice that the JSON object escapes all of the special characters. For example, if I set "input" to `<p>Version 1.2</p>....`, the function argument looks like:

```json
{"input":"\u003Cp\u003EVersion 1.2\u003C/p\u003E..."}
```

Two questions:

1. Do the extra characters in escaping "<" and ">" with 5 additional characters incur extra token cost?
2. Does the function call unescape these characters before it is sent to the LLM endpoint?

matthewbolanos commented 1 month ago

I'm pretty sure that the characters are only encoded so we can print the log statement (so it shouldn't impact your logic), but adding folks to verify.
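
To illustrate why the logged form need not affect the actual value, here is a minimal Python sketch using only the standard `json` module. It is not Semantic Kernel code; the `logged` string is simply the log line quoted in the question. It shows that each `\uXXXX` escape is a serialization detail that parses back to a single character.

```python
import json

# The argument exactly as it appears in the log line from the question.
logged = '{"input":"\\u003Cp\\u003EVersion 1.2\\u003C/p\\u003E..."}'

# Parsing the JSON turns each \uXXXX escape back into one character.
decoded = json.loads(logged)["input"]

# The escaped text between the quotes, as it is written in the log.
escaped_text = logged[len('{"input":"'):-len('"}')]

print(decoded)            # <p>Version 1.2</p>...
print(len(escaped_text))  # 41 characters as written in the log
print(len(decoded))       # 21 characters in the actual value
```

The 20-character difference is the four angle brackets at 5 additional characters each, and it exists only in the serialized text.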

eavanvalkenburg commented 1 month ago

In Python, I can confirm that they are unescaped before being sent to the model. This happens within the `from_element` method for chat and within the `_invoke_internal` method for text, so the escaping itself does not add extra tokens (although tokenization on the model side might). @sophialagerkranspandey @glorious-beard

markwallace-microsoft commented 1 month ago

We have protection against prompt injection attacks that encodes potentially dangerous tags. If you trust the content, you can change this behaviour. Take a look at this sample to see the available options: https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/ChatPrompts/SafeChatPrompts.cs
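
As a rough illustration of the idea (encode by default, opt out for trusted content), here is a small Python sketch built on the standard `html` module. The `render_argument` helper and its `trusted` flag are hypothetical names made up for this example and are not the Semantic Kernel API; the linked sample shows the real options.

```python
import html


def render_argument(value: str, trusted: bool = False) -> str:
    """Hypothetical helper: encode untrusted input before it is placed in a prompt."""
    # Trusted content passes through unchanged; everything else has its tags encoded.
    return value if trusted else html.escape(value)


user_input = "<p>Version 1.2</p>"

encoded = render_argument(user_input)
print(encoded)                                    # &lt;p&gt;Version 1.2&lt;/p&gt;
print(html.unescape(encoded) == user_input)       # True: the encoding is reversible
print(render_argument(user_input, trusted=True))  # <p>Version 1.2</p>
```

Because the encoding is reversible, the original text can be restored before the prompt is sent, which matches what the earlier comments describe.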

matthewbolanos commented 1 month ago

Closing this issue since it's handled in both C# and Python.