microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

Output for High Token Languages like Japanese #63

Open choprahetarth opened 8 months ago

choprahetarth commented 8 months ago

While the concept is promising, especially for high-token languages like Japanese, I've encountered a significant encoding issue.

Steps to Reproduce:
1. Input a Japanese text prompt into LLMLingua for compression.
2. Observe the output, which should be a compressed version of the original prompt.

Expected Behavior: The compressed output should retain the original Japanese characters without any encoding errors.

Actual Behavior: The output contains a mix of unrecognized characters along with some correct Japanese script. This mixed encoding makes the compressed prompt unusable when passed to GPT-4.
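A minimal repro sketch, assuming the standard `PromptCompressor` usage from the README (the Japanese sample prompt and the `target_token` value are illustrative):

```python
from llmlingua import PromptCompressor

# Default setup as shown in the README; a smaller model can be
# substituted via the model_name argument if GPU memory is tight.
llm_lingua = PromptCompressor()

# Illustrative Japanese prompt: "Japanese consumes more tokens than
# English, so prompt compression is especially important for it."
prompt = "日本語は英語よりも多くのトークンを消費するため、プロンプト圧縮が特に重要な言語です。"

result = llm_lingua.compress_prompt(prompt, target_token=20)
print(result["compressed_prompt"])  # mixes valid Japanese with garbled characters
```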

iofu728 commented 8 months ago

Hi @choprahetarth, thank you for your interest in and support of LLMLingua.

This is a known issue, as seen in #4. We'll address it soon, as detailed in #51.
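For context, this class of failure can be reproduced with any byte-level BPE tokenizer: a single Japanese character often spans several subword tokens, so dropping one token mid-character leaves an incomplete UTF-8 sequence. A sketch (illustrative only; GPT-2's tokenizer stands in here, and the exact mechanism inside LLMLingua may differ):

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE, so multi-byte UTF-8 characters are
# typically split across several token ids.
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok.encode("日本語")
print(len(ids))              # typically more token ids than the 3 characters
print(tok.decode(ids))       # round-trips cleanly: 日本語
print(tok.decode(ids[:-1]))  # truncated byte sequence often decodes with '�'
```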

choprahetarth commented 8 months ago

Is there anything I can contribute to? I'm quite interested in this project. My stack is Python/ML/PyTorch, but I'm not sure which issue to pick up first.