khoj-ai / khoj

Your AI second brain. Get answers to your questions, whether they be online or in your own notes. Use online AI models (e.g. gpt4) or private, local LLMs (e.g. llama3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, Desktop app, Web or WhatsApp.
https://khoj.dev
GNU Affero General Public License v3.0

[FIX] Splitting the entry into words solely by spaces results in the omission of content #620

Closed aleung closed 5 months ago

aleung commented 8 months ago

Describe the bug

Splitting the entry into words solely by spaces and discarding long words can result in the omission of meaningful content from the corpus, especially when dealing with CJK languages.

To Reproduce

Create a text file in which every line except the first contains a single word with no spaces. Repeat these lines until the file is longer than 500 characters.

First Line
dog=5
cat=10
apple=3
car=7
book=2
tree=9
pen=4
chair=8
phone=6
lamp=1
...more_lines_without_space...

Index this file. The compiled column contains only a single word, "First".

Additional context

Such text is unusual in languages written in Latin script, but it is possible in Chinese and Japanese, where words are not separated by spaces. The following text contains no spaces:

Khoj是一个开源的AI个人助手,通过索引你的个人文本或图像数据来工作。你可以自行托管Khoj,并开始搜索和与你的组织、Markdown、PDF笔记聊天。Khoj正在努力帮助你充分利用你的数据。

需求人群:
用于搜索和管理个人知识库,支持搜索和与你的组织、Markdown、PDF笔记聊天等功能

产品特色:
快速语义搜索;
聊天功能;
自定义扩展插件;

The issue is caused by this code in khoj/src/khoj/processor/content/text_to_entries.py:

            # Split entry into words
            compiled_entry_words = [word for word in entry.compiled.split(" ") if word != ""]

            # Drop long words instead of having entry truncated to maintain quality of entry processed by models
            compiled_entry_words = [word for word in compiled_entry_words if len(word) <= max_word_length]
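The effect of these two steps can be reproduced outside khoj with a minimal sketch. The function name `compile_words` and the limit of 500 are assumptions for illustration (the limit is inferred from the reproduction steps above, not read from the khoj source); only the two list comprehensions are taken from the quoted code.

```python
# Hypothetical stand-in for khoj's max_word_length; the real value may differ.
MAX_WORD_LENGTH = 500


def compile_words(compiled: str, max_word_length: int = MAX_WORD_LENGTH) -> list[str]:
    """Apply the two filtering steps quoted from text_to_entries.py."""
    # Split on the space character only, so newlines stay inside "words"
    words = [word for word in compiled.split(" ") if word != ""]
    # Drop any word longer than the limit
    return [word for word in words if len(word) <= max_word_length]


# Build the file from the reproduction steps: one spaced first line,
# then many lines that contain no spaces at all.
lines = ["dog=5", "cat=10", "apple=3", "car=7", "book=2",
         "tree=9", "pen=4", "chair=8", "phone=6", "lamp=1"]
text = "First Line\n" + "\n".join(lines * 10)

kept = compile_words(text)
print(kept)
```

Because the only space in the file sits between "First" and "Line", everything from "Line" to the end of the file becomes one word of roughly 700 characters, and that word is dropped by the length filter.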

A possible workaround could involve splitting the entry into words not only by spaces but also by newlines. A more robust solution might be to use LangChain's RecursiveCharacterTextSplitter.
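A rough sketch of the whitespace-based workaround, using only the Python standard library. The function name `split_words` and the default limit of 500 are hypothetical. Chunking over-long tokens instead of dropping them is an assumption added here so that unspaced CJK text survives; it is not how khoj or LangChain implements splitting.

```python
def split_words(compiled: str, max_word_length: int = 500) -> list[str]:
    """Split on any whitespace and keep long tokens by cutting them into chunks."""
    words: list[str] = []
    # str.split() with no argument splits on runs of any whitespace
    # (spaces, tabs, newlines) and never yields empty strings.
    for token in compiled.split():
        # Assumption: cut an over-long token into fixed-size pieces
        # rather than discarding it entirely.
        for start in range(0, len(token), max_word_length):
            words.append(token[start:start + max_word_length])
    return words


latin = split_words("First Line\ndog=5\ncat=10")
cjk = split_words("是" * 1200)
print(latin)
print([len(chunk) for chunk in cjk])
```

Fixed-size chunks ignore word and sentence boundaries, so a splitter that first tries paragraph, line, and punctuation separators would likely produce better chunks for the embedding model.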

debanjum commented 8 months ago

Thanks for such a detailed & organized bug report! ❤️

Your point makes sense. Let me try to reproduce (and figure out a fix for) the issue soon.