Describe the bug
Splitting an entry into words by spaces alone and then discarding long words can drop meaningful content from the corpus, especially for CJK languages.
To Reproduce
Create a text file in which every line except the first contains no spaces. Repeat such lines until the file is longer than 500 characters. For example:
First Line
dog=5
cat=10
apple=3
car=7
book=2
tree=9
pen=4
chair=8
phone=6
lamp=1
...more_lines_without_space...
Index this file. The compiled column contains only a single word, "First".
Additional context
Such text is unusual in Latin-script languages, but it is normal in Chinese and Japanese, where words are not separated by spaces, so an entire sentence or paragraph is treated as a single word. The relevant code:
# Split entry into words
compiled_entry_words = [word for word in entry.compiled.split(" ") if word != ""]
# Drop long words instead of having entry truncated to maintain quality of entry processed by models
compiled_entry_words = [word for word in compiled_entry_words if len(word) <= max_word_length]
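To see why only "First" survives, here is a minimal standalone sketch that applies the same two steps to the reproduction file. The `filter_words` helper and the default `max_word_length` of 500 are assumptions for illustration (the report only implies a limit of about 500 characters). It is not the actual Khoj function.

```python
def filter_words(text: str, max_word_length: int = 500) -> list[str]:
    """Hypothetical helper mirroring the two steps quoted above."""
    # Split on single spaces only, so newlines stay inside the "words"
    words = [word for word in text.split(" ") if word != ""]
    # Drop any word longer than the limit
    return [word for word in words if len(word) <= max_word_length]


# Rebuild the reproduction file: a first line with one space, then space-free lines
lines = ["dog=5", "cat=10", "apple=3", "car=7", "book=2",
         "tree=9", "pen=4", "chair=8", "phone=6", "lamp=1"]
sample = "First Line\n" + "\n".join(lines * 10)

kept = filter_words(sample)
print(kept)  # ['First']
```

Everything after the only space, starting at "Line", forms one newline-joined word of more than 500 characters, so the filter discards it.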
A possible workaround could involve splitting the entry into words not only by spaces but also by new lines. A good solution for this could be using langchain's RecursiveCharacterTextSplitter.
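A rough sketch of the workaround, in plain Python rather than with LangChain: split on any whitespace, and cut over-long words into pieces instead of dropping them. The `split_words` name, the fixed-size chunking, and the 500-character default are assumptions for illustration, not Khoj's or LangChain's behavior.

```python
def split_words(text: str, max_word_length: int = 500) -> list[str]:
    """Hypothetical alternative: split on any whitespace, chunk long words."""
    words = []
    # str.split() with no argument splits on runs of any whitespace, including newlines
    for word in text.split():
        # Cut over-long words into fixed-size pieces instead of discarding them
        for start in range(0, len(word), max_word_length):
            words.append(word[start:start + max_word_length])
    return words


# Newline-separated tokens are now kept as individual words
newline_words = split_words("First Line\ndog=5\ncat=10")
print(newline_words)  # ['First', 'Line', 'dog=5', 'cat=10']

# A long run without spaces (as in CJK text) is chunked rather than dropped
cjk_chunks = split_words("字" * 1200)
print([len(chunk) for chunk in cjk_chunks])  # [500, 500, 200]
```

Fixed-size chunking can cut a CJK sentence at an arbitrary character, which is one reason a separator-aware splitter such as RecursiveCharacterTextSplitter may be the better long-term fix.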
The issue is caused by the code quoted above, which is in khoj/src/khoj/processor/content/text_to_entries.py.