khoj-ai / khoj

Your AI second brain. Get answers to your questions, whether they be online or in your own notes. Use online AI models (e.g. gpt4) or private, local LLMs (e.g. llama3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, Desktop app, Web or WhatsApp.
https://khoj.dev
GNU Affero General Public License v3.0

Error indexing Obsidian files containing special characters (UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f) #495

Closed: DrSAS closed this issue 11 months ago

DrSAS commented 11 months ago

When trying to index my Obsidian files using Khoj, I encountered an error related to the charmap codec. The error message suggests an issue with decoding the byte 0x8f.

Error message:

WARNING  Unable to read file: G:\Mon disque\Obsidian\SlipBox\SlipBox\Dates\2023\2023-10-07.md as markdown. Skipping file.                            fs_syncer.py:171
           WARNING  'charmap' codec can't decode byte 0x8f in position 325: character maps to <undefined>                                                       fs_syncer.py:172
                    ╭─────────────────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────────────────╮
                    │ C:\Users\Benoit\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-pack │
                    │ ages\khoj\utils\fs_syncer.py:169 in get_markdown_files                                                                                  │
                    │                                                                                                                                         │
                    │   166 │   for file in all_markdown_files:                                                                                               │
                    │   167 │   │   with open(file, "r") as f:                                                                                                │
                    │   168 │   │   │   try:                                                                                                                  │
                    │ ❱ 169 │   │   │   │   filename_to_content_map[file] = f.read()                                                                          │
                    │   170 │   │   │   except Exception as e:                                                                                                │
                    │   171 │   │   │   │   logger.warning(f"Unable to read file: {file} as markdown. Skipping                                                │
                    │       file.")                                                                                                                           │
                    │   172 │   │   │   │   logger.warning(e, exc_info=True)                                                                                  │
                    │                                                                                                                                         │
                    │ C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1264.0_x64__qbz5n2kfra8p0\Lib\encodings\cp1252.py:23 in decode   │
                    │                                                                                                                                         │
                    │    20                                                                                                                                   │
                    │    21 class IncrementalDecoder(codecs.IncrementalDecoder):                                                                              │
                    │    22 │   def decode(self, input, final=False):                                                                                         │
                    │ ❱  23 │   │   return codecs.charmap_decode(input,self.errors,decoding_table)[0]                                                         │
                    │    24                                                                                                                                   │
                    │    25 class StreamWriter(Codec,codecs.StreamWriter):                                                                                    │
                    │    26 │   pass                                                                                                                          │
                    ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
                    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 325: character maps to <undefined>

I suspect that the error might be related to special characters or emojis present in my Obsidian files, such as the "✒️" emoji. The files are in UTF-8 and I encounter a large number of errors when trying to index my files with Khoj.
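A minimal sketch of why the emoji could trigger this, assuming the file is UTF-8 on disk and Python falls back to the Windows cp1252 code page because open() was called without an encoding (the traceback above ends in cp1252.py). The "✒️" emoji is U+2712 followed by the variation selector U+FE0F, and the UTF-8 encoding of U+FE0F ends in byte 0x8f, which cp1252 leaves undefined. The variable names are illustrative only.

```python
# "✒️" is BLACK NIB (U+2712) followed by VARIATION SELECTOR-16 (U+FE0F).
text = "\u2712\ufe0f"

# UTF-8 bytes as they would be stored on disk: e2 9c 92 ef b8 8f
raw = text.encode("utf-8")

# Decoding those bytes as cp1252 fails, because 0x8f has no mapping there.
error = None
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    print(f"cp1252 failed on byte {raw[error.start]:#x}: {error.reason}")

# Decoding with the encoding the file was actually written in round-trips.
roundtrip = raw.decode("utf-8")
```

Any note containing an emoji with this variation selector would hit the same byte, which would be consistent with seeing the error on many files.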

I installed it on Windows 11 using Powershell, following the instructions in the documentation.

Thank you! :)

debanjum commented 11 months ago

Hi @DrSAS, thanks for raising this issue and sharing the error message and stack trace. I've pushed a fix for this in commit https://github.com/khoj-ai/khoj/commit/d9d133dfb9d08b32b0ae482fde5462bc39c3f853. Let me know if the issue persists after upgrading your khoj server with: pip install --upgrade --pre khoj-assistant
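The contents of that commit are not reproduced in this thread. As a general sketch of the usual remedy for this class of error, the reader can pass an explicit encoding instead of relying on the platform default. The helper name read_markdown_files and its parameters are hypothetical and are not khoj's API.

```python
import tempfile
from pathlib import Path


def read_markdown_files(paths, encoding="utf-8"):
    """Hypothetical helper: read each file with an explicit encoding.

    Files that cannot be read or decoded are skipped with a message,
    mirroring the skip-and-warn behaviour visible in the traceback.
    """
    contents = {}
    for path in paths:
        try:
            contents[str(path)] = Path(path).read_text(encoding=encoding)
        except (OSError, UnicodeDecodeError) as exc:
            print(f"Unable to read file: {path}. Skipping file. ({exc})")
    return contents


# Usage: write a UTF-8 note containing the emoji, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    note.write_text("Daily note \u2712\ufe0f", encoding="utf-8")
    result = read_markdown_files([note])
```

With the encoding stated explicitly, the result no longer depends on the Windows code page of the machine running the server.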

DrSAS commented 11 months ago

@debanjum The update has resolved the issue. It worked! Thank you!

debanjum commented 11 months ago

Awesome!