microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: UnicodeEncodeError: 'charmap' codec can't encode character '\u274c' in position 0: character maps to <undefined> #1120

Open monuminu opened 1 week ago

monuminu commented 1 week ago

Do you need to file an issue?

Describe the bug

  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\site-packages\graphrag\index\api.py", line 79, in build_index
    progress_reporter.error(output.workflow)
  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\site-packages\graphrag\index\progress\rich.py", line 127, in error
    self._console.print(f"❌ [red]{message}[/red]")
  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\site-packages\rich\console.py", line 1683, in print
    with self:
  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\site-packages\rich\console.py", line 864, in __exit__
    self._exit_buffer()
  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\site-packages\rich\console.py", line 822, in _exit_buffer
    self._check_buffer()
  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\site-packages\rich\console.py", line 2024, in _check_buffer
    self._write_buffer()
  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\site-packages\rich\console.py", line 2060, in _write_buffer
    legacy_windows_render(buffer, LegacyWindowsTerm(self.file))
  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\site-packages\rich\_windows_renderer.py", line 19, in legacy_windows_render
    term.write_text(text)
  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\site-packages\rich\_win32_console.py", line 403, in write_text
    self.write(text)
  File "C:\Users\mrajguru.conda\envs\kotaemon\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u274c' in position 0: character maps to <undefined>
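The last frames of the traceback suggest the failure happens when the "❌" character (U+274C) is written to a stream encoded as cp1252, rather than while reading the input text. The sketch below is not GraphRAG or rich code; it is a minimal standalone illustration, and the helper names `can_encode` and `write_through` are hypothetical. It uses an in-memory stream in place of the Windows console to show that cp1252 has no mapping for this character, while UTF-8 or a non-strict error handler does not raise.

```python
import io

# The character that rich.py prints in front of the error message.
CROSS_MARK = "\u274c"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_through(text: str, encoding: str, errors: str = "strict") -> bytes:
    """Write `text` through a text stream using `encoding`, the way a console
    stream would, and return the bytes that were produced (hypothetical helper)."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # keep the wrapper from closing `raw` when it is collected
    return data


if __name__ == "__main__":
    # Only booleans and bytes are printed, so this runs on a cp1252 console too.
    print(can_encode(CROSS_MARK, "cp1252"))  # False
    print(can_encode(CROSS_MARK, "utf-8"))   # True
    print(write_through(CROSS_MARK + " failed", "cp1252", errors="replace"))  # b'? failed'
```

Python documents the `PYTHONUTF8=1` and `PYTHONIOENCODING=utf-8` environment variables for changing the encoding of standard streams. Whether either of them avoids the error in this particular setup is not confirmed in this thread.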

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

# Paste your config here

Logs and screenshots

No response

Additional Information

natoverse commented 1 week ago

Is your input text English and UTF-8 encoded?

Sivan22 commented 6 days ago

I had the same problem with Hebrew text.

natoverse commented 6 days ago

We have some notes on non-English text here: #696