Issues with Handling Different Encoding Formats & Prompt Structures

waetr commented 2 months ago

Dear authors,

Thank you very much for your generous and valuable contributions to this project! While hacking on the code, I found a few minor issues, which I have listed below:

1. Error when indexing files with encoding formats other than 'gbk'

While following the instructions in readme.md with mock_data.txt as input (which is encoded as "utf-8-sig"), I encountered an error during the indexing stage. The error output is shown below:

...
⠸ Processed 93 communities
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:nano-graphrag:Writing graph with 678 nodes, 563 edges
Traceback (most recent call last):
  File "D:\GraphRAG\nano-graphrag\tests\test.py", line 9, in <module>
    graph_func.insert(f.read())
  File "D:\GraphRAG\nano-graphrag\nano_graphrag\graphrag.py", line 169, in insert
    return loop.run_until_complete(self.ainsert(string_or_strings))
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 641, in run_until_complete
    return future.result()
  File "D:\GraphRAG\nano-graphrag\nano_graphrag\graphrag.py", line 274, in ainsert
    await self._insert_done()
  File "D:\GraphRAG\nano-graphrag\nano_graphrag\graphrag.py", line 289, in _insert_done
    await asyncio.gather(*tasks)
  File "D:\GraphRAG\nano-graphrag\nano_graphrag\_storage.py", line 37, in index_done_callback
    write_json(self._data, self._file_name)
  File "D:\GraphRAG\nano-graphrag\nano_graphrag\_utils.py", line 74, in write_json
    json.dump(json_obj, f, indent=2, ensure_ascii=False)
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 180, in dump
    fp.write(chunk)
UnicodeEncodeError: 'gbk' codec can't encode character '\u2122' in position 170847: illegal multibyte sequence

In my case, I fixed it by modifying the write_json function so that it writes with the same encoding as the input file:

import json

def write_json(json_obj, file_name):
    # Pass the encoding explicitly instead of relying on the locale default ('gbk' here)
    with open(file_name, "w", encoding="utf-8-sig") as f:
        json.dump(json_obj, f, indent=2, ensure_ascii=False)

I don't expect this to fully resolve the whole issue as it doesn't apply to all encoding formats. Perhaps we could find a more flexible solution that can handle various encoding formats for the input text file.
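
A minimal sketch of one possibly more flexible approach, offered as an assumption rather than the project's actual fix: the 'gbk' codec in the traceback suggests the file was opened with the locale's default encoding. By the time write_json runs, the input has already been decoded to a Python str, so the output encoding does not need to match the input file. Writing plain "utf-8" (which can encode any character and adds no BOM) and reading with "utf-8-sig" (which accepts files with or without a BOM) may be enough. The names load_json and roundtrip_demo below are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def write_json(json_obj, file_name):
    # Explicit encoding, so the result does not depend on the OS locale
    # (for example 'gbk' on some Windows setups). Plain "utf-8" avoids the
    # BOM that "utf-8-sig" would prepend to the JSON file.
    with open(file_name, "w", encoding="utf-8") as f:
        json.dump(json_obj, f, indent=2, ensure_ascii=False)


def load_json(file_name):
    # Hypothetical reader: "utf-8-sig" strips a leading BOM if present and
    # otherwise behaves like "utf-8".
    with open(file_name, encoding="utf-8-sig") as f:
        return json.load(f)


def roundtrip_demo():
    # Hypothetical demo: '\u2122' is the character that 'gbk' could not encode.
    data = {"name": "Acme\u2122"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.json")
        write_json(data, path)
        return load_json(path)
```

With this sketch, the encoding of the input text file only matters at the point where it is read (for example open(path, encoding="utf-8-sig") in the test script), not where the JSON is written.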

2. Issues with csv-formatted prompts

Specifically, in the function async def _build_local_query_context in _op.py, I noticed that the strings communities_context and text_units_context are designed to follow the .csv format. However, while each cell represents the content of a multiline text chunk or a community summary, the cells are not enclosed in double quotes. For example, the string looks like this:

index, content
1, xxx
    xxx
    xxx
2, yyy
    yyy
    yyy
...

I think the following is closer to proper .csv format when a cell contains multiple lines:

index, content
1, "xxx
    xxx
    xxx"
2, "yyy
    yyy
    yyy"
...

In contrast, other strings like entities_context and relations_context are correctly enclosed in quotes to preserve cell boundaries. Although this is a minor issue, I am concerned that it might lead to potential problems, such as misrecognition by LLMs.
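
As a rough sketch of how the quoting could be handled (an assumption, not how _op.py currently builds these strings), Python's standard csv module quotes a cell automatically when it contains the delimiter, a quote character, or a line-terminator character. The helper name rows_to_csv is hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    # Hypothetical helper: serialize rows to a CSV string. With the default
    # QUOTE_MINIMAL setting, only cells that need it (commas, quotes, or
    # newlines when lineterminator is "\n") are wrapped in double quotes.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


context = rows_to_csv(
    [
        ["index", "content"],
        [1, "xxx\nxxx\nxxx"],
        [2, "yyy\nyyy\nyyy"],
    ]
)
print(context)
```

Embedded double quotes are doubled by the writer, so the cell boundaries stay recoverable by any standard CSV reader.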

That's the full description. I wish this project continued growth!

gusye1234 commented 2 months ago

Hi, thanks for your enthusiasm. For the first issue: sorry, I can't find the difference between the current implementation and your fixed version: https://github.com/gusye1234/nano-graphrag/blob/681667963c0f6fbedef53e5dbb6786ac64b2634d/nano_graphrag/_utils.py#L71-L73

Maybe you copied the previous implementation rather than your fix?

For the second issue: Yeah, I think that's a problem. We should enclose multi-line cells in quotes.

waetr commented 2 months ago

For issue 1: sorry for the mistake. I have updated the original post; the fixed version is:

def write_json(json_obj, file_name):
    with open(file_name, "w", encoding="utf-8-sig") as f:
        json.dump(json_obj, f, indent=2, ensure_ascii=False)

gusye1234 / nano-graphrag

Issues with Handling Different Encoding Formats & Prompt Structures #20