The error in UTF-8 encoding when storing extracted Chinese entity names in vdb_entities.json.

gusye1234 / nano-graphrag

A simple, easy-to-hack GraphRAG implementation

MIT License

839 stars 87 forks

The error in UTF-8 encoding when storing extracted Chinese entity names in vdb_entities.json. #63

Open zhouyujin opened 5 days ago

zhouyujin commented 5 days ago

When I use all-MiniLM-L6-v2 as the local embedding model and store the extracted Chinese entities in vdb_entities.json, the file contents look like the sample below: the Chinese text is written as \uXXXX Unicode escape sequences rather than as readable UTF-8 characters. The encoding used when writing JSON files in _utils.py is "utf-8". Is this related to the embedding model I chose? How should I fix it?

{"embedding_dim": 384, "data": [{"id": "ent-3823944778412382ad171c8152055cdd", "entity_name": "\"\u592a\u767d\u91d1\u661f\u674e\u957f\u5e9a\""}, {"id": "ent-d6bbc7e0fc25691f48df9a5550de5c64", "entity_name": "\"\u8001\u9e64\""}, {"id": "ent-71cc9a28268df396b25d165ae4b899a6", "entity_name": "\"\u542f\u660e\u6bbf\""}, ……
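For reference, output like the sample above is what Python's standard json module produces by default: with ensure_ascii=True, every non-ASCII character is written as a \uXXXX escape, regardless of the file's encoding or the embedding model. The sketch below only illustrates that standard-library behaviour. It assumes, without having checked, that the code writing vdb_entities.json calls json.dump without ensure_ascii=False; the record contents and the write_json helper are hypothetical and are not taken from nano-graphrag.

```python
import json

# Hypothetical record shaped like the entries in the sample above.
# "\u8001\u9e64" is the same two-character string that appears there.
record = {"id": "ent-example", "entity_name": "\"\u8001\u9e64\""}

# Default behaviour: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters themselves in the output string.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms are valid JSON and decode to the same Python object.
assert json.loads(escaped) == json.loads(readable) == record


def write_json(obj, path):
    """Hypothetical helper: write obj as JSON with readable non-ASCII text."""
    # encoding="utf-8" controls how characters become bytes on disk;
    # ensure_ascii=False controls whether they are escaped in the first place.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False)
```

Either form loads back to the same entity names, so the escaped file is not corrupted. Passing ensure_ascii=False only changes how the file looks when opened in a text editor.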

gusye1234 commented 4 days ago

Are you using the latest commit?