gusye1234 / nano-graphrag

A simple, easy-to-hack GraphRAG implementation
MIT License

Add docstrings and other comments in the functions #93

Closed yagneshgooglegithub closed 1 week ago

yagneshgooglegithub commented 2 weeks ago

Most of the private functions and other modules lack comprehensive docstrings, which makes the code harder to read and to adapt to custom use cases. Please add them as early as possible.

For example, I am having a hard time understanding many parts of the code that would have been much easier to follow with comments and docstrings, such as:


    chunk_overlap_token_size: int = 100
    tiktoken_model_name: str = "gpt-4o"

    # entity extraction
    entity_extract_max_gleaning: int = 1
    entity_summary_to_max_tokens: int = 500
rangehow commented 1 week ago

As an open source project, we welcome anyone to contribute, including adding whatever comments you deem necessary. As for the four variables in your example:

- `chunk_overlap_token_size`: the number of tokens shared between adjacent chunks when splitting documents with a sliding window. This concept is extremely common in RAG.
- `tiktoken_model_name`: the tokenizer used to calculate document length.
- `entity_extract_max_gleaning`: "gleaning" refers to the number of times entities are extracted from the same chunk, as mentioned in the GraphRAG paper.
- `entity_summary_to_max_tokens`: the token threshold that triggers summarization when extracted entity descriptions exceed that length.
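To make the sliding-window idea concrete, here is a minimal sketch of chunking with overlap. It is illustrative only, not nano-graphrag's actual implementation: the `chunk_tokens` helper and the toy string tokens are made up for this example, and the real code tokenizes with tiktoken using the configured `tiktoken_model_name` before splitting.

```python
def chunk_tokens(tokens, chunk_size=8, overlap=2):
    """Split `tokens` into windows of at most `chunk_size` tokens,
    where each window shares `overlap` tokens with the previous one.

    In nano-graphrag terms, `overlap` plays the role of
    chunk_overlap_token_size (default 100 tokens)."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the document
    return chunks

# Toy "tokens" standing in for a tokenized document.
tokens = [f"t{i}" for i in range(20)]
for chunk in chunk_tokens(tokens):
    print(chunk)
```

The overlap means the tail of each chunk reappears at the head of the next one, so sentences cut at a chunk boundary still appear whole in at least one chunk.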

yagneshgooglegithub commented 1 week ago

> As an open source project, we welcome anyone to contribute, including adding comments you deem necessary. Speaking of the 4 variables in the example: the first variable refers to the overlapping portions between different chunks when splitting documents using a sliding window approach - this concept is extremely common in RAG. The second is the tokenizer used to calculate document length. For the remaining two concepts, "gleaning" refers to the number of times entities are extracted from the same chunk, as mentioned in the GraphRAG paper. The last one is the token threshold that triggers summarization when extracted entities exceed a certain length.

Thanks for the reply. I get that it's an open source project, but please try to add comments to all the fields of the various classes if possible. For other users, it can take 2 to 3 hours of digging here and there to work out what the authors could explain in 2 to 3 seconds. Thanks again; the repo is very easy to use compared to Microsoft's.