isaac-pocketfm closed this issue 1 month ago
Having a large document split into many text chunks is a very common setup. Can you upload your indexing-engine.log? How big is your input document? Do you know how many text chunks result?
Apologies, I'm new to the library and don't know how to generate indexing-engine.log. I observed the error on a document that is just long enough to be split into 2 chunks. Adding `prechunked: true` to the strategy was an effective workaround.
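For reference, the workaround mentioned above would look roughly like this in the indexing config. The exact key placement and the `graph_intelligence` strategy type are assumptions based on this thread; check the GraphRAG configuration docs for your version:

```yaml
# Sketch of the workaround from this thread (placement is an assumption):
# tell the entity-extraction strategy the input is already chunked, so it
# skips re-splitting and the doc indices stay aligned.
entity_extraction:
  strategy:
    type: graph_intelligence
    prechunked: true
```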
Figured it out, here's the log from one of the offending documents: indexing-engine.log
@natoverse I think you can remove the `awaiting_response` tag.
This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.
This issue has been closed after being marked as stale for five days. Please reopen if needed.
Do you need to file an issue?
Describe the bug
When running the `create_base_extracted_entities` workflow on a large input file, in the `run_extract_entities` function, the call to `text_splitter.split_text` results in `text_list` having a different number of elements than `docs`. This means that the document indices in the results returned by the `extractor` do not align with the `docs` array, causing incorrect assignment of entities to docs and potentially throwing an `IndexError`.
Steps to reproduce
Run `run_pipeline_with_config` with the following config, and with a target text file long enough that it is split into multiple chunks by the default `text_splitter`.

Expected Behavior
Entity extraction should complete successfully.
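The index misalignment described above can be illustrated with a minimal, self-contained sketch. The names (`split_text`, `docs`, `text_list`, `source_index`) mirror the report but the implementation here is hypothetical, not GraphRAG's actual code:

```python
# Minimal sketch of the reported failure mode: if the splitter produces a
# different number of chunks than there are source docs, mapping per-chunk
# extractor results back into `docs` by position misaligns or raises IndexError.

def split_text(text, chunk_size=10):
    # Stand-in for text_splitter.split_text: one doc can yield several chunks.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

docs = ["short doc", "a much longer document that will be split"]
text_list = [chunk for doc in docs for chunk in split_text(doc)]

# 6 chunks vs. 2 docs: the two lists no longer line up element-for-element.
assert len(text_list) != len(docs)

# Naive positional mapping of chunk results back to docs breaks down:
results = [{"source_index": i} for i in range(len(text_list))]
try:
    for result in results:
        docs[result["source_index"]]  # indices 2..5 are out of range for `docs`
except IndexError:
    print("IndexError: chunk index does not map to a doc index")
```

A correct mapping would need to track which source document each chunk came from, rather than reusing the chunk's position as a document index.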
GraphRAG Config Used
Logs and screenshots
Additional Information