Currently we have fixed-size chunking (every N words/tokens), which breaks hanging sentences, tables, and code. We want a better, context-aware way to do this chunking. Would be good to add this to our list of todos.
Applied a document-specific text splitter from LangChain in place of the original naive version.
Made heuristic changes for markdown files, in particular using regex to trim markdown tables in an attempt to fit a whole table into the limited context window.
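One way such a regex heuristic can work is to squeeze the cosmetic padding out of table rows so a wide table costs fewer characters/tokens; this is an illustrative sketch, not the exact regex used.

```python
import re

# Matches a markdown table row: a line starting and ending with a pipe.
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")

def trim_table_rows(markdown: str) -> str:
    """Collapse whitespace padding inside markdown table rows."""
    out = []
    for line in markdown.splitlines():
        if TABLE_ROW.match(line):
            # Squeeze padding around cell separators: "| a   | b |" -> "|a|b|"
            line = re.sub(r"\s*\|\s*", "|", line.strip())
        out.append(line)
    return "\n".join(out)

trimmed = trim_table_rows(
    "| col a   | col b |\n|---------|-------|\n| 1       | 2     |"
)
print(trimmed)
```

The table stays syntactically valid markdown, just denser, which raises the odds the whole table fits in one chunk.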
For the updated chunk_document() function, see Chunking_Demo.ipynb, which demonstrates chunking with server_ctx_size=4096 and chunk_word_count=1024. Granite 7B has a 4k context window.
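To show how those two parameters relate, here is a hedged, simplified reimplementation of a chunk_document()-style word-count split; the real function lives in the repo and this sketch is illustrative only.

```python
def chunk_document(
    text: str, server_ctx_size: int = 4096, chunk_word_count: int = 1024
) -> list[str]:
    """Naive word-count chunker (illustrative stand-in for the real one).

    Rough heuristic: at ~1.3 tokens per word, 1024 words is ~1330 tokens,
    which leaves headroom inside a 4096-token context window for the
    prompt template and the model's generated output.
    """
    words = text.split()
    return [
        " ".join(words[i : i + chunk_word_count])
        for i in range(0, len(words), chunk_word_count)
    ]

doc = ("word " * 2500).strip()
chunks = chunk_document(doc)
print(len(chunks))  # 2500 words at 1024 per chunk -> 3 chunks
```

The point of the demo notebook settings is exactly this budgeting: chunk_word_count must be chosen so a chunk plus its surrounding prompt stays under server_ctx_size.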