instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

Update chunking for knowledge documents #34

Open aakankshaduggal opened 4 days ago

aakankshaduggal commented 4 days ago

Currently we use fixed chunking (every N words/tokens), which has issues with hanging sentences, tables, and code. We want a better, context-aware way to do this chunking. It would be good to add this to our list of todos.
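To make the hanging-sentence problem concrete, here is a minimal sketch that compares a fixed N-word splitter with a simple sentence-aware one. Both function names and the sample text are hypothetical and are not taken from the sdg codebase; the sentence splitter uses a naive regex and is only meant to illustrate the idea, not to handle tables or code.

```python
import re


def fixed_word_chunks(text, n_words):
    """Naive chunking: cut every n_words words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + n_words]) for i in range(0, len(words), n_words)]


def sentence_aware_chunks(text, n_words):
    """Greedily pack whole sentences into chunks of at most n_words words.

    A single sentence longer than n_words is kept whole as its own chunk.
    """
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        length = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + length > n_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += length
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Chunking splits documents. Fixed windows cut sentences in half. Context matters."
fixed = fixed_word_chunks(text, 5)
aware = sentence_aware_chunks(text, 5)
```

With a budget of 5 words, the fixed splitter ends its first chunk in the middle of the second sentence, while the sentence-aware version keeps every sentence intact at the cost of occasionally exceeding the budget.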

PalmPalm7 commented 3 days ago

Demonstrated new chunking methods to replace RecursiveCharacterTextSplitter().

Notable updates:

  1. Used an efficient library to detect the file type.
  2. Applied document-specific text splitters from LangChain in place of the original naive version.
  3. Made heuristic changes for markdown files, in particular using regex to trim markdown tables in an attempt to fit the whole table within the limited context window.
  4. For the updated chunk_document() function, see Chunking_Demo.ipynb, which demonstrates chunking with server_ctx_size=4096 and chunk_word_count=1024. Granite 7b has a 4k context window.
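As a rough illustration of item 3, the sketch below shows one way regex could shrink a markdown table so that more of it fits in a chunk. It assumes that "trim" means collapsing the padding (repeated dashes and spaces) in front of pipe characters; the function name trim_markdown_table and the sample table are hypothetical, and the actual patterns used in the pull request may differ.

```python
import re


def trim_markdown_table(text):
    """Collapse table padding in front of pipe characters (assumed heuristic)."""
    # Separator rows: "------|" becomes "-|"
    text = re.sub(r"-{2,}\|", "-|", text)
    # Cell padding: "value     |" becomes "value |"
    text = re.sub(r" {2,}\|", " |", text)
    return text


table = (
    "| Model      | Context |\n"
    "|------------|---------|\n"
    "| Granite 7b | 4k      |\n"
)
trimmed = trim_markdown_table(table)
print(trimmed)
```

The cell contents are unchanged; only the alignment padding is removed, so the same table costs fewer characters (and typically fewer tokens) of the context window.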

Pull Request: https://github.com/instructlab/sdg/pull/45

Link to Demo: https://colab.research.google.com/drive/1lYHhhqQaOkKQCoFkhx0M7AT7N5UUvdRa?usp=sharing