instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

Updated Chunking Methods #45

Open PalmPalm7 opened 5 days ago

PalmPalm7 commented 5 days ago

Demonstrated new chunking methods in place of the original RecursiveCharacterTextSplitter()

Notable updates:

  1. Used an efficient library to detect the file type.
  2. Applied document-specific text splitters from LangChain in place of the original naive version (a rough sketch of the idea follows this list).
  3. Made heuristic changes for markdown files, in particular using regex to trim markdown tables in an attempt to fit whole tables within the limited context window.
  4. For the updated chunk_document() function, see Chunking_Demo.ipynb for chunking with server_ctx_size=4096 and chunk_word_count=1024 (Granite 7b has a 4k context window).
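
As a rough illustration of item 2 (not the PR's actual code), a document-type-aware splitter could be built with LangChain's from_language constructor; the extension-to-language mapping and parameter values below are assumptions for this sketch, and it falls back to markdown by default:

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

def split_by_doc_type(text, file_ext, chunk_size=1024, chunk_overlap=100):
    """Pick a language-aware splitter from the file extension; default to markdown."""
    # Hypothetical extension-to-language mapping; the PR may detect file types differently.
    ext_to_language = {
        ".py": Language.PYTHON,
        ".rs": Language.RUST,
        ".go": Language.GO,
        ".md": Language.MARKDOWN,
    }
    language = ext_to_language.get(file_ext, Language.MARKDOWN)
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=language,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    return splitter.split_text(text)

print(split_by_doc_type("# Title\n\nBody text.\n\n## Section\n\nMore text.", ".md", chunk_size=32, chunk_overlap=0))
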
PalmPalm7 commented 3 days ago

@russellb @abhi1092 Hi Russell and Abhi, I have made the final changes to the chunk_document() method and it passes all CI tests. Could you review my PR? Thanks!

mergify[bot] commented 2 days ago

This pull request has merge conflicts that must be resolved before it can be merged. @PalmPalm7 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

russellb commented 1 day ago

The code moved to src/instructlab/sdg/utils/chunking.py, sorry for the trouble

abhi1092 commented 3 hours ago

A couple of things @PalmPalm7:

  • Is it assumed that if the file is not a code file, it is always markdown?
  • How do you handle chunks that are larger than the provided server_ctx_size?

PalmPalm7 commented 2 hours ago

Thank you for the review and the comments, @abhi1092!

1) My justification is that, as we discussed, I assume our most common use case is PDFs converted into markdown format, so the default case should be markdown. Furthermore, by specifying the language param in RecursiveCharacterTextSplitter, it uses the following separators:

RecursiveCharacterTextSplitter.get_separators_for_language(Language.MARKDOWN)

Output:
['\n#{1,6} ',
 '```\n',
 '\n\\*\\*\\*+\n',
 '\n---+\n',
 '\n___+\n',
 '\n\n',
 '\n',
 ' ',
 '']

instead of these original separators:

["\n\n", "\n", " "]

2) This is a valid concern. I reused the original chunk_document()'s logic, but it doesn't properly handle chunks that exceed the context length either.
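
For what it's worth, one hypothetical safeguard (not part of this PR) would be to re-split any chunk whose rough size estimate exceeds the server context window, e.g.:

from langchain_text_splitters import RecursiveCharacterTextSplitter

def enforce_ctx_limit(chunks, server_ctx_size, chars_per_token=4):
    """Re-split chunks whose crude character-based token estimate exceeds server_ctx_size."""
    max_chars = server_ctx_size * chars_per_token  # rough heuristic, not a real tokenizer
    fallback = RecursiveCharacterTextSplitter(chunk_size=max_chars, chunk_overlap=0)
    safe_chunks = []
    for chunk in chunks:
        if len(chunk) > max_chars:
            safe_chunks.extend(fallback.split_text(chunk))
        else:
            safe_chunks.append(chunk)
    return safe_chunks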