instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

Further update chunking strategies to improve performance. #66

Closed PalmPalm7 closed 4 months ago

PalmPalm7 commented 4 months ago

This issue follows up on https://github.com/instructlab/sdg/issues/34 and on the PR that attempts to resolve it, https://github.com/instructlab/sdg/pull/65.

It aims to improve on the following:

  1. Implement the document-specific chunking from the original PR, where dependencies were in a soft-freeze state.
  2. Address the maximum chunk length issue.
  3. Build a benchmark, using DeepEval, Truera, or other methods, to evaluate how much retrieval from RAG could be improved.
  4. Improve the heuristics logic to accept numerous file formats, with corresponding parsing and chunking for each.
  5. Experiment with semantic chunking, using Granite to create chunk embeddings.
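
To make items 2 and 4 more concrete, here is a minimal sketch of what a format-aware chunker with a hard length cap could look like. It is not the code from the PR or from instructlab/sdg: the names `SEPARATORS`, `split_text`, and `chunk_document` are hypothetical, the separator lists are illustrative guesses, and length is counted in characters here, whereas a real implementation would probably count tokens.

```python
from pathlib import Path

# Hypothetical separator hierarchies per file type (illustrative only,
# not taken from instructlab/sdg). Coarser separators come first.
SEPARATORS = {
    ".md": ["\n## ", "\n\n", "\n", " "],
    ".txt": ["\n\n", "\n", " "],
}
DEFAULT_SEPARATORS = ["\n\n", "\n", " "]


def split_text(text, max_len, separators):
    """Recursively split text so that no chunk exceeds max_len characters."""
    if len(text) <= max_len:
        # Short enough already; drop whitespace-only fragments.
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return split_text(text, max_len, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            # Keep merging neighbouring pieces while they still fit.
            current = candidate
        else:
            if current:
                # Flush; an oversized piece is split again with finer separators.
                chunks.extend(split_text(current, max_len, finer))
            current = piece
    if current:
        chunks.extend(split_text(current, max_len, finer))
    return chunks


def chunk_document(filename, text, max_len=1000):
    """Pick a separator hierarchy from the file extension, then chunk."""
    suffix = Path(filename).suffix.lower()
    return split_text(text, max_len, SEPARATORS.get(suffix, DEFAULT_SEPARATORS))


if __name__ == "__main__":
    doc = "# Title\n\nIntro paragraph.\n## Section A\n" + "word " * 50 + "\n## Section B\nShort."
    for chunk in chunk_document("notes.md", doc, max_len=80):
        print(len(chunk), repr(chunk[:30]))
```

In this sketch the separator that triggers a split is dropped from the output (for example the `## ` heading marker between sections), which a real implementation might prefer to keep. Unknown extensions fall back to the plain-text hierarchy, so adding a format only means adding one entry to the mapping.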

cc: @aakankshaduggal @abhi1092 @shivchander

ktam3 commented 4 months ago

Per our convo, closing this as done