instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

Ensure "simple" pipeline does not regress knowledge document chunking behavior #52

Closed by russellb 1 hour ago

russellb commented 2 days ago

During the conversion to the new APIs, we added a "simple" pipeline intended to run in small environments (laptops, etc.). Its document chunking behavior was left as a TODO item and is probably a regression from the previous behavior. It needs to be revisited before we release a new version of the library.

https://github.com/instructlab/sdg/blob/1f71fb67aa46151bc0362e733b06529a1c609e6d/src/instructlab/sdg/generate_data.py#L293-L301
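
One way to guard against this kind of regression is a small check that any chunking step must pass. The sketch below is an assumption for illustration only, not the code in `generate_data.py`: `chunk_words` and `chunking_problems` are hypothetical names, and the word-count limit stands in for whatever size limit the real pipeline uses.

```python
def chunk_words(text: str, max_words: int) -> list[str]:
    """Hypothetical chunker: split text into chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def chunking_problems(document: str, chunks: list[str], max_words: int) -> list[str]:
    """Return a description of each way the chunks violate the expected behavior."""
    problems = []
    # Every chunk must respect the size limit.
    for index, chunk in enumerate(chunks):
        if len(chunk.split()) > max_words:
            problems.append(f"chunk {index} exceeds {max_words} words")
    # Joined back together, the chunks must contain the document's words in order.
    if " ".join(chunks).split() != document.split():
        problems.append("chunks do not reproduce the document's words in order")
    return problems
```

An empty list from `chunking_problems` means the chunks respect the limit and lose no content. A pipeline that passes the whole document through as one chunk would be flagged as soon as the document is longer than `max_words`.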

Follow-up to #46