instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

Make tesserocr vs easyocr configurable for Docling integration #352

Open bbrowning opened 4 hours ago

bbrowning commented 4 hours ago

Docling defaults to easyocr for optical character recognition (OCR), but some of our downstream consumers would prefer to use Docling's tesserocr support instead. We need to give users a way to choose the OCR engine, because swapping it currently requires code changes in our Docling integration.
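One possible shape for this, as a rough sketch rather than a proposed implementation: resolve the engine name from an explicit argument, then an environment variable, then the default, and only afterwards build the matching Docling options. The names `resolve_ocr_engine`, `build_ocr_options`, and `SDG_OCR_ENGINE` are hypothetical and do not exist in instructlab-sdg today. The sketch also assumes that Docling exposes `EasyOcrOptions` and `TesseractOcrOptions` in `docling.datamodel.pipeline_options`, which should be checked against the Docling version we pin.

```python
import os

# Hypothetical names for illustration only; none of these exist in instructlab-sdg.
SUPPORTED_OCR_ENGINES = ("easyocr", "tesserocr")
DEFAULT_OCR_ENGINE = "easyocr"  # Docling's default, per this issue
OCR_ENGINE_ENV_VAR = "SDG_OCR_ENGINE"  # hypothetical environment variable


def resolve_ocr_engine(explicit=None, env=None):
    """Pick the OCR engine: explicit argument, then env var, then default."""
    env = os.environ if env is None else env
    choice = explicit or env.get(OCR_ENGINE_ENV_VAR) or DEFAULT_OCR_ENGINE
    choice = choice.strip().lower()
    if choice not in SUPPORTED_OCR_ENGINES:
        raise ValueError(
            f"Unsupported OCR engine {choice!r}; "
            f"expected one of {SUPPORTED_OCR_ENGINES}"
        )
    return choice


def build_ocr_options(engine):
    """Return a Docling OCR options object for the chosen engine.

    Assumption: these classes are importable from
    docling.datamodel.pipeline_options in the installed Docling version.
    The import is deferred so the resolver above works without Docling.
    """
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        TesseractOcrOptions,
    )

    if engine == "tesserocr":
        return TesseractOcrOptions()
    return EasyOcrOptions()
```

With something like this, the integration would call `build_ocr_options(resolve_ocr_engine(user_choice))` in one place, and unknown engine names would fail early with a clear error instead of silently falling back.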

bbrowning commented 4 hours ago

This likely also means we need to change the docling entry in requirements.txt from docling to docling[tesserocr]. The tesserocr extra pulls in both tesserocr and easyocr, so a single dependency lets us switch between the two.
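If we want to guard against environments where the tesserocr package did not get installed, a small availability check could help. This is only a sketch: `is_module_available` and `pick_available_engine` are hypothetical helpers, and falling back to easyocr is an assumption about the behavior we would want, not something decided here.

```python
import importlib.util


def is_module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_available_engine(preferred, is_available=is_module_available):
    """Use the preferred engine if its module is installed, else easyocr.

    Hypothetical helper: the fallback policy is an assumption for
    illustration, not agreed behavior.
    """
    if preferred == "tesserocr" and not is_available("tesserocr"):
        return "easyocr"
    return preferred
```

Raising an error instead of falling back may be the better choice if downstream consumers need tesserocr specifically.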