instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

Prefer tesserocr vs easyocr for Docling integration, when available #352

Closed by bbrowning 1 week ago

bbrowning commented 2 weeks ago

Docling defaults to using easyocr for optical character recognition, but we have some downstream consumers that will prefer to use Docling's tesserocr for OCR. We need to expose a way for users to influence which we use, as it requires code changes in our Docling integration to swap the OCR engine used.

bbrowning commented 2 weeks ago

This will likely also mean we need to change the docling entry in requirements.txt to pull in docling[tesserocr] instead of docling. The tesserocr variant pulls in both tesserocr and easyocr, so a single dependency lets us switch between the two.

bbrowning commented 1 week ago

Instead of exposing a new configuration knob here, we'll just prefer tesserocr when it's available and automatically fall back to easyocr when it isn't. If neither tesserocr nor easyocr loads, we'll log an error and disable optical character recognition.