instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

Prefer tesserocr over easyocr, if available #369

Closed: bbrowning closed this 1 week ago

bbrowning commented 1 week ago

When setting up our ingestion pipeline, explicitly check if tesserocr is available and Docling can load it. If so, prefer that. Otherwise, attempt the same for EasyOCR. If neither can load, log an error and disable optical character recognition.

Fixes #352

bbrowning commented 1 week ago

Thanks for the approval! After following some discussion elsewhere about being careful with imports that pull in all of torch, I'm going to add a test to this PR and defer some of the docling/easyocr imports so that transformers and torch are not imported until they're actually needed. It's a small change, but I realized it should go in as part of this PR; otherwise we're loading all of torch fairly early in our import chain.

bbrowning commented 1 week ago

Ok, removing the hold now that we're not importing all of PyTorch as soon as someone imports SDG. Instead, we defer that until Docling actually needs torch loaded, by moving some of our docling imports further down into the code. The added test ensures we don't accidentally regress on that as we do future docling work here.
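One way to write that kind of regression test is to import the package in a fresh interpreter and inspect `sys.modules` afterwards. The sketch below is illustrative only and is not the test added in this PR: `imports_are_lazy` is a hypothetical helper, and `json` stands in for the package under test so the example runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def imports_are_lazy(module: str, forbidden: Sequence[str]) -> bool:
    """Import `module` in a clean subprocess and report whether none of the
    `forbidden` modules were loaded as a side effect.

    A fresh interpreter is used so modules already imported by the test
    runner cannot mask a regression. Note that a failed import of `module`
    also yields False, because the child exits with a non-zero status.
    """
    code = (
        "import sys, importlib; "
        f"importlib.import_module({module!r}); "
        f"bad = [m for m in {list(forbidden)!r} if m in sys.modules]; "
        "sys.exit(1 if bad else 0)"
    )
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in check: importing json should not pull in torch.
    print(imports_are_lazy("json", ["torch"]))
```

In a real test suite, the first argument would be the package whose import cost matters and the second would list the heavy dependencies (for example `torch` and `transformers`) that should only load on first use.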

khaledsulayman commented 1 week ago

Thanks for taking care of this, Ben! 😁

nathan-weinberg commented 1 week ago

@Mergifyio backport release-v0.5

mergify[bot] commented 1 week ago

backport release-v0.5

✅ Backports have been created

* [#391 Prefer tesserocr over easyocr, if available (backport #369)](https://github.com/instructlab/sdg/pull/391) has been created for branch `release-v0.5`