simonw / tools

Assorted tools
https://tools.simonwillison.net
Apache License 2.0

Try OCR rotation using rotateAuto: true #5

Closed: simonw closed this issue 7 months ago

simonw commented 7 months ago

Tesseract.js has a poorly documented option, rotateAuto:

https://github.com/naptha/tesseract.js/blob/03f82eaab57d3c7c852c6e61bfd805c8cf42e8f2/src/index.d.ts#L96-L102

  interface RecognizeOptions {
    rectangle: Rectangle
    pdfTitle: string
    pdfTextOnly: boolean
    rotateAuto: boolean
    rotateRadians: number
  }

Found an example here: https://github.com/naptha/tesseract.js/blob/03f82eaab57d3c7c852c6e61bfd805c8cf42e8f2/examples/browser/image-processing.html

simonw commented 7 months ago

I tried it:

diff --git a/ocr.html b/ocr.html
index 3e4a177..d487c75 100644
--- a/ocr.html
+++ b/ocr.html
@@ -341,7 +341,7 @@ async function convertPDFToImages(file) {
 async function ocrImage(worker, imageUrl) {
   const {
     data: { text },
-  } = await worker.recognize(imageUrl);
+  } = await worker.recognize(imageUrl, {rotateAuto: true});
   return { text };
 }

But it didn't seem to work:

[Screenshot: CleanShot 2024-03-31 at 17 39 24@2x]
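
One way to check whether the option changes anything is to run the same image through recognize twice, once with rotateAuto and once without, and compare the text. A minimal sketch, assuming a Tesseract.js worker like the one passed to ocrImage in ocr.html; compareRotateAuto is a hypothetical helper name, not part of Tesseract.js or this repo:

```javascript
// Hypothetical helper (not part of Tesseract.js or ocr.html): OCR the same
// image with and without rotateAuto and report whether the text differs.
// `worker` is assumed to be a Tesseract.js worker, i.e. an object whose
// recognize(image, options) resolves to { data: { text } }.
async function compareRotateAuto(worker, imageUrl) {
  const plain = await worker.recognize(imageUrl);
  const rotated = await worker.recognize(imageUrl, { rotateAuto: true });
  return {
    plainText: plain.data.text,
    rotatedText: rotated.data.text,
    // true only if enabling rotateAuto produced different text
    changed: plain.data.text !== rotated.data.text,
  };
}
```

If changed comes back false for a page that is visibly rotated, the option is probably not taking effect, which would be consistent with what I saw.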