naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Got wrong result in Vietnamese language #869

Closed sonht1109 closed 10 months ago

sonht1109 commented 10 months ago

Tesseract.js version ^5.0.4

Describe the bug: I ran OCR on an image containing Vietnamese text. The output is mostly correct, but one digit is recognized incorrectly.

To Reproduce: my code

import { createWorker } from 'tesseract.js';

(async () => {
  const worker = await createWorker('vie');
  const ret = await worker.recognize('https://vov.vn/sites/default/files/inline-images/bai1sapo1.jpg');
  console.log(ret.data.text);
  await worker.terminate();
})();

Output: image

Expected behavior: the date is recognized as 21/0/1973, but it should be 21/9/1973.


Balearica commented 10 months ago

Tesseract.js is the JavaScript/WebAssembly port of Tesseract. We do not modify the recognition engine, so accuracy issues in the Tesseract engine are outside the scope of this project. If you would like to pursue this further, you should (1) check whether the issue also occurs with the main (CLI) Tesseract project and (2) if it does, and you believe it is a bug, raise the issue with that project.
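
For anyone who wants to try step (1) from Node, here is a minimal sketch. It assumes the `tesseract` binary and the Vietnamese language data are installed locally, and that the image has been downloaded to a local file first. The helper names (`buildTesseractArgs`, `recognizeWithCli`, `sameText`) are made up for this example and are not part of Tesseract.js.

```javascript
// Build the argument list for the Tesseract CLI.
// `tesseract <image> stdout -l <lang>` prints the recognized text to stdout.
function buildTesseractArgs(imagePath, lang = 'vie') {
  return [imagePath, 'stdout', '-l', lang];
}

// Run the CLI and resolve with the recognized text.
// Assumption: `tesseract` is on the PATH and the requested language is installed.
async function recognizeWithCli(imagePath, lang = 'vie') {
  const { execFile } = await import('node:child_process');
  return new Promise((resolve, reject) => {
    execFile('tesseract', buildTesseractArgs(imagePath, lang), (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}

// Report whether two OCR outputs agree after trimming and collapsing whitespace.
function sameText(a, b) {
  const normalize = (s) => s.trim().replace(/\s+/g, ' ');
  return normalize(a) === normalize(b);
}
```

Inside an async function, something like `const cliText = await recognizeWithCli('./bai1sapo1.jpg')` (a hypothetical local copy of the image) followed by `sameText(cliText, ret.data.text)` would show whether the CLI output matches the Tesseract.js output. The language data files and engine version may differ between the two setups, so a mismatch alone does not prove where the problem lies. If the CLI also prints 21/0/1973, that suggests the problem is in the engine or the model rather than in the port.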