Inference of Chinese handwritten characters is bad

naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥

http://tesseract.projectnaptha.com/

Apache License 2.0


Inference of Chinese handwritten characters is bad #905

Closed piscopancer closed 7 months ago

piscopancer commented 7 months ago

tesseract 5.0.5

I draw on a canvas and feed it to a worker. It recognizes handwritten Chinese characters correctly about 1 time in 4. In other words, it works, but I expected more. I may have misconfigured my Tesseract worker, or perhaps Tesseract is simply not trained to work with handwritten characters and I just do not know it?

Please tell me how to improve the inference.

[Screenshot: 2024-03-29_22-03]

This is my code (nextjs, typescript)

'use client'

import Tesseract, { createWorker } from 'tesseract.js'

// ...
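// Create a worker for Simplified Chinese; the second argument (1) selects the LSTM-only engine.
async function createSetWorker() {
  const worker = await createWorker('chi_sim', 1)
  return worker
}

async function recognize(worker: Tesseract.Worker) {
  // Nothing to recognize until the canvas element has mounted.
  if (!canvasRef.current) return
  const {
    data: { text, symbols },
  } = await worker.recognize(canvasRef.current, {}, {})
  // Flatten every symbol's alternative candidates into a single list.
  setRecognition({ text, symbols: symbols.flatMap((s) => s.choices) })
}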
// ...
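
Since the snippet above already collects each symbol's `choices`, one way to make low accuracy more tolerable in a UI is to show the user a few ranked candidates per character instead of a single guess. The following is only a sketch: `topChoices`, `RecognizedSymbol`, and the sample data are hypothetical, and the local `Choice` shape (a `text` plus a 0-100 `confidence`) is an assumption about what the recognizer returns, so check it against the types in your installed version.

```typescript
// Local stand-ins for the shapes used above. Assumption: each choice carries
// `text` and a 0-100 `confidence`, mirroring what the recognizer reports.
interface Choice {
  text: string
  confidence: number
}

interface RecognizedSymbol {
  text: string
  confidence: number
  choices: Choice[]
}

// Hypothetical helper: for each symbol, keep at most `limit` candidates whose
// confidence is at least `minConfidence`, sorted from best to worst.
function topChoices(
  symbols: RecognizedSymbol[],
  limit = 3,
  minConfidence = 0,
): Choice[][] {
  return symbols.map((symbol) =>
    [...symbol.choices] // copy so the caller's array is not reordered
      .filter((c) => c.confidence >= minConfidence)
      .sort((a, b) => b.confidence - a.confidence)
      .slice(0, limit),
  )
}

// Made-up data for illustration only (not real recognizer output).
const sample: RecognizedSymbol[] = [
  {
    text: '木',
    confidence: 62,
    choices: [
      { text: '本', confidence: 31 },
      { text: '木', confidence: 62 },
      { text: '术', confidence: 5 },
    ],
  },
]

const ranked = topChoices(sample, 2, 10)
for (const candidates of ranked) {
  console.log(candidates.map((c) => `${c.text} (${c.confidence})`).join(', '))
}
```

In the component above, this could replace the plain `flatMap` so that the UI offers the user a short pick list per character. It does not make recognition itself any more accurate.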

Balearica commented 7 months ago

Handwritten text is not supported by Tesseract. The Tesseract OCR model is built around assumptions that only hold for printed text. No combination of options will significantly improve performance with handwritten text. Unless your handwriting is so good that it closely resembles printed text, the results will be poor.

piscopancer commented 7 months ago

@Balearica Understood. I will keep it in my project anyway because I need to let users scan images. Can you recommend a library for handwriting recognition? I have learned about handwriting.js, but it is a GitHub-only project that was never published to npm, it uses outdated JavaScript, its API is painful, and it has no TypeScript support. Also, is it possible to feed Tesseract a different dataset compiled from photos of handwritten Chinese characters? Would that work?

Balearica commented 7 months ago

@piscopancer I do not personally know of any libraries for recognizing handwritten text.

It may be possible to improve results using different language data; however, I don't know of any existing language data that does this, and I am not overly optimistic that such language data could be created. You can try searching the main Tesseract git issues or user forum for past discussion. Handwriting gets brought up from time to time, and to the best of my knowledge, nobody has ever claimed they can make it work with high accuracy.
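
For anyone who wants to experiment with different language data anyway, here is a rough sketch of how a worker could be pointed at a self-hosted `.traineddata` file through the `langPath` and `gzip` worker options. The language code `chi_hw`, the `example.com` URL, and the helpers `customLangOptions` and `trainedDataUrl` are all hypothetical; as noted above, no such handwriting data is known to exist. The URL layout is my understanding of how the file is fetched, so treat it as an assumption.

```typescript
// Subset of the worker options relevant to custom language data.
interface LangDataOptions {
  langPath: string // base URL of the folder holding the .traineddata files
  gzip: boolean // whether the hosted files are gzipped (.traineddata.gz)
}

// Hypothetical helper: build the options, stripping any trailing slash
// because langPath is expected without one.
function customLangOptions(baseUrl: string, gzipped = false): LangDataOptions {
  return { langPath: baseUrl.replace(/\/+$/, ''), gzip: gzipped }
}

// Hypothetical helper: the file the worker is assumed to request for `lang`.
function trainedDataUrl(lang: string, options: LangDataOptions): string {
  return `${options.langPath}/${lang}.traineddata${options.gzip ? '.gz' : ''}`
}

// Placeholder host and language code, for illustration only.
const langOptions = customLangOptions('https://example.com/tessdata/')
console.log(trainedDataUrl('chi_hw', langOptions))

// Usage with tesseract.js (not executed in this sketch):
//   const worker = await createWorker('chi_hw', 1, langOptions)
```

Even with the loading side sorted out, the hard part remains producing language data that actually works on handwriting.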