naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

How do we train it ourselves? #176

Closed: NoobDoesMC closed this issue 5 years ago

NoobDoesMC commented 6 years ago

I want to work on a super specific set of input files.

Can I train it myself?

Scotthorn0 commented 6 years ago

@NoobDoesMC I'm not a contributor, but I don't think Tesseract currently supports that.

The best bet might be to use Tesseract's character blacklist/whitelist parameters (tessedit_char_blacklist and tessedit_char_whitelist). See all the parameters here.

You could run your own data collection, then add characters to or remove them from those lists as you process items and see which settings give better results.

The other option is to build a full data set of the results and start processing that data with your own machine learning or something similar.
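As a rough illustration of the whitelist idea above, here is a minimal TypeScript sketch that derives a candidate whitelist from text you have already verified by hand. `buildWhitelist` is a hypothetical helper, not part of tesseract.js; the resulting string is the kind of value you might supply as Tesseract's `tessedit_char_whitelist` parameter, and how that parameter is passed depends on the tesseract.js version in use.

```typescript
// Hypothetical helper (not part of tesseract.js): collect every distinct
// non-whitespace character that appears in hand-verified sample text.
// Assumes the samples contain only Basic Multilingual Plane characters,
// because split("") works on UTF-16 code units.
function buildWhitelist(samples: string[]): string {
  const seen: { [ch: string]: boolean } = {};
  for (let i = 0; i < samples.length; i++) {
    const chars = samples[i].split("");
    for (let j = 0; j < chars.length; j++) {
      // Skip spaces, tabs and newlines; only visible glyphs are collected.
      if (!/\s/.test(chars[j])) {
        seen[chars[j]] = true;
      }
    }
  }
  // Sort so the result is stable from run to run.
  return Object.keys(seen).sort().join("");
}

// Example: two lines taken from the expected output later in this thread.
const candidateWhitelist = buildWhitelist(["Wins: 856", "Games Played: 2455"]);
console.log(candidateWhitelist);
```

Growing the sample list as you review more results mirrors the "add/remove characters as you process items" approach: characters that never appear in verified output simply stay off the list.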

NoobDoesMC commented 6 years ago

I know Tesseract supports this.

I did some research after asking this question, and it looks like I might be able to create my own language data files, right?

marshallwhiteorg commented 6 years ago

@NoobDoesMC Yes, you can just train using the regular Tesseract engine (the one tesseract.js uses) and create your .traineddata file to be used within tesseract.js.

NoobDoesMC commented 6 years ago

Sounds perfect. It sounds tricky to create those, though. I saw an online tutorial, and it looks like a lot of manual work is also required.

marshallwhiteorg commented 6 years ago

@NoobDoesMC Yeah, I made one with text in a seven-segment font (digital font) and it was not very fun. The hardest part was obtaining data to train on. What type of text are you planning on using?

NoobDoesMC commented 6 years ago

@mwh1te I want to train it to extract the text from screenshots of Minecraft games, e.g. to turn this image:

[Image: screen shot 2018-01-14 at 00 33 43]

Into this text:

NoobDoesMC's Stats
Super Smash Mobs
Wins: 856
Games Played: 2455
SO SUPER!
MLG Pro
...
Click for more details!

I know that I may have to do some pre-processing on the image to make it as accurate as possible, e.g. cropping, but I don't know exactly what yet.
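To make the cropping step concrete, here is a minimal TypeScript sketch that cuts a rectangle out of a raw RGBA pixel buffer (the row-major, 4-bytes-per-pixel layout used by canvas `ImageData`). The `Rect` type and `cropRgba` function are hypothetical names for illustration, and which region to keep for a given screenshot is an assumption you would have to work out yourself.

```typescript
// Region to keep, in pixel coordinates (hypothetical type for this sketch).
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Copy a rectangular region out of a row-major RGBA buffer (4 bytes per pixel).
function cropRgba(
  pixels: Uint8ClampedArray,
  imageWidth: number,
  imageHeight: number,
  rect: Rect
): Uint8ClampedArray {
  if (pixels.length !== imageWidth * imageHeight * 4) {
    throw new RangeError("pixel buffer does not match the image dimensions");
  }
  if (
    rect.x < 0 || rect.y < 0 || rect.width <= 0 || rect.height <= 0 ||
    rect.x + rect.width > imageWidth || rect.y + rect.height > imageHeight
  ) {
    throw new RangeError("crop rectangle must lie inside the image");
  }
  const rowBytes = rect.width * 4;
  const out = new Uint8ClampedArray(rowBytes * rect.height);
  for (let row = 0; row < rect.height; row++) {
    // Byte offset of the first kept pixel on this source row.
    const srcStart = ((rect.y + row) * imageWidth + rect.x) * 4;
    out.set(pixels.subarray(srcStart, srcStart + rowBytes), row * rowBytes);
  }
  return out;
}
```

In a browser, the input buffer could come from `CanvasRenderingContext2D.getImageData(...).data`; the cropped buffer would then need to be turned back into an image before being handed to the OCR step.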

VincentBrule commented 6 years ago

Did you succeed in training Tesseract, please?

pratham2003 commented 6 years ago

@mwh1te Which version of the training files can be used? 3 or 4?

Off-topic: @NoobDoesMC, doing it as your own Java plugin for Minecraft might be easier and give better results than training an OCR engine.

jeromewu commented 5 years ago

To train it yourself, please check the FAQ.