NoobDoesMC closed this issue 5 years ago
@NoobDoesMC I'm not a contributor, but I don't think Tesseract currently supports that.
Your best bet might be the character blacklist/whitelist parameters (tessedit_char_blacklist and tessedit_char_whitelist). See all the parameters here.
You could run your own data collection and add characters to, or remove them from, those lists as you process items and find what gives better results.
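As a rough sketch of that idea, a whitelist could be derived from transcriptions you have already verified by hand. The helper name buildWhitelist is made up for illustration, and the sample strings come from the expected output quoted later in this thread. Whitespace is skipped here; whether a space needs to be in the list is worth checking against your Tesseract version.

```typescript
// Hypothetical helper (not part of tesseract.js): collect every distinct
// non-whitespace character seen in a set of known-good transcriptions.
function buildWhitelist(samples: string[]): string {
  const seen: Record<string, true> = {};
  for (const sample of samples) {
    for (const ch of sample.split("")) {
      // trim() turns a whitespace character into "", so those are skipped.
      if (ch.trim() !== "") {
        seen[ch] = true;
      }
    }
  }
  // Sort so the result is stable regardless of the order of the samples.
  return Object.keys(seen).sort().join("");
}

const whitelist = buildWhitelist(["Wins: 856", "Games Played: 2455"]);
console.log(whitelist); // 24568:GPWadeilmnsy
```

The resulting string is the kind of value you would then try as tessedit_char_whitelist, for example through the worker's setParameters call in tesseract.js, and adjust as more screenshots are checked.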
The other option is to build a full data set of the results and start processing that data with your own machine learning or something similar.
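To be clear, the following is not machine learning. It is only one simple example of the "something similar" end of that suggestion: once you have a data set of results, you could snap each OCR line to the closest label you already know should appear. The names editDistance and snapToKnown, and the maxDistance threshold, are assumptions made for this sketch.

```typescript
// Classic dynamic-programming Levenshtein distance between two strings.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: return the known label closest to the OCR output,
// or the raw text unchanged if nothing is within maxDistance edits.
function snapToKnown(raw: string, known: string[], maxDistance: number): string {
  let best = raw;
  let bestDistance = maxDistance + 1;
  for (const candidate of known) {
    const d = editDistance(raw, candidate);
    if (d < bestDistance) {
      best = candidate;
      bestDistance = d;
    }
  }
  return best;
}

// Example: a misread "W1ns" is one edit away from the known label "Wins".
console.log(snapToKnown("W1ns", ["Wins", "Games Played"], 2));
```

This only helps for text that comes from a fixed vocabulary (labels such as "Wins"); free-form values like player names or numbers would need a different approach.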
I know Tesseract supports this.
I did some research after asking this question, and it looks like I might be able to create my own language data files, right?
@NoobDoesMC Yes, you can just train using the regular Tesseract engine (the one tesseract.js uses) and create your .traineddata file to be used within tesseract.js.
Sounds perfect. It sounds tricky to create those, though. I saw an online tutorial and it looks like a lot of manual work is also required.
@NoobDoesMC Yeah, I made one for text in a seven-segment (digital) font and it was not very fun. The hardest part was obtaining data to train on. What type of text are you planning on using?
@mwh1te I want to train it to extract the text from screenshots of Minecraft games, e.g. to turn this image:
Into this text:
NoobDoesMC's Stats
Super Smash Mobs
Wins: 856
Games Played: 2455
SO SUPER!
MLG Pro
...
Click for more details!
I know that I may have to do some pre-processing on the image to make it as accurate as possible, e.g. cropping, but idk specifically what yet.
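One possible form of the cropping step, sketched under the assumption that the stats panel sits in roughly the same relative position in every screenshot: convert fractional bounds into a pixel rectangle. The helper cropRegion and the fractions in the example are hypothetical and would need measuring on real screenshots.

```typescript
interface Rectangle {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Clamp a value into the range [0, 1].
function clamp01(value: number): number {
  return Math.max(0, Math.min(1, value));
}

// Hypothetical helper: turn fractional bounds (0..1 of the image size)
// into a pixel rectangle that never extends past the image edges.
function cropRegion(imageWidth: number, imageHeight: number, frac: Rectangle): Rectangle {
  const left = Math.round(clamp01(frac.left) * imageWidth);
  const top = Math.round(clamp01(frac.top) * imageHeight);
  const width = Math.min(Math.round(clamp01(frac.width) * imageWidth), imageWidth - left);
  const height = Math.min(Math.round(clamp01(frac.height) * imageHeight), imageHeight - top);
  return { left, top, width, height };
}

// Example: a region starting a quarter of the way across and halfway
// down a 1920x1080 screenshot (made-up fractions).
const region = cropRegion(1920, 1080, { left: 0.25, top: 0.5, width: 0.5, height: 0.25 });
console.log(region);
```

Recent tesseract.js versions appear to accept a rectangle of this shape as an option to recognize, but check the API docs for the version you use. Cropping the image yourself before passing it in would work just as well.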
Did you succeed in training Tesseract, please?
@mwh1te Which version of the training files can be used? 3 or 4?
Off-topic: @NoobDoesMC, doing it as your own Java plugin for Minecraft might be easier and give better results than training an OCR model.
I want to work on a super specific set of input files.
Can I train it myself?