mikahama / natas

Python 3 library for processing historical English
Apache License 2.0

Hello, I have just begun studying OCR correction. Could you tell me how to use natas with a pretrained model? #2

Closed trongvanhpkt99 closed 4 years ago

trongvanhpkt99 commented 4 years ago

I want to train a model for correcting OCR output in Vietnamese, so first I want to know how to use a pretrained model.

mikahama commented 4 years ago

We only have a pretrained model for English at the moment, so it will not work with Vietnamese. Natas calls OpenNMT-py in the background, so you can run onmt_translate with your own model, pass it -n_best 10, and filter the results with a dictionary.
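The dictionary filtering step could be sketched roughly as below. This assumes the output file contains n_best consecutive candidate lines per source word, each as space-separated characters; the function name and the toy dictionary are hypothetical and not part of natas or OpenNMT-py.

```python
def pick_dictionary_candidates(pred_lines, n_best, dictionary):
    """For each source word, return the first candidate found in the dictionary.

    Assumes pred_lines holds n_best consecutive candidates per source word,
    each written as space-separated characters. Returns None for a word
    when none of its candidates is in the dictionary.
    """
    results = []
    for start in range(0, len(pred_lines), n_best):
        group = pred_lines[start:start + n_best]
        # Join the space-separated characters back into whole words.
        candidates = ["".join(line.split()) for line in group]
        match = next((c for c in candidates if c.lower() in dictionary), None)
        results.append(match)
    return results


# Toy data for illustration only (not real model output), with n_best = 2.
dictionary = {"cat", "ran", "away"}
pred_lines = ["c a t", "c a r", "r a n", "r a m", "a v a y", "a w a y"]
print(pick_dictionary_candidates(pred_lines, n_best=2, dictionary=dictionary))
```

Taking the first match keeps the model's own ranking among the candidates that the dictionary accepts.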

trongvanhpkt99 commented 4 years ago

Thank you! Can you give me the English pretrained model and tell me how to use it?

mikahama commented 4 years ago

This is how to use it from Natas:

import natas
natas.ocr_correct_words(["paft", "friendlhip"])

To use it with OpenNMT, you must first download the model.

Then prepare a text file with the words you want to OCR post-correct: one word per line, with each word split into characters separated by spaces.

So if you have the sentence cat ran avvay, you should produce the following text file, ocr_errors.txt:

c a t
r a n
a v v a y
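A small sketch of how such a file might be generated in Python, assuming simple whitespace tokenization. The helper names are hypothetical and not part of natas.

```python
def words_to_char_lines(sentence):
    """Split a sentence on whitespace and space-separate each word's characters."""
    return [" ".join(word) for word in sentence.split()]


def write_source_file(sentence, path):
    """Write one character-split word per line to the given path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words_to_char_lines(sentence)) + "\n")


for line in words_to_char_lines("cat ran avvay"):
    print(line)
```

Calling write_source_file("cat ran avvay", "ocr_errors.txt") would then create the input file in the layout shown above.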

Then you can run onmt_translate -model ocr.pt -src ocr_errors.txt -output ocr_fixed.txt -replace_unk -verbose. This will produce a text file, ocr_fixed.txt, with the OCR corrections. OpenNMT lets you do all sorts of things in translate, so please refer to their documentation as well.
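To map the corrections back to the original words, something like the following could work. It assumes the output file has exactly one character-split line per input line (that is, no -n_best option was passed); the function name is hypothetical and the "fixed" lines are made-up illustrative data, not actual model output.

```python
def pair_corrections(original_words, fixed_lines):
    """Pair each original word with its correction, rejoined from characters.

    Assumes one output line per input word, in the same order.
    """
    if len(original_words) != len(fixed_lines):
        raise ValueError("expected one output line per input word")
    return [
        (word, "".join(line.split()))
        for word, line in zip(original_words, fixed_lines)
    ]


# Illustrative data only; real lines would be read from the output file.
original_words = ["cat", "ran", "avvay"]
fixed_lines = ["c a t", "r a n", "a w a y"]
print(pair_corrections(original_words, fixed_lines))
```

The length check is there because a mismatch usually means the output was produced with different options than the ones assumed here.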

trongvanhpkt99 commented 4 years ago

Thank you! I'll try it