ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

Model for french medieval manuscript #320

Closed y0un35 closed 5 years ago

y0un35 commented 5 years ago

Hello everyone, I have some old books in French and I want to try ocropus on them, but I couldn't find a good model. What should I do?

kba commented 5 years ago

You could try https://github.com/zuphilip/ocropy-french-models or train your own.

y0un35 commented 5 years ago

I already tried it, but it doesn't work perfectly; the books use an old style of writing. If you have another model, thanks in advance.

I have another question too, about the dataset: it's not annotated (it has no labels). Can ocropus create the labels itself and then compute accuracy based on them?

stweil commented 5 years ago

@y0un35, are your scans accessible online? That would help in giving advice.

kba commented 5 years ago

You can also try kraken with e.g. https://github.com/PonteIneptique/toebler-ocr.

y0un35 commented 5 years ago

@stweil Hi, the scans are similar to these ones: https://www.unicaen.fr/bvmsm/img-viewer/IMG/IMPR/BM/A/viewer.html?name=A1_5.jpg https://www.unicaen.fr/bvmsm/cdc.html

Is it possible to train, for example, a "Fraktur model" on these scans, and will it work? Or do the scans need to be labeled? My dataset has no labels, which is why I asked whether ocropus can create the labels itself.

stweil commented 5 years ago

Your first example works partially with Tesseract and no special training:

$ tesseract A1_5.jpg - -c user_defined_dpi=300 -l script/Latin --psm 6
F
Ea tific ftudiofe lecto? fit i6 libi Fronicarum per
8 B viam epitbomatis z breuiarj Compila opus jde
 preclarum.z a doctiflimo quoa comparatidum.£ ontinet
ein gefta. queciqs O1gmo:a funt NOtgtu ab initio mid ad
banc vígs tepos softri calamitatem, L aftigattigza virs
doctiffimisvt magis elabdzatum tn lucem prodiret. Adin
— mut guten T preces P ciii Sebald Schreyer
Sebaran kamermaili bunc libum domnus Antho
niys koberger Muren pergeimpzeffit. Adhibitis tamé vé.
ris matbematicis pingendiqs arte peritiffimis, 191chaele
~ wolgemurerwilbelmo Pleydemwurff.quarii fotera anv
 Siflmagsanimaduerfore mm auam a
“prrozumfigureinferte funt. £onfummatu gutci) oyode”
sia menfis 3ulý Zinno BIOS KÉCtA93
E S A sa i n

$ tesseract A1_5.jpg - -c user_defined_dpi=300 -l script/Fraktur --psm 6
F
AE unc ?kudio?e lecto? fit 1B libti Erotiicattim ber
SE Y viam epitbomatis 7 bzeuiari conpuilati opus de
pzeclarum.7zadoctiflimo quoqs comparatidum.Continet
cm ge?ta. quecücz digmoza ?unt notatu ab initiomsdi a>
anc vlgzs tepozis yzo?tri calámitatem, La?tigattiqza viris
docti??imisvt magis elabózatum in lucem prodiret. Adin
tui gutem 7?Þpzeces PELL ciu SebaldiSchreyer
_7Seba?tiatni kamermaill bunclibz2um dominus Antbo
niius koberger urew bergeimpzeflit. Adhibitis tamê vé
ris matbematicis pingendiaz arte peritif?imis, Michaele
_wolgemuter wilbelmo {>leydenwurff, quaru lor
ArlffimaganimadueNione rum A em 0yodec:
i “pirozumfigureinferte ?unt,Lon?ummatu gute dyodec-
“inamen?isJulj,Anno ?aluti nfl
EE ORE C a ut tE, 2A
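
To compare the two models on more than one page, the two calls above could be wrapped in a small script. This is only a sketch: the helper names `build_tesseract_cmd` and `run_models` are made up for illustration, and it assumes `tesseract` is on the PATH with the `script/Latin` and `script/Fraktur` models installed.

```python
import subprocess


def build_tesseract_cmd(image, lang, dpi=300, psm=6):
    """Build the same argument list as the shell commands above (hypothetical helper)."""
    return [
        "tesseract", image, "-",        # "-" sends the recognized text to stdout
        "-c", f"user_defined_dpi={dpi}",
        "-l", lang,
        "--psm", str(psm),
    ]


def run_models(image, langs=("script/Latin", "script/Fraktur")):
    """Run each model on one image and return {model name: recognized text}."""
    results = {}
    for lang in langs:
        proc = subprocess.run(
            build_tesseract_cmd(image, lang),
            capture_output=True, text=True, check=True,
        )
        results[lang] = proc.stdout
    return results
```

Calling `run_models("A1_5.jpg")` should then give both transcriptions in one dictionary, so they can be compared side by side.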

y0un35 commented 5 years ago

@stweil It works without training! So someone else already did the job. Thanks. But what about ocropus? I'd like to try it too. How can I do that? Also, is there a function in ocropus that shows me the accuracy and the errors, and also plots them?

thanks again

zuphilip commented 5 years ago

There is some documentation in the ocropus wiki at https://github.com/tmbdev/ocropy/wiki, along with further links that cover training. For accuracy, there are tools described at https://github.com/tmbdev/ocropy/wiki/Compute-errors-and-confusions, but they will not plot a graph automatically.
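
To illustrate what "errors and confusions" means here, a stand-alone sketch in plain Python follows. It is not the ocropy tooling itself; the function names `edit_ops` and `char_error_rate` are made up for this example. It assumes you have a manually transcribed line (ground truth) and the OCR output for the same line.

```python
from collections import Counter


def edit_ops(truth, ocr):
    """Return (edit distance, Counter of (truth_char, ocr_char) confusions)."""
    n, m = len(truth), len(ocr)
    # dist[i][j] = edit distance between truth[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = truth[i - 1] != ocr[j - 1]
            dist[i][j] = min(dist[i - 1][j] + 1,         # char missing in OCR
                             dist[i][j - 1] + 1,         # extra char in OCR
                             dist[i - 1][j - 1] + cost)  # match or substitution
    # Walk back through the table to collect the individual errors.
    confusions = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1]):
            if truth[i - 1] != ocr[j - 1]:
                confusions[(truth[i - 1], ocr[j - 1])] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            confusions[(truth[i - 1], "")] += 1
            i -= 1
        else:
            confusions[("", ocr[j - 1])] += 1
            j -= 1
    return dist[n][m], confusions


def char_error_rate(truth, ocr):
    """Edit distance divided by the length of the ground truth."""
    errors, _ = edit_ops(truth, ocr)
    return errors / max(len(truth), 1)
```

For example, `edit_ops("preclarum", "pzeclarum")` reports one error, the confusion of `r` with `z`. The resulting counts could be passed to any plotting library if a graph is wanted.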

kba commented 5 years ago

@y0un35 I think you've received some pointers and links, feel free to open a new issue for further questions.