tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

explicate .lstm-unicharset and my.unicharset prereqs for finetuning #260

Closed: bertsky closed this pull request 1 year ago

bertsky commented 3 years ago

(because training fails if a .unicharset has already been created previously, but for a different START_MODEL)

bertsky commented 3 years ago

Wait. What if the original code did not target *.lstm-unicharset, but *.unicharset (or both)?

bertsky commented 3 years ago

> Wait. What if the original code did not target *.lstm-unicharset, but *.unicharset (or both)?

Not relevant: we can only fine-tune from LSTM models, not (purely) Omnifont models.

bertsky commented 3 years ago

But there is an additional, previously unnoticed issue: the combine_lang_model recipe for PROTO_MODEL expects to find a $(DATA_DIR)/*.unicharset file for every script that get_script_from_script_id yields for the unicharset table, i.e. {Common,Latin,Greek,Cyrillic,Hebrew}.unicharset, plus an obscure Inherited.unicharset. But the current makefile (both master and the PR version) fails to provide these!

That leads to warnings like the following:

Failed to load script unicharset from:/home/kmw/nfs/gt-rücklauf/Latin.unicharset
Warning: properties incomplete for index 3 = M
Warning: properties incomplete for index 4 = A
Warning: properties incomplete for index 5 = T
Warning: properties incomplete for index 6 = I
Warning: properties incomplete for index 7 = O
Warning: properties incomplete for index 8 = ,
...

I do not know whether this is harmful, but we should try to explicate all rules necessary to put these files into $(DATA_DIR).

bertsky commented 3 years ago

I have no idea how to generate these files (except extracting from their respective script models).

@stweil, your published data directories do contain such files – did you put them there by hand, or could they come from some old tesstrain_utils.sh intermediates?

bertsky commented 3 years ago

Perhaps we are missing the original set_unicharset_properties rule, which enriches the generated unicharset for the model?

stweil commented 3 years ago

> I have no idea how to generate these files (except extracting from their respective script models).
>
> @stweil, your published data directories do contain such files – did you put them there by hand, or could they come from some old tesstrain_utils.sh intermediates?

I copied them from https://github.com/tesseract-ocr/langdata_lstm (or used local symbolic links to a local copy of that repository). That fixes most warnings (all but Inherited.unicharset).

bertsky commented 3 years ago

> I copied them from https://github.com/tesseract-ocr/langdata_lstm (or used local symbolic links to a local copy of that repository). That fixes most warnings (all but Inherited.unicharset).

Oh, I see! But how could that have been forgotten in ocrd-train / tesstrain? Should we simply document this requirement, or fix this automatically by including a subrepo?

stweil commented 3 years ago

langdata_lstm is not a small repository, so I don't like the idea of having it as a subrepository.

Documenting the requirement could be a first step. Parsing the unicharset to find out which scripts are required and fetching the related files from the web if they are missing locally would be the better solution.
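The parsing step suggested here could look roughly like the following Python sketch. It is only an illustration, not what tesstrain does: the helper name `required_scripts` is hypothetical, the sample data is made up, and the assumed unicharset line layout (unichar, properties, an optional comma-separated metrics field, then the script name) should be checked against real files before relying on it.

```python
# Hypothetical helper: list the script names a unicharset refers to.
# Assumed line layout (an assumption, not confirmed in this thread):
#   <unichar> <properties> [<comma-separated metrics>] <script> ...
# The first line of the file holds only the number of entries.

def required_scripts(unicharset_text):
    scripts = set()
    lines = unicharset_text.splitlines()
    for line in lines[1:]:              # skip the entry count
        fields = line.split()
        if len(fields) < 3:
            continue
        script = fields[2]
        if "," in script:               # metrics field present, script follows it
            if len(fields) < 4:
                continue
            script = fields[3]
        if script != "NULL":            # placeholder used when no script is set
            scripts.add(script)
    return scripts


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """\
5
NULL 0 Common 0
M 5 0,255,0,255,0,0,0,0,0,0 Latin 3 0 3 M
, 10 0,255,0,255,0,0,0,0,0,0 Common 4 6 4 ,
α 3 0,255,0,255,0,0,0,0,0,0 Greek 5 0 5 α
я 3 Cyrillic 6 0 6 я
"""

needed = sorted(required_scripts(SAMPLE))
print([name + ".unicharset" for name in needed])
```

The resulting names are the files that would have to exist in the script directory, or be fetched if they are missing.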

bertsky commented 3 years ago

> langdata_lstm is not a small repository, so I don't like the idea of having it as a subrepository.
>
> Documenting the requirement could be a first step. Parsing the unicharset to find out which scripts are required and fetching the related files from the web if they are missing locally would be the better solution.

Agreed. But perhaps we could live without the extra effort of parsing the exact requirements, since the unicharset files themselves are quite small.

Since there's already a wget of https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt (and of tessdata_best|fast/eng.traineddata), I opt for a fully automatic solution based on downloads and will add a commit here (or in a new PR?).
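As a rough illustration of the download-based approach (written in Python rather than as the actual make rules), the sketch below checks which files are missing and fetches only those. The function names and the `WANTED` list are hypothetical, and the URL pattern for the `*.unicharset` files is an assumption by analogy with the radical-stroke.txt URL above. Inherited.unicharset is left out because, per the earlier comment, copying from langdata_lstm did not resolve that warning.

```python
import urllib.request
from pathlib import Path

# Assumption: the per-script files sit next to radical-stroke.txt in langdata_lstm.
BASE_URL = "https://github.com/tesseract-ocr/langdata_lstm/raw/master/"

# File names taken from this thread; the list is illustrative, not exhaustive.
WANTED = ["radical-stroke.txt"] + [
    script + ".unicharset"
    for script in ("Common", "Latin", "Greek", "Cyrillic", "Hebrew")
]


def missing_files(target_dir, wanted=WANTED):
    """Return the wanted file names that do not yet exist in target_dir."""
    target = Path(target_dir)
    return [name for name in wanted if not (target / name).is_file()]


def fetch_missing(target_dir, wanted=WANTED):
    """Download every missing file into target_dir (needs network access)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name in missing_files(target, wanted):
        urllib.request.urlretrieve(BASE_URL + name, str(target / name))
```

Calling `fetch_missing("data/langdata")` (the path is only an example) would leave existing files untouched, which mirrors how a make rule only rebuilds missing targets.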

bertsky commented 3 years ago

Done. Please re-review!

bertsky commented 3 years ago

> Done. Please re-review!

Or should we place all *.unicharset and radical-stroke.txt into a subdirectory langdata to keep DATA_DIR tidy? (Would only need to change the script_dir argument ...)

bertsky commented 3 years ago

> Or should we place all *.unicharset and radical-stroke.txt into a subdirectory langdata to keep DATA_DIR tidy? (Would only need to change the script_dir argument ...)

Let's do this! That way, if someone already had the complete https://github.com/tesseract-ocr/langdata checked out locally, one could simply copy/symlink it here, or point the LANGDATA_DIR to the right spot. And all these *.unicharset do look quite messy lying about in DATA_DIR...

bertsky commented 3 years ago

Done. I have also updated from master to manually resolve the conflict, and added two minor improvements to the rules for all-gt / all-lstmf.

bertsky commented 3 years ago

The all-lstmf / all-gt speedups (which avoid repeating find) had some additional fallout: with large directories, the paste recipe would quickly fail with E2BIG ("Argument list too long"), because the expanded list of file names exceeds the operating system's limit on command-line size. This is a long-standing, nasty problem with make, for which the only workaround seems to be make's file function, which I managed to apply here.
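The general idea behind that workaround is to write the long list of names into a file and let the tool read the file, instead of putting every name on the command line. The Python sketch below shows the same idea outside make; it is not make's file function and not the recipe from this PR, and the child command is a stand-in that merely counts the names it is given.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the real tool: reads names from a list file, one per line,
# and prints how many it found.
CHILD = "import sys; print(sum(1 for _ in open(sys.argv[1], encoding='utf-8')))"


def count_names_via_list_file(names):
    """Hand a (possibly huge) list of names to a child process through a file."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".list", delete=False, encoding="utf-8"
    ) as handle:
        handle.writelines(name + "\n" for name in names)
        list_path = handle.name
    try:
        # Only the short path of the list file ends up on the command line.
        result = subprocess.run(
            [sys.executable, "-c", CHILD, list_path],
            capture_output=True, text=True, check=True,
        )
        return int(result.stdout)
    finally:
        os.unlink(list_path)


# Hypothetical file names; a list this long would be risky to pass as arguments.
names = ["line-%06d.gt.txt" % i for i in range(200000)]
print(count_names_via_list_file(names))
```

Because the command line stays short no matter how many names there are, its length no longer depends on the size of the directory.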

Also added a new target charfreq, showing the character histogram of all .gt.txt files.
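For readers who want the same information outside make, a character histogram over all .gt.txt files can be sketched in a few lines of Python. This is not the charfreq recipe itself; the function names are hypothetical, and skipping line breaks is a choice made for this sketch.

```python
from collections import Counter
from pathlib import Path


def char_histogram(gt_dir):
    """Count every character in all *.gt.txt files below gt_dir."""
    counts = Counter()
    for path in sorted(Path(gt_dir).rglob("*.gt.txt")):
        text = path.read_text(encoding="utf-8")
        # Line breaks are not part of the transcription, so skip them.
        counts.update(ch for ch in text if ch not in "\r\n")
    return counts


def print_histogram(counts):
    """Print one line per character, most frequent first."""
    for char, count in counts.most_common():
        print(f"{count}\t{char!r}")
```

Running `print_histogram(char_histogram("data/my-ground-truth"))` (the directory name is only an example) gives a quick view of rare or unexpected characters in the ground truth.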

stweil commented 3 years ago

@bertsky, it would help me a lot if you could make separate pull requests for your commits instead of adding more and more commits to this one. That also increases the chance that the pull requests can be reviewed and merged in time.

bertsky commented 2 years ago

> it would help me a lot if you could make separate pull requests for your commits instead of adding more and more commits to this one. That also increases the chance that the pull requests can be reviewed and merged in time.

As already explained above, the commits after your first review are all necessary (like the manual merge against conflicts in upstream master) and related (except for the very last commit). They are also all trivial.

I cannot see how splitting this PR up could improve anyone's productivity.

bertsky commented 2 years ago

@stweil this needs to be merged – please review

bertsky commented 1 year ago

This includes essential fixes and has been hanging here for over a year for no reason. Any objections to merging?