Download Latin.unicharset along with radical-stroke.txt

tesseract-ocr / tesstrain

Train Tesseract LSTM with make

Apache License 2.0


Download Latin.unicharset along with radical-stroke.txt #219

Closed. Shreeshrii closed this issue 3 years ago

Shreeshrii commented 3 years ago

We need another PR to add Inherited.unicharset after https://github.com/tesseract-ocr/langdata_lstm/pull/41/ is merged.

stweil commented 3 years ago

All unicharset files for scripts are potentially needed, starting with Arabic.unicharset and ending with Thai.unicharset.

I usually get the required ones to satisfy the error message(s), but still don't know what happens if they are missing.

Shreeshrii commented 3 years ago

I added only the Latin and Inherited unicharsets to this list because they are required in almost all cases, even though a missing one does not stop processing the way a missing radical-stroke.txt does.

We could add another optional variable for SCRIPT_UNICHARSET, downloading it when it is non-blank.
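As an illustration of the optional-variable idea, here is a minimal shell sketch. It is not code from tesstrain: the function names are hypothetical, and the base URL is copied from the wget call quoted later in this thread, so treat it as an assumption. A blank script name is simply a no-op.

```shell
#!/bin/sh
# Assumed base URL, taken from the wget call quoted later in this thread.
LANGDATA_URL="https://github.com/tesseract-ocr/langdata/raw/master"

# script_unicharset_url SCRIPT: print the download URL for SCRIPT.unicharset.
# (Hypothetical helper, introduced only for this sketch.)
script_unicharset_url() {
    printf '%s/%s.unicharset\n' "$LANGDATA_URL" "$1"
}

# fetch_script_unicharset SCRIPT DATA_DIR: download SCRIPT.unicharset into
# DATA_DIR, but only when SCRIPT is non-blank.
fetch_script_unicharset() {
    script="$1"
    data_dir="$2"
    # Blank script name: nothing to download, and not an error.
    [ -n "$script" ] || return 0
    wget -O "$data_dir/$script.unicharset" "$(script_unicharset_url "$script")"
}
```

It could be called as `fetch_script_unicharset "$SCRIPT_UNICHARSET" "$DATA_DIR"`, so that leaving SCRIPT_UNICHARSET unset skips the download.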

"still don't know what happens if they are missing."

I think some characters, e.g. Arabic accents, get dropped from the unicharset generated by unicharset_extractor. That was the reason I built Inherited.unicharset.

stweil commented 3 years ago

A list of all required *.unicharset files can be extracted from unicharset:

sed s/.*0,0,0.// $(OUTPUT_DIR)/unicharset | sed 's/ .*//' | sort | uniq | grep "^[A-Z][a-z][a-z]*" | grep -v common
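For use outside the Makefile, the same pipeline can be wrapped in a small shell function. This is only a sketch: the function name `list_scripts` is hypothetical, and, like the command above, it assumes that the metrics field on each unicharset line ends in `0,0,0` and is followed directly by the script name. It also drops the names Common, Inherited and Joined by exact match, which the case-sensitive `grep -v common` would not do.

```shell
#!/bin/sh
# list_scripts FILE: print the script names referenced in a unicharset file,
# one per line, sorted and without duplicates.
# Assumption: the metrics field ends in "0,0,0" and the script name follows it.
list_scripts() {
    sed 's/.*0,0,0.//' "$1" |
        sed 's/ .*//' |
        sort -u |
        grep '^[A-Z][a-z][a-z]*' |
        grep -v -x -e Common -e Inherited -e Joined
}
```

A call such as `list_scripts "$OUTPUT_DIR/unicharset"` would then print one script name per line.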

Shreeshrii commented 3 years ago

Thanks for the suggestions @stweil and the hint to get the list of required unicharsets from $(OUTPUT_DIR)/unicharset.

I am having a hard time putting this together as a separate Makefile target that uses the list. I would appreciate it if you could make the required change.

Here is what I have tried so far:

SCRIPT_NAMES := $(shell cat $(OUTPUT_DIR)/unicharset | sed s/.*0,0,0.// | sed 's/ .*//' | sort | uniq | grep "^[A-Z][a-z][a-z]*" | grep -v common | sed '/Common/d' | sed '/Inherited/d' | sed '/Joined/d')
SCRIPT_UNICHARSETS = $(foreach script,$(SCRIPT_NAMES),$(script).unicharset)
scriptunicharsets: $(SCRIPT_UNICHARSETS)
$(DATA_DIR)/%.unicharset:%.unicharset
    echo $@
    wget -O $@ 'https://github.com/tesseract-ocr/langdata/raw/master/$@'
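One likely reason the attempt above does not work: the names in SCRIPT_UNICHARSETS carry no $(DATA_DIR)/ prefix, so they never match the pattern rule's target, and that rule also lists %.unicharset as a prerequisite that nothing creates. In addition, $@ in the URL would include the directory part. The intended behaviour can be sketched in plain shell instead. The function names below are hypothetical, and the base URL is taken from the wget call above.

```shell
#!/bin/sh
# Assumed base URL, taken from the wget call in the attempt above.
LANGDATA_URL="https://github.com/tesseract-ocr/langdata/raw/master"

# plan_downloads DATA_DIR: read script names on stdin (one per line) and print
# "URL<TAB>DEST" for every unicharset not yet present in DATA_DIR.
plan_downloads() {
    data_dir="$1"
    while IFS= read -r script; do
        [ -n "$script" ] || continue
        dest="$data_dir/$script.unicharset"
        # Skip files that were already downloaded.
        [ -e "$dest" ] || printf '%s\t%s\n' "$LANGDATA_URL/$script.unicharset" "$dest"
    done
}

# download_planned: read "URL<TAB>DEST" lines on stdin and fetch each one.
download_planned() {
    while IFS="$(printf '\t')" read -r url dest; do
        wget -O "$dest" "$url"
    done
}
```

Feeding a list of script names, one per line, through `plan_downloads "$DATA_DIR" | download_planned` would fetch only the missing files. Keeping the planning step separate makes it possible to inspect what would be downloaded before any network access happens.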

wrznr commented 3 years ago

@kba Could you please have a look at the change request and maybe come up with a proposal?

Shreeshrii commented 3 years ago

I added sed '/Common/d' | sed '/Inherited/d' | sed '/Joined/d' to the command suggested by @stweil because there are no unicharsets for Common and Inherited. Joined was being picked up accidentally.

A simpler way may be to ask the user to specify a script and download only that one.

Shreeshrii commented 3 years ago

"A simpler way may be to ask the user to specify a script and download that."

I have tried that in the new Makefile-font2model. I think that is a much cleaner way of doing this.

Shreeshrii commented 3 years ago

Included as part of https://github.com/tesseract-ocr/tesstrain/pull/230