tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Support additional line image formats (*.png, *.bin.png, *.nrm.png) #117

Closed: stweil closed this 4 years ago

stweil commented 4 years ago

Tesseract can handle most common image formats, while the Makefile supported only *.tif.

Line images from other Open Source OCR software use .bin.png or .nrm.png. The Makefile now handles these, too, so it is no longer necessary to convert them to *.tif.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Shreeshrii commented 4 years ago

This should not be closed.

I suggest that the makefile support tif, png and the suggested (.bin.png, .nrm.png)

wrznr commented 4 years ago

@Shreeshrii @stweil Would you agree to have another variable EXT?

(env) $ PYTHONIOENCODING="UTF-8" make -j 4 training MODEL_NAME=htr TESSDATA=/usr/local/share/tessdata/ PSM=13 MAX_ITERATIONS=10000 EXT=.bin.png

Shreeshrii commented 4 years ago

Please see the earlier comments by @stweil. I think his use case has image files with several (two?) different extensions.

wrznr commented 4 years ago

@Shreeshrii Right. I forgot. Sorry.

Shreeshrii commented 4 years ago

https://stackoverflow.com/questions/41893115/makefile-select-files-by-extension-from-a-variable/41893264 suggests

apply two filters in one go:

OBJS = $($(SRCS:%.c=%.o):%.cpp=%.o)

Can something similar be used for image extensions?
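
A note on the quoted expression: as far as I can tell, GNU Make would expand the inner reference first and then treat the result as a variable name, so the nested form probably does not chain two substitutions as intended. A minimal sketch of how the same idea could be applied to image extensions with consecutive `patsubst` calls follows. The names LINE_IMAGE_DIR, LINE_IMAGES, BASENAMES and LSTMF_FILES, the default directory, and the mapping of each line image to an .lstmf file are assumptions made for illustration, not taken from the tesstrain Makefile.

```makefile
# Hypothetical directory that holds the line images (assumption).
LINE_IMAGE_DIR ?= data/ground-truth

# *.png also matches *.bin.png and *.nrm.png, so two patterns cover all cases.
LINE_IMAGES := $(wildcard $(LINE_IMAGE_DIR)/*.tif $(LINE_IMAGE_DIR)/*.png)

# Strip the suffixes one after another, longest first, so that
# foo.bin.png becomes foo rather than foo.bin.
BASENAMES := $(LINE_IMAGES)
BASENAMES := $(patsubst %.bin.png,%,$(BASENAMES))
BASENAMES := $(patsubst %.nrm.png,%,$(BASENAMES))
BASENAMES := $(patsubst %.png,%,$(BASENAMES))
BASENAMES := $(patsubst %.tif,%,$(BASENAMES))

# Assumed target naming: one .lstmf file per line image base name.
LSTMF_FILES := $(addsuffix .lstmf,$(BASENAMES))

# Print the result while the Makefile is read; no recipe is needed.
$(info $(LSTMF_FILES))

.PHONY: all
all: ;
```

The order of the `patsubst` calls matters here: if `%.png` were stripped first, `foo.bin.png` would end up as `foo.bin`.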

stweil commented 4 years ago

I updated the PR now to fix merge conflicts.

wrznr commented 4 years ago

We should have a look at the more generic solution offered by @kba.

stweil commented 4 years ago

> We should have a look at the more generic solution offered by @kba.

Up to now I did not manage to get that working.

> Would you agree to have another variable EXT?

The current PR at least handles some common cases. Can we apply it as is and improve it later, for example with a comma-separated list of image extensions like IMAGE_EXTENSIONS=tif,nrm.png,bin.png? I think EXT is too unspecific, and the different image extensions must be clearly separated from each other.
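
Such a list could be parsed roughly as follows. This is only a sketch under assumptions: IMAGE_EXTENSIONS is the name proposed in this comment and not an existing variable, and LINE_IMAGE_DIR and its default value are hypothetical. The `comma` and `space` helper variables follow the idiom described in the GNU Make manual.

```makefile
# Assumption: IMAGE_EXTENSIONS is only a proposed name, not an existing variable.
IMAGE_EXTENSIONS ?= tif,nrm.png,bin.png
# Hypothetical directory that holds the line images.
LINE_IMAGE_DIR ?= data/ground-truth

# Helper variables so that a comma and a space can be passed to $(subst).
comma := ,
empty :=
space := $(empty) $(empty)

# tif,nrm.png,bin.png -> tif nrm.png bin.png
EXTENSIONS := $(subst $(comma),$(space),$(IMAGE_EXTENSIONS))

# Collect the files for every extension. $(sort) also removes duplicates,
# which would occur if both png and bin.png were listed.
LINE_IMAGES := $(sort $(foreach ext,$(EXTENSIONS),$(wildcard $(LINE_IMAGE_DIR)/*.$(ext))))

# Print the result while the Makefile is read; no recipe is needed.
$(info $(LINE_IMAGES))

.PHONY: all
all: ;
```

With this sketch, a call such as `make IMAGE_EXTENSIONS=tif,png` would override the default list from the command line.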

stweil commented 4 years ago

> I suggest that the makefile support tif, png and the suggested (.bin.png, .nrm.png)

The latest commit added support for *.png, so all those patterns are supported now.