tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Create custom fonts and inform tesseract 4 about their existence : format for listing not specified #1672

Closed srdg closed 6 years ago

srdg commented 6 years ago

Environment

  • Tesseract Version: 4.0.0-beta.1-370-g8b64
  • Platform: Ubuntu 16.04

Current Behavior:

The documentation for fonts (https://github.com/tesseract-ocr/tesseract/wiki/Fonts) says

The required fonts are defined in training/language-specific.sh. Many more fonts are listed in langdata/font_properties. If you add fonts to the first file (or specify them explicitly via command line parameter), you must add them to the second as well.

I've got a folder full of font files. The idea is to treat handwritten text by each different person as a unique font and check how tesseract 4 performs on that. So according to the docs, I have to list them in langdata/font_properties, but neither the README nor the file itself specifies the format in which a custom font should be listed, and the existing entries clearly follow some format. What do I do? @Shreeshrii @theraysmith any help would be appreciated, thanks!

Shreeshrii commented 6 years ago

font_properties is required only for tesseract3.

Tesseract4 will get the info from the font files themselves, using Pango.

srdg commented 6 years ago

@Shreeshrii just so this is clear,

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

with --fonts_dir /path/to/customfonts set to the directory where my custom font files are would do the job, right?

Shreeshrii commented 6 years ago

I usually do a findfonts step with my training_text to find the correct font names to be used.

#!/bin/bash

nice text2image --find_fonts \
--fonts_dir ./.fonts \
--text ./langdata/san/san.training_text \
--min_coverage 0.995 \
--render_per_font=false \
--outputbase ./langdata/san/san \
|& grep raw \
 | sed -e 's/ :.*/@ \\/g' \
 | sed -e "s/^/  '/" \
 | sed -e "s/@/'/g" > ./langdata/san/san.fontslist.txt

and then I use this fontlist with the training command

tesstrain.sh \
  --lang $Lang \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --fonts_dir $fonts_dir \
  --fontlist $fonts_for_training \
  --langdata_dir $langdata_dir \
  --tessdata_dir $tessdata_dir \
  --training_text $langdata_dir/$Lang/$Lang.finetune.training_text \
  --output_dir $train_output_dir
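
To see what the grep/sed filters in the find_fonts script produce, here is a small sketch that runs the same filters on a single sample line. The sample line is only an assumption about what text2image --find_fonts prints (the filters care only about the word "raw" and the " :" separator), and format_fontlist is a hypothetical helper name.

```shell
# Same filters as in the find_fonts script, wrapped in a helper for testing.
format_fontlist() {
  grep raw \
    | sed -e 's/ :.*/@ \\/g' \
    | sed -e "s/^/  '/" \
    | sed -e "s/@/'/g"
}

# Assumed shape of one line of --find_fonts output (illustrative numbers).
sample='FreeSerif Italic : 1234 hits = 99.81%, raw = 1230 = 100.00%'

# Keep the font name, wrap it in single quotes, append a line continuation.
printf '%s\n' "$sample" | format_fontlist
# prints:   'FreeSerif Italic' \
```

Each output line is a quoted font name followed by a backslash, so the resulting file can be pasted into a multi-line list of font names.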

srdg commented 6 years ago

@Shreeshrii I've already cloned the langdata repo and there are no files like san.iast.training_text; only one san.training_text file is present in the san directory. The same goes for the other subdirectories. Will only one training_text file suffice?

amitdo commented 6 years ago

Tesseract4 will get the info from the font files themselves, using Pango.

Pango supports fonts in TrueType and OpenType formats.

srdg commented 6 years ago

So, doing this is supposed to work? @amitdo

Shreeshrii commented 6 years ago

@srdg I use different training texts for different purposes. A smaller one for evaluation or finetuning. Larger one for adding a layer.

Follow the training tutorial. Test with one font and a small text. Once you understand the process, then expand.

srdg commented 6 years ago

@Shreeshrii I'm trying to follow the tutorial but it's all too confusing for me. Please clarify something for me: Tesseract 4 uses tesstrain.sh to generate box/tiff/lstmf files, of which the lstmf files are stored and used for training. In the command-line arguments to tesstrain.sh you have to specify --fonts_dir, so does this mean that I pass the font, and the training text, tesseract uses this font and the training text to render the .tiff image, which it uses in turn for training itself? Is that right? In case it is, is there any way in which I could directly pass the image and the box file that tesseract4 can use to train itself?

Shreeshrii commented 6 years ago

is there any way in which I could directly pass the image and the box file that tesseract4 can use to train itself?

  1. The format of box file is different from tesseract3 or the one created by using makebox. You need a tab character to mark end of line. You need a box for spaces between words.

  2. You can use the config file lstm.train to take existing box/tiff pairs (in tesseract4 format as specified above) and create lstmf files for the box/tiff files.

OR

  1. You can convert your images to one line per image and create a matching ground truth file. You can then use ocr-d/train project to create appropriate box files, then lstmf files, do lstmtraining and create a traineddata file.

Please note that the small amount of training text used for tesseract3 is NOT enough for training tesseract4. You could probably finetune an existing model with that.

There is NO official supported method for training tesseract 4 from box/tiffs.

Experiment and see what works for your case.
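
As a rough illustration of the two box-file requirements listed above (a box for each space, a tab to mark the end of a text line), the sketch below writes a tiny box file for the text "a b" and counts the end-of-line markers. The coordinates are invented, the column layout (symbol, left, bottom, right, top, page) is assumed from the usual Tesseract box format, and the temporary file stands in for a real .box file.

```shell
box_file=$(mktemp)

# One box per symbol: <symbol> <left> <bottom> <right> <top> <page>.
# All coordinates are made up for illustration.
{
  printf 'a 10 20 30 50 0\n'
  printf '  30 20 40 50 0\n'    # box whose symbol is a space (gap between words)
  printf 'b 40 20 60 50 0\n'
  printf '\t 60 20 61 50 0\n'   # box whose symbol is a tab (end of the text line)
} > "$box_file"

# Count text lines by counting the boxes that start with a tab character.
tab=$(printf '\t')
line_count=$(grep -c "^${tab}" "$box_file")
echo "text lines in box file: $line_count"
```

A check like this can help spot box files made with the older tools, which would contain no tab or space boxes at all.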

srdg commented 6 years ago

does this mean that I pass the font, and the training text, tesseract uses this font and the training text to render the .tiff image, which it uses in turn for training itself? Is that right?

@Shreeshrii Please confirm this is how tesseract 4 is working.

Shreeshrii commented 6 years ago

See https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain_utils.sh#L404

That is where the box/tiff pairs are processed with lstm.train to create lstmf files for tesseract4. For tesseract3 the config file is box.train.

You can see tesstrain.sh for the different variables being passed.

https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain.sh#L64

Previously the box/tiff files were kept in the tmp directory only. However, now we are also moving them along with the lstmf files; see https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain_utils.sh#L525

They are not used in any further processing but are there for reference.

srdg commented 6 years ago

@Shreeshrii it seems like this part of tesstrain_utils uses only the .tif images and the box files. What would happen if I don't pass --fonts_dir? Is it a required argument?

srdg commented 6 years ago

Okay, I got the format from documentation one of my seniors wrote: [fontname] italic bold fixed serif fraktur, with 0 or 1 for each attribute of the font.

@Shreeshrii please confirm this and I'll close the issue.

Shreeshrii commented 6 years ago

https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05#the-font_properties-file

This was required for 3.0x. It is not required for 4.0, as Pango can generate the required info from ttf or otf files.
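
For Tesseract 3.0x only, the layout described above can be sketched as one line per font: a font name without spaces, followed by five 0/1 flags. The helper name font_properties_line and the sample font are hypothetical and used only for illustration.

```shell
# Print one font_properties entry:
#   <fontname> <italic> <bold> <fixed> <serif> <fraktur>
font_properties_line() {
  # $1 = font name (no spaces), $2..$6 = 0 or 1 for each attribute
  printf '%s %d %d %d %d %d\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# A hypothetical italic serif font: italic=1, bold=0, fixed=0, serif=1, fraktur=0.
font_properties_line timesitalic 1 0 0 1 0
# prints: timesitalic 1 0 0 1 0
```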

srdg commented 6 years ago

Okay, got it! Closing the issue.

srdg commented 6 years ago

@Shreeshrii @amitdo I'm trying to follow the training procedure using tesstrain.sh and the following error pops up.

=== Starting training for language 'eng'
[Thu Jul 12 15:31:24 IST 2018] /usr/local/bin/text2image --fonts_dir=./Fonts --font=Arial Bold --outputbase=/tmp/font_tmp.8e0UzBodL3/sample_text.txt --text=/tmp/font_tmp.8e0UzBodL3/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.8e0UzBodL3
Could not find font named Arial Bold.
Pango suggested font Amatic SC Bold.
Please correct --font arg.

=== Phase I: Generating training images ===
Rendering using Arial Bold
[Thu Jul 12 15:31:25 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.8e0UzBodL3 --fonts_dir=./Fonts --strip_unrenderable_words --leading=32 --xsize 2560 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.iMz6RLKTMH/eng/eng.Arial_Bold.exp0 --max_pages=0 --font=Arial Bold --text=/home/soumik/langdata/eng/eng.training_text
Could not find font named Arial Bold.
Pango suggested font Amatic SC Bold.
Please correct --font arg.
ERROR: /tmp/tmp.iMz6RLKTMH/eng/eng.Arial_Bold.exp0.box does not exist or is not readable
Rendering using Arial Bold Italic
[Thu Jul 12 15:31:26 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.8e0UzBodL3 --fonts_dir=./Fonts --strip_unrenderable_words --leading=32 --xsize 2560 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.iMz6RLKTMH/eng/eng.Arial_Bold_Italic.exp0 --max_pages=0 --font=Arial Bold Italic --text=/home/soumik/langdata/eng/eng.training_text
Could not find font named Arial Bold Italic.
Pango suggested font Fondamento Italic.
Please correct --font arg.

I made sure to list the fonts in font_properties and language-specific.sh, and rebuilt tesseract after that. What am I missing here?

Note: The first font listed is supposed to be Aladin-Regular, and it is so in the aforementioned two files, but somehow tesseract is not detecting it. The fonts are installed on my system too.

Shreeshrii commented 6 years ago

Because you are specifying fonts with

--font=Arial Bold

srdg commented 6 years ago

@Shreeshrii I modified the command to

tesstrain.sh --lang eng --linedata_only --noextract_font_properties --exposures "0" --langdata_dir ~/langdata --tessdata_dir ~/tesseract/tessdata/ --output_dir ~/tesstutorial/engoutput

after moving the .ttf files of the fonts I wanted to train tesseract on to the directory /usr/share/fonts/truetype/msttfonts, since it seemed the fonts tesseract was training on were located there. This time tesseract generated the lstmf files, but not for the fonts I wanted. Is anything else required?

Shreeshrii commented 6 years ago

When you don't specify fonts here, it will use the list of fonts specified in language-specific.sh for the given language code.

Give the full path for tesstrain.sh to make sure you know which version is being used.
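
One way to make the font choice explicit, in line with the advice above, is to pass both --fonts_dir and --fontlist and to call tesstrain.sh by its full path. The sketch below only prints the command instead of running it; the tesstrain.sh path, the directories, and the two font names are placeholders, not values confirmed in this thread.

```shell
# Placeholder paths: adjust to your own install and font folder.
tesstrain=/usr/local/bin/tesstrain.sh
fonts_dir=./Fonts

# Build the argument list. Each font name is one quoted argument,
# so a name containing spaces stays in one piece.
set -- \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --fonts_dir "$fonts_dir" \
  --fontlist "Aladin" "Amatic SC Bold" \
  --langdata_dir "$HOME/langdata" \
  --tessdata_dir "$HOME/tesseract/tessdata" \
  --output_dir "$HOME/tesstutorial/engoutput"

# Dry run: show one argument per line. To start training for real,
# replace this loop with:  "$tesstrain" "$@"
echo "$tesstrain"
for arg in "$@"; do
  printf '  [%s]\n' "$arg"
done
```

Printing the arguments in brackets makes it easy to confirm that a multi-word font name reaches the script as a single argument.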

Tara-E commented 5 years ago

Just to double check - for Tesseract 4 I don't need to add any fonts to language-specific.sh, right? I can just pass in the whole list of fonts from the findtext step, and the training will use them, even if they are not listed in language-specific.sh?

Shreeshrii commented 5 years ago

Yes, that is correct. You can specify the font list with tesstrain.sh.

astutejoe commented 5 years ago

I tried creating a video tutorial to help those in need: https://www.youtube.com/watch?v=TpD76k2HYms

Idolized22 commented 5 years ago

Hello @Shreeshrii

I am trying to train several custom fonts but I can't figure out what the value of the variable fonts_for_training should be. I am new to Linux and bash scripting. Thanks a lot in advance.

Shreeshrii commented 5 years ago

You can give the font names in a list like this as part of the command.

--fontlist "FreeSerif Italic" "iast_italic" "Lucida Calligraphy Italic" "Times New Roman, Italic"

Idolized22 commented 5 years ago

@Shreeshrii
Thanks for the quick reply. I have more than 100 fonts, so I need a way to load them into a list without just pasting them into the file, as that crashed every time I tried it. Currently I am getting the following error:

Could not find font named Adobe.
Pango suggested font FreeMono.
Please correct --font arg.
ERROR: Program text2image failed. Abort.
Done Creating Train Data
Changed Directory Back into inital working dir

and the first font name in my fonts file is: Adobe Blank

and I read the file using:
FontList=$(path_to_fonts_list.txt)
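
A possible reason for "Could not find font named Adobe" is shell word splitting: $(path_to_fonts_list.txt) tries to run the file as a command rather than read it, and an unquoted variable would split a name such as "Adobe Blank" at the space. A minimal sketch of reading the names safely, assuming the list file holds one plain font name per line (the temporary file below is a stand-in for the real list):

```shell
# Stand-in for the real list: one font name per line, no quotes or backslashes.
fonts_file=$(mktemp)
printf '%s\n' 'FreeSerif Italic' 'Lucida Calligraphy Italic' 'Times New Roman, Italic' > "$fonts_file"

# Collect the names as separate arguments; the quoting keeps
# "FreeSerif Italic" together as a single argument.
set --
while IFS= read -r font; do
  if [ -n "$font" ]; then
    set -- "$@" "$font"
  fi
done < "$fonts_file"

echo "loaded $# fonts; first is: $1"
# The names could then be passed on as:  tesstrain.sh ... --fontlist "$@" ...
```

In bash, reading the file into an array with mapfile -t and expanding it as "${array[@]}" would do the same job.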

Tara-E commented 5 years ago

Is it possible to use fine tuning to add a new font, but still keep all of the original fonts that were used in the out-of-the-box langdata without retraining those? Or do I need to download all of those original fonts?

Shreeshrii commented 5 years ago

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Shreeshrii commented 5 years ago

first name of the font in my fonts file is: Adobe Blank

Please delete that line. There is no font by that name.