tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

1 Text2Image.exe binary please? #396

Closed z0tghvunik closed 8 years ago

z0tghvunik commented 8 years ago

See guys.. I badly need Text2image.exe but i cannot find it anywhere. Is there any great soul in this world who will take the time to compile that 1 thing and upload it to mediafire or something? This one thing has consumed 5 months of my life :'-( I tried to compile it on windows 32 bit but it gave 100+ errors :'-( Dear c/cpp experts.. Instead of telling everybody how to compile, isn't it a good idea to directly provide a compiled version? I am not any intelligent software eng. I am just a normal human being. Why didn't the developers take some time to upload the compiled binaries? :'-( Somebody please help! I now feel pain in my heart for wasting 5 months of my life for 1 program. Somebody please compile 'Text2Image.cpp' for the needy who don't know how to do it.

P.S I have downloaded Tesseract 3.05 but there does not exist any 'text2image.EXE' :'-(

stweil commented 8 years ago
Shreeshrii commented 8 years ago

I have tried this before. Even if you get a binary of text2image for Windows, say with Cygwin or MSYS2, it will crash when you run it. There are some incompatibilities between the code and Windows. Just accept that it is not available on Windows.

As Quan has suggested you can use jtessboxeditor for generating the box tiff pairs and training.

Or get access to a linux machine.


stweil commented 8 years ago

@Shreeshrii, there is a new installer on https://github.com/UB-Mannheim/tesseract/wiki. It includes fixes for text2image.exe. If that binary still crashes, I need all information to reproduce the crash.

Shreeshrii commented 8 years ago

image

text2image --fonts_dir= --text ./langdata/ara.training_text --font Arial --outputbase ara.Arial.exp0

Shreeshrii commented 8 years ago
C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_txt --outputbase=eng.MSSerifBold.exp0 --font='MS Serif Bold' --fonts_dir=
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Could not find font named 'MS.Please correct --font arg.
Shreeshrii commented 8 years ago

image

text2image --fonts_dir= --text ./langdata/san.training_text --outputbase san.exp-1 --ptsize=32 --strip_unrenderable_words --fontconfig_refresh_config_file=false --leading=32 --char_spacing=0.0 --exposure=-1 --find_fonts --min_coverage=.9 --degrade_image=1 --underline_start_prob=.05 --underline_continuation_prob=.01

Shreeshrii commented 8 years ago

@stweil Thank you for providing the updated binary for text2image - many problems have indeed been fixed since I last looked at it. Thanks to the developers.

However, it crashed in two situations today.

  1. When using the font Arial. I tried with eng, ara and san. I am not sure what causes this, since an error is displayed when a font is not found.

text2image --fonts_dir= --text ./langdata/ara.training_text --font Arial --outputbase ara.Arial.exp0

  2. When trying to find fonts and create images for a particular text:

text2image --fonts_dir= --text ./langdata/san.training_text --outputbase san.exp-1 --ptsize=32 --strip_unrenderable_words --fontconfig_refresh_config_file=false --leading=32 --char_spacing=0.0 --exposure=-1 --find_fonts --min_coverage=.9 --degrade_image=1 --underline_start_prob=.05 --underline_continuation_prob=.01

I will test further and post more feedback later.

Shreeshrii commented 8 years ago

Here is a copy of the terminal log with all commands I tried and their output. testlog1.txt

Shreeshrii commented 8 years ago

The following command creates the box-tiff pairs with degradation as well as different exposure levels, as indicated:

text2image --fonts_dir= --text ./langdata/san.training_text --ptsize=32 --degrade_image=1 --leading=32 --char_spacing=0.0 --strip_unrenderable_words --underline_start_prob=.05 --underline_continuation_prob=.01 --font Kokila --outputbase san.Kokila.exp-1 --exposure=-1

I have done this testing on Windows 10.

Shreeshrii commented 8 years ago

san.Kokila.zip

nguyenq commented 8 years ago

I tried the latest version of the program, uploaded today, on Windows 10 and found that it now works but is unstable. It failed for the Arial font and could not find Times New Roman (the two most commonly used fonts). The boxes in the generated box file were not as tight as they could be.

text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Tahoma" --fonts_dir=C:\Windows\Fonts
Rendered page 0 to file vie.arial.exp1.tif
Rtl = 0 ,vertical=0

text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font=Arial --fonts_dir=C:\Windows\Fonts
Program crashed

text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Courier New" --fonts_dir=C:\Windows\Fonts
Rendered page 0 to file vie.arial.exp1.tif
Rendered page 1 to file vie.arial.exp1.tif
Rtl = 0 ,vertical=0

text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Times New Roman" --fonts_dir=C:\Windows\Fonts
Could not find font named Times New Roman.Please correct --font arg.

text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Arial Unicode MS Regular" --fonts_dir=C:\Windows\Fonts
Rendered page 0 to file vie.arial.exp1.tif
Rtl = 0 ,vertical=0

stweil commented 8 years ago

C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_txt --outputbase=eng.MSSerifBold.exp0 --font='MS Serif Bold' --fonts_dir=

The previous command does not work because CMD on Windows does not handle 'MS Serif Bold' like a POSIX shell. It passes 'MS as font. Using "MS Serif Bold" should fix that.
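
To make the quoting difference concrete, here is a small Python sketch. `shlex.split` follows POSIX shell rules; `split_cmd_style` is a hypothetical helper written for this illustration that models only one aspect of Windows argument splitting (double quotes group, single quotes are ordinary characters). It is a simplification, not a faithful reimplementation of CMD or the C runtime.

```python
import shlex


def split_cmd_style(command_line):
    """Rough model of Windows argument splitting: only double quotes group.

    Simplified for illustration; backslash escapes and other special
    cases of the real parser are not modeled.
    """
    args, current, in_quotes = [], "", False
    for ch in command_line:
        if ch == '"':
            in_quotes = not in_quotes   # double quotes toggle grouping and are dropped
        elif ch.isspace() and not in_quotes:
            if current:
                args.append(current)
                current = ""
        else:
            current += ch               # a single quote is just another character
    if current:
        args.append(current)
    return args


single = "text2image --font='MS Serif Bold' --fonts_dir="
double = 'text2image --font="MS Serif Bold" --fonts_dir='

posix_args = shlex.split(single)        # POSIX rules: single quotes group the name
cmd_single = split_cmd_style(single)    # CMD-like rules: the name is split at spaces
cmd_double = split_cmd_style(double)    # CMD-like rules: double quotes group the name

print(posix_args)
print(cmd_single)
print(cmd_double)
```

Under this model the single-quoted form splits the font name at the first space, while the double-quoted form yields the same arguments a POSIX shell would produce from the single-quoted one.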

Could not find font named Times New Roman.Please correct --font arg.

It looks like this error message can be improved by a line break after the first sentence. I'll send a PR which fixes this small detail.

text2image --fonts_dir= --text ./langdata/san.training_text --outputbase san.exp-1 --ptsize=32 --strip_unrenderable_words --fontconfig_refresh_config_file=false --leading=32 --char_spacing=0.0 --exposure=-1 --find_fonts --min_coverage=.9 --degrade_image=1 --underline_start_prob=.05 --underline_continuation_prob=.01

That command also crashes with SIGSEGV on Linux. This is a bug which needs a fix.

zdenop commented 8 years ago

Training on Windows is not officially supported (but we accept patches):

  1. Some tools use libraries that are not common on Windows. I do not think it is worth investing time in testing the Windows implementations of these libraries.
  2. Training is needed only in special cases. In such cases it is much more efficient to use VirtualBox (or a similar tool) with Linux.
  3. We are not aware of anybody who would like to provide Windows support.

@Shreeshrii @stweil: please create a separate issue for the command that crashes on Linux, so we can track it.

vidiecan commented 8 years ago

If you want to compile text2image for windows using VS2015, you can have a look at a fully automated process at https://github.com/mazoea/te-external-tesseract using the windows CI environment (appveyor.yml).

It might take you a while to get the hang of it (hopefully hours, not months), but you will get your own text2image build that you can debug (and send PRs to tesseract).

More details:

  1. all the external dependencies are at https://github.com/mazoea/te-external
  2. read the Readme to understand the structure used throughout the process
  3. see https://github.com/mazoea/te-external/blob/master/appveyor.yml for the real commands and also check the logs by clicking on the build badge in the repository
  4. the same goes for https://github.com/mazoea/te-external-leptonica
  5. the same goes for https://github.com/mazoea/te-external-tesseract
  6. binaries will be in tesseract\projects\output
  7. look at #381

BUT those repositories are not forks of the upstream ones, so if they do not have the latest version, you have to update them yourself. In practice, this means you should check out tesseract and merge in the latest changes if they are not present. For the moment, there should be only a few of them!

Finally, do not expect a bulletproof text2image even after patching - more needs to be done to address several corner cases but you have everything needed for this mission.

Shreeshrii commented 8 years ago

@stweil The problem with the font-not-found message is not just a misplaced period. These fonts are present on Windows, but text2image is NOT finding them.

C:\Users\User>text2image --text=./langdata/eng.training_txt --outputbase=eng.MSSerifBold.exp0 --font="MS Serif Regular" --fonts_dir=
Could not find font named MS Serif Regular.Please correct --font arg.
C:\Users\User>text2image --text=./langdata/eng.training_txt --outputbase=eng.Myfont.exp0 --font="Times New Roman" --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Could not find font named Times New Roman.Please correct --font arg.

text2image --list_available_fonts shows the fonts in the list

 60: Arial

867: Times New Roman,

OK, the above shows that Times New Roman also has a "," at the end of the font name. So I tried with that, and the results differ based on the order in which the parameters are given, etc. For example, --fonts_dir= should be given first.

C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_txt --outputbase=eng.Myfont.exp0 --font="Times New Roman," --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Failed to read file: ./langdata/eng.training_txt
ReadFileToString(filename, out):Error:Assert failed:in file ../../../../training/fileio.cpp, line 85

C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_txt --outputbase=eng.Myfont.exp0 --font="Times New Roman," --fonts_dir=
Failed to read file: ./langdata/eng.training_txt
ReadFileToString(filename, out):Error:Assert failed:in file ../../../../training/fileio.cpp, line 85

C:\Users\User\Documents\shree>text2image --fonts_dir= --text ./langdata/eng.training_text --font "Times New Roman"  --outputbase eng.Times.exp0
Could not find font named Times New Roman.Please correct --font arg.

C:\Users\User\Documents\shree>text2image --fonts_dir= --text ./langdata/eng.training_text --font "Times New Roman,"  --outputbase eng.Times.exp0
Stripped 3 unrenderable words
Rendered page 0 to file eng.Times.exp0.tif
Rendered page 1 to file eng.Times.exp0.tif
Rtl = 0 ,vertical=0
amitdo commented 8 years ago

@nguyenq

text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font=Arial --fonts_dir=C:\Windows\Fonts

try --font="Arial" instead of --font=Arial.

text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Times New Roman" --fonts_dir=C:\Windows\Fonts
Could not find font named Times New Roman.Please correct --font arg.

As @Shreeshrii said, try --font="Times New Roman,".
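
Because the name has to be passed exactly as listed, trailing comma included, it can help to extract names from the `--list_available_fonts` output instead of typing them by hand. The following is a minimal Python sketch. It assumes the listing has one `number: name` entry per line, as in the excerpts quoted in this thread; `parse_font_listing` and `candidates_for` are hypothetical helper names, not part of tesseract.

```python
import re

# Sample lines copied from the --list_available_fonts excerpts in this thread.
SAMPLE_LISTING = """\
 60: Arial
867: Times New Roman,
299: Times New Roman, Bold
"""


def parse_font_listing(listing):
    """Return font names exactly as listed, keeping any trailing comma."""
    names = []
    for line in listing.splitlines():
        match = re.match(r"\s*\d+:\s(.*\S)\s*$", line)
        if match:
            names.append(match.group(1))
    return names


def candidates_for(names, family):
    """Listed names that start with the given family (case-insensitive)."""
    return [name for name in names if name.lower().startswith(family.lower())]


fonts = parse_font_listing(SAMPLE_LISTING)
print(candidates_for(fonts, "times new roman"))
```

The returned strings can then be passed unchanged as the `--font` value.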

Shreeshrii commented 8 years ago
C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_text --outputbase=vie.arial.exp1 --font=Arial --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
*** PROGRAM CRASHED ***

C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_text --outputbase=vie.arial.exp1 --font="Arial" --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
*** PROGRAM CRASHED ***

C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_text --outputbase=vie.arial.exp1 --font="Times New Roman," --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Stripped 3 unrenderable words
Rendered page 0 to file vie.arial.exp1.tif
Rendered page 1 to file vie.arial.exp1.tif
Rtl = 0 ,vertical=0

Both --font="Arial" and --font=Arial lead to a program crash, even though Arial is listed as a font when using --list_available_fonts. --font="Times New Roman," works.

*** PROGRAM CRASHED *** - the error box looks like the one shown in this image; there is no message on the console.

image

Unable to open '/tmp/fonts.conf' for writing seems to be related to the default directory being non-writable under windows.

Setting 'FC_CACHEDIR = c:/your/writable/directory' may help.

or

use "LOCAL_APPDATA_FONTCONFIG_CACHE" location for the cachedir,

ref: https://bugs.launchpad.net/inkscape/+bug/1196373

zdenop commented 8 years ago

@Shreeshrii: "--fonts_dir=" is a wrong argument

Shreeshrii commented 8 years ago

@zdenop OK

Please see my previous comment, in which I used --fonts_dir=C:\Windows\Fonts. It still crashes when the font name Arial or "Arial" is used on Windows 10.

Shreeshrii commented 8 years ago

@zdenop

On Windows 10, I get the FcInitiReinitialize failed!! error when I use --fonts_dir=C:\Windows\Fonts; it does not appear when I use --fonts_dir=

C:\Users\User\Documents\shree>text2image --fonts_dir= --text ./langdata/eng.training_text --font "Times New Roman,"  --outputbase eng.Times.exp0
Stripped 3 unrenderable words
Rendered page 0 to file eng.Times.exp0.tif
Rendered page 1 to file eng.Times.exp0.tif
Rtl = 0 ,vertical=0

C:\Users\User\Documents\shree>text2image --fonts_dir=C:\Windows\Fonts --text ./langdata/eng.training_text --font "Times New Roman,"  --outputbase eng.Times.exp0
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Stripped 3 unrenderable words
Rendered page 0 to file eng.Times.exp0.tif
Rendered page 1 to file eng.Times.exp0.tif
Rtl = 0 ,vertical=0
zdenop commented 8 years ago

@Shreeshrii: These errors are coming from external libraries (Pango/FontConfig?), which are IMO not common on Windows. IMO testing and issue reporting should be done there.

Shreeshrii commented 8 years ago

On further investigation, I see that https://github.com/tesseract-ocr/tesseract/blob/master/training/pango_font_info.cpp overrides the system and fontconfig defaults:

STRING_PARAM_FLAG(fonts_dir, "/auto/ocr-data/tesstraining/fonts",
                  "Overrides system default font location");
STRING_PARAM_FLAG(fontconfig_tmpdir, "/tmp",
                  "Overrides fontconfig default temporary dir");

The FcInitiReinitialize failed!! error seen with --fonts_dir=C:\Windows\Fonts disappears when --fontconfig_tmpdir is also specified on the command line, pointing to a writable directory.

When used the first time, it creates fonts.conf and a cache file in the specified directory which takes some time. After that, there is no delay in building cache.

C:\Users\User\Documents\shree>text2image --fonts_dir=C:\Windows\Fonts --fontconfig_tmpdir=C:\Users\User\Documents\shree --text ./langdata/san.training_text  --outputbase san.exp-1  --font FreeSerif
Rendered page 0 to file san.exp-1.tif
Rendered page 1 to file san.exp-1.tif
Rtl = 0 ,vertical=0

C:\Users\User\Documents\shree>text2image --fonts_dir=C:\Windows\Fonts --fontconfig_tmpdir=C:\Users\User\Documents\shree --text ./langdata/san.training_text  --outputbase san.exp-1  --font FreeSerif --ptsize=32 --strip_unrenderable_words --fontconfig_refresh_config_file=false --leading=32 --char_spacing=0.0 --exposure=-1 --min_coverage=.9 --degrade_image=1 --underline_start_prob=.05 --underline_continuation_prob=.01
Rendered page 0 to file san.exp-1.tif
...
Rendered page 11 to file san.exp-1.tif
Rtl = 0 ,vertical=0
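
The invocations above can also be assembled from a script, which sidesteps the shell quoting problems discussed earlier. This is only a sketch: it uses flags that appear in this thread, `build_text2image_args` is a hypothetical helper, and it assumes the platform temp directory returned by `tempfile.gettempdir()` is writable for the current user.

```python
import tempfile


def build_text2image_args(text_file, outputbase, font, fonts_dir, tmpdir=None):
    """Assemble a text2image argument list using only flags shown in this thread."""
    if tmpdir is None:
        # Assumption: the platform temp directory is writable for the current user.
        tmpdir = tempfile.gettempdir()
    return [
        "text2image",
        f"--fonts_dir={fonts_dir}",
        f"--fontconfig_tmpdir={tmpdir}",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
    ]


args = build_text2image_args(
    text_file="./langdata/san.training_text",
    outputbase="san.exp-1",
    font="FreeSerif",
    fonts_dir=r"C:\Windows\Fonts",
)
print(args)

# To actually run it (requires text2image on PATH):
# import subprocess
# subprocess.run(args, check=True)
```

Passing the arguments as a list means a font name containing spaces or a trailing comma reaches the program as a single argument, without any extra quoting.
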
Shreeshrii commented 8 years ago

As of now, the two errors still unexplained with text2image under Windows are

  1. Use of Arial font
  2. Use of --find_fonts

@zdenop I can test and report errors to Pango/FontConfig, but tesseract does not provide any error info that I can refer to.

stweil commented 8 years ago

I'm currently working on the problem with Arial. That font is found (otherwise there would be an error message), but results in SIGSEGV - maybe from an assertion. It looks like Windows buffers console messages and fails to print them before raising the SIGSEGV.

Shreeshrii commented 8 years ago

Thanks, Stefan.

ShreeDevi


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


stweil commented 8 years ago

The crash with Arial is caused by a bug in function strcasestr (locally implemented only for Windows, Linux uses the correct GLIBC implementation). Any short font name (5 characters or less) will result in a similar crash. I'll send a pull request which fixes this.

stweil commented 8 years ago

PR #406 fixes the problem with Arial (and other fonts with short names) for text2image on Windows.

stweil commented 8 years ago

Problem 2 (use of --find_fonts) is also caused by the buggy strcasestr function and fixed by PR #406:

(gdb) r
Starting program: /usr/x86_64-w64-mingw32/sys-root/mingw/bin/text2image --fonts_dir= --text ./langdata/san.training_text --outputbase san.exp-1 --ptsize=32 --strip_unrenderable_words --fontconfig_refresh_config_file=false --leading=32 --char_spacing=0.0 --exposure=-1 --find_fonts --min_coverage=.9 --degrade_image=1 --underline_start_prob=.05 --underline_continuation_prob=.01
[New Thread 10160.0x3ecc]

Program received signal SIGSEGV, Segmentation fault.
0x000000000040e089 in strcasestr (haystack=0x303afd0 "Arial", needle=0x43ba5d <tesseract::kDefaultResolution+457> "Fraktur") at ../../../../training/../vs2010/port/strcasestr.cpp:63
63                  c1 = haystack[i+j];
(gdb) i s
#0  0x000000000040e089 in strcasestr (haystack=0x303afd0 "Arial", needle=0x43ba5d <tesseract::kDefaultResolution+457> "Fraktur") at ../../../../training/../vs2010/port/strcasestr.cpp:63
#1  0x00000000004081c4 in tesseract::PangoFontInfo::ParseFontDescription (this=0x22f770, desc=0x3020000) at ../../../../training/pango_font_info.cpp:237
#2  0x0000000000408242 in tesseract::PangoFontInfo::ParseFontDescriptionName (this=0x22f770, name=...) at ../../../../training/pango_font_info.cpp:243
#3  0x000000000040a9ca in tesseract::StringRenderer::set_font (this=0x22f770, desc=...) at ../../../../training/stringrenderer.cpp:134
#4  0x000000000040a944 in tesseract::StringRenderer::StringRenderer (this=0x22f770, font_desc=..., page_width=3600, page_height=4800) at ../../../../training/stringrenderer.cpp:128
#5  0x0000000000402b61 in main (argc=1, argv=0x30389d0) at ../../../../training/text2image.cpp:462
(gdb) p i
$1 = 393264
(gdb) p length_haystack
$2 = 18446744073709551615
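
The value printed for `length_haystack` is 2^64 - 1, the largest 64-bit unsigned value, which is what an unsigned quantity looks like after wrapping below zero. The source of the Windows `strcasestr` port is not shown in this thread, so the following is not the code from PR #406. It is only a Python sketch of the general hazard and of a length-guarded case-insensitive search; `unsigned_sub` and `strcasestr_index` are hypothetical names.

```python
SIZE_T_MODULUS = 2 ** 64  # models a 64-bit unsigned size_t


def unsigned_sub(a, b):
    """Subtract the way a 64-bit unsigned integer would, wrapping below zero."""
    return (a - b) % SIZE_T_MODULUS


def strcasestr_index(haystack, needle):
    """Case-insensitive substring search; returns the start index or -1.

    The length check comes first, so no position past the end of
    the haystack is ever computed or read.
    """
    h, n = haystack.lower(), needle.lower()
    if len(n) > len(h):
        return -1
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return i
    return -1


# An unguarded unsigned subtraction wraps to a huge value instead of going negative.
print(unsigned_sub(len("Arial"), len("Fraktur")))

# With the guard, a haystack shorter than the needle is simply "not found".
print(strcasestr_index("Arial", "Fraktur"))
print(strcasestr_index("UnifrakturMaguntia", "fraktur"))
```

A loop bounded by a wrapped value like the first one would walk far past the end of a short font name, which is consistent with the out-of-range index `i` in the trace above.
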
amitdo commented 8 years ago

@nguyenq

The boxes in the generated box file were not as tight as they could be.

What do you mean? Is this also happening in Linux?

nguyenq commented 8 years ago

The latest version has fixed the issue with Arial font. Thank you.

Clearly, the tool produces inconsistent font names. Why is "Times New Roman," a valid name, especially since it is the plain style?

298: Times New Roman,
299: Times New Roman, Bold
300: Times New Roman, Bold Italic
301: Times New Roman, Italic
302: Trebuchet MS
303: Trebuchet MS Bold
304: Trebuchet MS Bold Oblique
305: Trebuchet MS Oblique
306: Verdana
307: Verdana Bold
308: Verdana Bold Oblique
309: Verdana Oblique
310: Yu Gothic
311: Yu Gothic Bold
312: Yu Gothic Bold Oblique
313: Yu Gothic Light, Light
314: Yu Gothic Medium, Medium
315: Yu Gothic Medium, Medium Oblique
316: Yu Gothic Oblique

@amitdo Almost all the generated boxes (created in Windows 10) are consistently a bit low and a bit wide. It was reported that having tightly fitted boxes would improve the quality of the generated traineddata file.

image

zdenop commented 8 years ago

@Shreeshrii: I need to correct my statement:

"--fonts_dir=" is a wrong argument

  1. I found out it is interpreted as --fonts_dir=""
  2. I found out that --fonts_dir="" resets the fonts_dir variable to the system default, e.g. if the --fonts_dir argument is not used, text2image looks for fonts in /auto/ocr-data/tesstraining/fonts
Shreeshrii commented 8 years ago

@stweil

Thank you for the changes to get text2image working on windows and for making the latest version available via installer at https://github.com/UB-Mannheim/tesseract/wiki

I have added a link to the same from https://github.com/tesseract-ocr/tesseract/wiki so that it is easily accessible.

shobamohan123 commented 6 years ago

Hi, I have downloaded jTessBoxEditor and extracted the files. I downloaded the Java Runtime Environment too. When I open the jtessboxeditor.jar file, the window pops up but I cannot interact with it. I used the same application yesterday, but today I am facing this issue. Can anyone help me sort this out?

nguyenq commented 6 years ago

@shobamohan123 Please post your issue or question related to jTessBoxEditor in the appropriate box in either https://sourceforge.net/p/vietocr/discussion or https://github.com/nguyenq.

Thanks.

CarsonSlovoka commented 1 year ago

This works for me:

# list available fonts
text2image --fontconfig_tmpdir=. -text my.txt --outputbase test.exp0 --fonts_dir="C:\xxx\myDir" --list_available_fonts
# Start
text2image --fontconfig_tmpdir=. -text my.txt --outputbase test.exp0 --fonts_dir="C:\xxx\myDir" --font myFont --ptsize 36