Closed (z0tghvunik closed this issue 8 years ago)
I have tried this before. Even if you get a binary of text2image for Windows, say with Cygwin or MSYS2, it will crash when you run it. There are some incompatibilities between the code and Windows. Just accept that it is not available on Windows.
As Quan has suggested you can use jtessboxeditor for generating the box tiff pairs and training.
Or get access to a linux machine.
On 28-Aug-2016 11:32 AM, "z0tghvunik" notifications@github.com wrote:
See guys.. I badly need Text2image.exe but i cannot find it anywhere. Is there any great soul in this world who will take the time to compile that 1 thing and upload it to mediafire or something? This one thing has consumed 5 months of my life :'-( I tried to compile it on windows 32 bit but it gave 100+ errors :'-( Dear c/cpp experts.. Instead of telling everybody how to compile, isn't it a good idea to directly provide a compiled version? I am not any intelligent software eng. I am just a normal human being. Why didn't the developers take some time to upload the compiled binaries? :'-( Somebody please help! I now feel pain in my heart for wasting 5 months of my life for 1 program. Somebody please compile 'Text2Image.cpp' for the needy who don't know how to do it.
P.S I have downloaded Tesseract 3.05 but there does not exist any 'text2image.EXE' :'-(
@Shreeshrii, there is a new installer on https://github.com/UB-Mannheim/tesseract/wiki. It includes fixes for text2image.exe. If that binary still crashes, I need all information to reproduce the crash.
C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_txt --outputbase=eng.MSSerifBold.exp0 --font='MS Serif Bold' --fonts_dir=
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Could not find font named 'MS.Please correct --font arg.
@stweil Thank you for providing the updated binary for text2image - many problems have indeed been fixed since I last looked at it. Thanks to the developers.
However, it crashed under two situations today.
text2image --fonts_dir= --text ./langdata/ara.training_text --font Arial --outputbase ara.Arial.exp0
text2image --fonts_dir= --text ./langdata/san.training_text --outputbase san.exp-1 --ptsize=32 --strip_unrenderable_words --fontconfig_refresh_config_file=false --leading=32 --char_spacing=0.0 --exposure=-1 --find_fonts --min_coverage=.9 --degrade_image=1 --underline_start_prob=.05 --underline_continuation_prob=.01
I will test further and post more feedback later.
Here is a copy of the terminal log with all commands I tried and their output. testlog1.txt
The following command creates the box/tiff pairs with degradation as well as different exposure levels, as indicated:
text2image --fonts_dir= --text ./langdata/san.training_text --ptsize=32 --degrade_image=1 --leading=32 --char_spacing=0.0 --strip_unrenderable_words --underline_start_prob=.05 --underline_continuation_prob=.01 --font Kokila --outputbase san.Kokila.exp-1 --exposure=-1
I have done this testing on Windows 10.
I tried the latest version of the program uploaded today on Windows 10 and found that it now works but is unstable. It would fail for the Arial font and could not find Times New Roman (the two most commonly used fonts). The boxes in the generated box file were not as tight as they could be.
text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Tahoma" --fonts_dir=C:\Windows\Fonts
Rendered page 0 to file vie.arial.exp1.tif
Rtl = 0 ,vertical=0
text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font=Arial --fonts_dir=C:\Windows\Fonts
Program crashed
text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Courier New" --fonts_dir=C:\Windows\Fonts
Rendered page 0 to file vie.arial.exp1.tif
Rendered page 1 to file vie.arial.exp1.tif
Rtl = 0 ,vertical=0
text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Times New Roman" --fonts_dir=C:\Windows\Fonts
Could not find font named Times New Roman.Please correct --font arg.
text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Arial Unicode MS Regular" --fonts_dir=C:\Windows\Fonts
Rendered page 0 to file vie.arial.exp1.tif
Rtl = 0 ,vertical=0
C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_txt --outputbase=eng.MSSerifBold.exp0 --font='MS Serif Bold' --fonts_dir=
The previous command does not work because CMD on Windows does not handle 'MS Serif Bold' like a POSIX shell. Single quotes do not group words in CMD, so the argument is split at the spaces and 'MS is passed as the font. Using "MS Serif Bold" should fix that.
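The difference can be sketched in Python. shlex.split follows POSIX shell rules, while cmd_like_split is a hypothetical and deliberately simplified model of how a Windows program receives its arguments: only double quotes group words, and backslash escapes and other corner cases are ignored.

```python
import shlex


def cmd_like_split(command_line):
    """Simplified, hypothetical model of Windows argument splitting.

    Only double quotes group words; single quotes are ordinary characters.
    Backslash escaping and empty quoted strings are deliberately ignored.
    """
    args, current, in_quotes = [], "", False
    for ch in command_line:
        if ch == '"':
            in_quotes = not in_quotes  # toggle grouping and drop the quote
        elif ch.isspace() and not in_quotes:
            if current:
                args.append(current)
                current = ""
        else:
            current += ch
    if current:
        args.append(current)
    return args


single = "text2image --font='MS Serif Bold' --fonts_dir="
double = 'text2image --font="MS Serif Bold" --fonts_dir='

posix_args = shlex.split(single)               # POSIX shell view
windows_args_single = cmd_like_split(single)   # single quotes are not special
windows_args_double = cmd_like_split(double)   # double quotes group the name

print(posix_args)
print(windows_args_single)
print(windows_args_double)
```

With single quotes the simplified Windows model yields --font='MS as its own argument, which matches the font name reported in the error message.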
Could not find font named Times New Roman.Please correct --font arg.
It looks like this error message can be improved by a line break after the first sentence. I'll send a PR which fixes this small detail.
text2image --fonts_dir= --text ./langdata/san.training_text --outputbase san.exp-1 --ptsize=32 --strip_unrenderable_words --fontconfig_refresh_config_file=false --leading=32 --char_spacing=0.0 --exposure=-1 --find_fonts --min_coverage=.9 --degrade_image=1 --underline_start_prob=.05 --underline_continuation_prob=.01
That command also crashes with SIGSEGV on Linux. This is a bug which needs a fix.
Training on Windows is not officially supported (but we accept patches):
@Shreeshrii @stweil: please create a separate issue for the command that crashes on Linux, so we can track it.
If you want to compile text2image for windows using VS2015, you can have a look at a fully automated process at https://github.com/mazoea/te-external-tesseract using the windows CI environment (appveyor.yml).
It might take you a while to get the hang of it (hopefully hours, not months), but you will get your own text2image build that you can debug (and send PRs to tesseract).
More details:
BUT those repositories are not forks of the originals, so if they do not have the latest version, you have to update them yourself. In practice, this means you should check out tesseract and merge in the latest changes if they are not present. For the moment, there should be only a few of them!
Finally, do not expect a bulletproof text2image even after patching - more needs to be done to address several corner cases but you have everything needed for this mission.
@stweil The problem with the font-not-found message is not just the misplaced period. These fonts are there on Windows but text2image is NOT finding them.
C:\Users\User>text2image --text=./langdata/eng.training_txt --outputbase=eng.MSSerifBold.exp0 --font="MS Serif Regular" --fonts_dir=
Could not find font named MS Serif Regular.Please correct --font arg.
C:\Users\User>text2image --text=./langdata/eng.training_txt --outputbase=eng.Myfont.exp0 --font="Times New Roman" --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Could not find font named Times New Roman.Please correct --font arg.
text2image --list_available_fonts
shows the fonts in the list
60: Arial
867: Times New Roman,
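OK, the above shows that Times New Roman also has a comma at the end of the font name. So I tried with that, and the results differ based on the order in which the parameters are given, e.g. --fonts_dir= should be given first. (Note that the failing commands below also read ./langdata/eng.training_txt, while the working ones read eng.training_text, so the file name may be what actually makes the difference.)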
C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_txt --outputbase=eng.Myfont.exp0 --font="Times New Roman," --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Failed to read file: ./langdata/eng.training_txt
ReadFileToString(filename, out):Error:Assert failed:in file ../../../../training/fileio.cpp, line 85
C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_txt --outputbase=eng.Myfont.exp0 --font="Times New Roman," --fonts_dir=
Failed to read file: ./langdata/eng.training_txt
ReadFileToString(filename, out):Error:Assert failed:in file ../../../../training/fileio.cpp, line 85
C:\Users\User\Documents\shree>text2image --fonts_dir= --text ./langdata/eng.training_text --font "Times New Roman" --outputbase eng.Times.exp0
Could not find font named Times New Roman.Please correct --font arg.
C:\Users\User\Documents\shree>text2image --fonts_dir= --text ./langdata/eng.training_text --font "Times New Roman," --outputbase eng.Times.exp0
Stripped 3 unrenderable words
Rendered page 0 to file eng.Times.exp0.tif
Rendered page 1 to file eng.Times.exp0.tif
Rtl = 0 ,vertical=0
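Since the name apparently has to match the listing exactly, including a trailing comma, it can help to copy names straight from the --list_available_fonts output. A minimal Python sketch, assuming each listing line has the form "index: name" as in the excerpt above; parse_font_list is a hypothetical helper, not part of Tesseract.

```python
def parse_font_list(listing):
    """Map each index to its font name, preserving trailing commas."""
    fonts = {}
    for line in listing.splitlines():
        index, sep, name = line.partition(":")
        if sep and index.strip().isdigit():
            # Strip surrounding whitespace only; commas are kept verbatim.
            fonts[int(index)] = name.strip()
    return fonts


sample = "60: Arial\n867: Times New Roman,\n"
fonts = parse_font_list(sample)
for index, name in sorted(fonts.items()):
    # Double quotes keep a name with spaces together as one argument in CMD.
    print(f'--font="{name}"')
```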
@nguyenq
text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font=Arial --fonts_dir=C:\Windows\Fonts
Try --font="Arial" instead of --font=Arial.
text2image --text=vie-data.txt --outputbase=vie.arial.exp1 --font="Times New Roman" --fonts_dir=C:\Windows\Fonts
Could not find font named Times New Roman.Please correct --font arg.
As @Shreeshrii said, try --font="Times New Roman,".
C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_text --outputbase=vie.arial.exp1 --font=Arial --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
*** PROGRAM CRASHED ***
C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_text --outputbase=vie.arial.exp1 --font="Arial" --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
*** PROGRAM CRASHED ***
C:\Users\User\Documents\shree>text2image --text=./langdata/eng.training_text --outputbase=vie.arial.exp1 --font="Times New Roman," --fonts_dir=C:\Windows\Fonts
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Stripped 3 unrenderable words
Rendered page 0 to file vie.arial.exp1.tif
Rendered page 1 to file vie.arial.exp1.tif
Rtl = 0 ,vertical=0
Both --font="Arial" and --font=Arial lead to a program crash, even though Arial is listed as a font when using --list_available_fonts. --font="Times New Roman," works.
*** PROGRAM CRASHED *** - an error box pops up (as shown in this image); there is no message on the console.
Unable to open '/tmp/fonts.conf' for writing seems to be related to the default directory being non-writable under Windows.
Setting 'FC_CACHEDIR = c:/your/writable/directory' may help, or use the "LOCAL_APPDATA_FONTCONFIG_CACHE" location for the cachedir.
@Shreeshrii: "--fonts_dir=" is the wrong argument
@zdenop OK
Please see my previous comment, in which I used --fonts_dir=C:\Windows\Fonts. It still crashes when the font name Arial or "Arial" is used on Windows 10.
@zdenop
On Windows 10, I get the FcInitiReinitialize failed!! error when I use --fonts_dir=C:\Windows\Fonts, which does not appear when I use --fonts_dir=
C:\Users\User\Documents\shree>text2image --fonts_dir= --text ./langdata/eng.training_text --font "Times New Roman," --outputbase eng.Times.exp0
Stripped 3 unrenderable words
Rendered page 0 to file eng.Times.exp0.tif
Rendered page 1 to file eng.Times.exp0.tif
Rtl = 0 ,vertical=0
C:\Users\User\Documents\shree>text2image --fonts_dir=C:\Windows\Fonts --text ./langdata/eng.training_text --font "Times New Roman," --outputbase eng.Times.exp0
Unable to open '/tmp/fonts.conf' for writing
Fontconfig error: Cannot load default config file
FcInitiReinitialize failed!!
Stripped 3 unrenderable words
Rendered page 0 to file eng.Times.exp0.tif
Rendered page 1 to file eng.Times.exp0.tif
Rtl = 0 ,vertical=0
@Shreeshrii: These errors are coming from an external library (Pango/FontConfig?), which is IMO not common on Windows. IMO testing and issue reporting should be done there.
On further investigation, I see that https://github.com/tesseract-ocr/tesseract/blob/master/training/pango_font_info.cpp overrides the system and fontconfig defaults:
STRING_PARAM_FLAG(fonts_dir, "/auto/ocr-data/tesstraining/fonts",
"Overrides system default font location");
STRING_PARAM_FLAG(fontconfig_tmpdir, "/tmp",
"Overrides fontconfig default temporary dir");
The FcInitiReinitialize failed!! error when using --fonts_dir=C:\Windows\Fonts disappears when specifying fontconfig_tmpdir on the command line.
When used the first time, it creates fonts.conf and a cache file in the specified directory, which takes some time. After that, there is no delay in building the cache.
C:\Users\User\Documents\shree>text2image --fonts_dir=C:\Windows\Fonts --fontconfig_tmpdir=C:\Users\User\Documents\shree --text ./langdata/san.training_text --outputbase san.exp-1 --font FreeSerif
Rendered page 0 to file san.exp-1.tif
Rendered page 1 to file san.exp-1.tif
Rtl = 0 ,vertical=0
C:\Users\User\Documents\shree>text2image --fonts_dir=C:\Windows\Fonts --fontconfig_tmpdir=C:\Users\User\Documents\shree --text ./langdata/san.training_text --outputbase san.exp-1 --font FreeSerif --ptsize=32 --strip_unrenderable_words --fontconfig_refresh_config_file=false --leading=32 --char_spacing=0.0 --exposure=-1 --min_coverage=.9 --degrade_image=1 --underline_start_prob=.05 --underline_continuation_prob=.01
Rendered page 0 to file san.exp-1.tif
...
Rendered page 11 to file san.exp-1.tif
Rtl = 0 ,vertical=0
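The workaround above can be scripted. Below is a minimal Python sketch that builds (but does not run) such a command line. build_text2image_args is a hypothetical helper, the paths are placeholders, and only flags that already appear in this thread are used.

```python
import os
import tempfile


def build_text2image_args(text, outputbase, font, fonts_dir, tmpdir):
    """Return an argv list for text2image with a writable fontconfig dir."""
    if not os.access(tmpdir, os.W_OK):
        raise ValueError(f"fontconfig tmpdir is not writable: {tmpdir}")
    return [
        "text2image",
        f"--fonts_dir={fonts_dir}",
        f"--fontconfig_tmpdir={tmpdir}",
        f"--text={text}",
        f"--outputbase={outputbase}",
        f"--font={font}",
    ]


# Placeholder values; substitute your own paths and font name.
work_dir = tempfile.mkdtemp()
args = build_text2image_args(
    text="./langdata/san.training_text",
    outputbase="san.exp-1",
    font="FreeSerif",
    fonts_dir=r"C:\Windows\Fonts",
    tmpdir=work_dir,
)
print(args)
```

Passing the list to subprocess.run(args, check=True) hands each item over as one argument, so a font name containing spaces would need no extra quoting.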
As of now, the two errors still unexplained with text2image under Windows are
@zdenop I can test and report errors to Pango/FontConfig, but tesseract does not provide any error info that I can refer to.
I'm currently working on the problem with Arial. That font is found (otherwise there would be an error message), but results in SIGSEGV - maybe from an assertion. It looks like Windows buffers console messages and fails to print them before raising the SIGSEGV.
Thanks, Stefan.
ShreeDevi
The crash with Arial is caused by a bug in function strcasestr (locally implemented only for Windows, Linux uses the correct GLIBC implementation). Any short font name (5 characters or less) will result in a similar crash. I'll send a pull request which fixes this.
PR #406 fixes the problem with Arial (and other fonts with short names) for text2image on Windows.
Problem 2 (use of --find_fonts) is also caused by the buggy strcasestr function and fixed by PR #406:
(gdb) r
Starting program: /usr/x86_64-w64-mingw32/sys-root/mingw/bin/text2image --fonts_dir= --text ./langdata/san.training_text --outputbase san.exp-1 --ptsize=32 --strip_unrenderable_words --fontconfig_refresh_config_file=false --leading=32 --char_spacing=0.0 --exposure=-1 --find_fonts --min_coverage=.9 --degrade_image=1 --underline_start_prob=.05 --underline_continuation_prob=.01
[New Thread 10160.0x3ecc]
Program received signal SIGSEGV, Segmentation fault.
0x000000000040e089 in strcasestr (haystack=0x303afd0 "Arial", needle=0x43ba5d <tesseract::kDefaultResolution+457> "Fraktur") at ../../../../training/../vs2010/port/strcasestr.cpp:63
63 c1 = haystack[i+j];
(gdb) i s
#0 0x000000000040e089 in strcasestr (haystack=0x303afd0 "Arial", needle=0x43ba5d <tesseract::kDefaultResolution+457> "Fraktur") at ../../../../training/../vs2010/port/strcasestr.cpp:63
#1 0x00000000004081c4 in tesseract::PangoFontInfo::ParseFontDescription (this=0x22f770, desc=0x3020000) at ../../../../training/pango_font_info.cpp:237
#2 0x0000000000408242 in tesseract::PangoFontInfo::ParseFontDescriptionName (this=0x22f770, name=...) at ../../../../training/pango_font_info.cpp:243
#3 0x000000000040a9ca in tesseract::StringRenderer::set_font (this=0x22f770, desc=...) at ../../../../training/stringrenderer.cpp:134
#4 0x000000000040a944 in tesseract::StringRenderer::StringRenderer (this=0x22f770, font_desc=..., page_width=3600, page_height=4800) at ../../../../training/stringrenderer.cpp:128
#5 0x0000000000402b61 in main (argc=1, argv=0x30389d0) at ../../../../training/text2image.cpp:462
(gdb) p i
$1 = 393264
(gdb) p length_haystack
$2 = 18446744073709551615
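The value gdb prints for length_haystack is 2^64 - 1, which is what a 64-bit unsigned subtraction yields when the true result would be -1. The Python sketch below reproduces that arithmetic and shows a bounds-checked case-insensitive search. It is an illustration only: the actual expression in strcasestr.cpp is not shown in this thread, so the formula used here (haystack length minus needle length plus one) is an assumption, chosen because it reproduces the printed value for "Arial" and "Fraktur".

```python
SIZE_T_MASK = 2**64 - 1  # models 64-bit unsigned (size_t) arithmetic


def start_positions_unsigned(haystack, needle):
    """Number of candidate start positions, computed with wraparound."""
    return (len(haystack) - len(needle) + 1) & SIZE_T_MASK


def find_ignore_case(haystack, needle):
    """Bounds-checked search: index of needle in haystack, or -1."""
    if len(needle) > len(haystack):
        # A longer needle can never match. Python's str.find would cope
        # anyway; the guard mirrors the check a C implementation needs.
        return -1
    return haystack.lower().find(needle.lower())


wrapped = start_positions_unsigned("Arial", "Fraktur")
print(wrapped)  # 18446744073709551615, the value gdb shows
print(find_ignore_case("Arial", "Fraktur"))    # -1
print(find_ignore_case("Arial Bold", "bold"))  # 6
```

Under this assumption a six-character name gives zero start positions and anything shorter wraps around, which is consistent with the remark that font names of 5 characters or less crash.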
@nguyenq
The boxes in the generated box file were not as tight as they could be.
What do you mean? Is this also happening in Linux?
The latest version has fixed the issue with Arial font. Thank you.
Clearly, the tool produces inconsistent font names. Why is "Times New Roman," a valid name, especially since it is the plain style?
298: Times New Roman,
299: Times New Roman, Bold
300: Times New Roman, Bold Italic
301: Times New Roman, Italic
302: Trebuchet MS
303: Trebuchet MS Bold
304: Trebuchet MS Bold Oblique
305: Trebuchet MS Oblique
306: Verdana
307: Verdana Bold
308: Verdana Bold Oblique
309: Verdana Oblique
310: Yu Gothic
311: Yu Gothic Bold
312: Yu Gothic Bold Oblique
313: Yu Gothic Light, Light
314: Yu Gothic Medium, Medium
315: Yu Gothic Medium, Medium Oblique
316: Yu Gothic Oblique
@amitdo Almost all the generated boxes (created in Windows 10) are consistently a bit low and a bit wide. It was reported that having tightly fitted boxes would improve the quality of the generated traineddata file.
@Shreeshrii: I need to correct my statement:
"--fonts_dir=" is the wrong argument
- I found out it is interpreted as --fonts_dir=""
- I found out that --fonts_dir="" resets the fonts_dir variable to the system default; if the --fonts_dir argument is not used, text2image looks for fonts in /auto/ocr-data/tesstraining/fonts
@stweil
Thank you for the changes to get text2image working on windows and for making the latest version available via installer at https://github.com/UB-Mannheim/tesseract/wiki
I have added a link to the same from https://github.com/tesseract-ocr/tesseract/wiki so that it is easily accessible.
Hi, I have downloaded jtessboxeditor and extracted the files. I downloaded the Java runtime environment too. When I open the jtessboxeditor.jar file, the window pops up but I cannot access it. I used the same application yesterday, but today I am facing this issue. Can anyone help me sort this out?
@shobamohan123 Please post your issue or question related to jTessBoxEditor in the appropriate box in either https://sourceforge.net/p/vietocr/discussion or https://github.com/nguyenq.
Thanks.
This works for me
# List available fonts
text2image --fontconfig_tmpdir=. -text my.txt --outputbase test.exp0 --fonts_dir="C:\xxx\myDir" --list_available_fonts
# Start
text2image --fontconfig_tmpdir=. -text my.txt --outputbase test.exp0 --fonts_dir="C:\xxx\myDir" --font myFont --ptsize 36