"Phase UP: Generating unicharset and unichar properties files" ERROR

tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

https://tesseract-ocr.github.io/

Apache License 2.0

"Phase UP: Generating unicharset and unichar properties files" ERROR #1147

Closed ivanzz1001 closed 6 years ago

ivanzz1001 commented 7 years ago

Environment

Tesseract Version: tesseract 4.00.00alpha
Commit Number: 2cc531e
Platform: Linux localhost.localdomain 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

I execute the following command:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ../tessdata \
  --fontlist "AR PL UKai CN" \
  "AR PL UKai HK" \
  "AR PL UKai TW" \
  "AR PL UKai TW MBE" \
  "AR PL UMing CN Semi-Light" \
  "AR PL UMing HK Semi-Light" \
  "AR PL UMing TW MBE Semi-Light" \
  "AR PL UMing TW Semi-Light" \
  "Arial Unicode MS" \
  "FangSong" \
  "KaiTi" \
  "LiSu" \
  "Microsoft YaHei" \
  "Microsoft YaHei Bold" \
  "NSimSun" \
  "Noto Sans SC" \
  "Noto Sans SC Bold" \
  "Noto Sans SC Heavy" \
  "Noto Sans SC Medium" \
  "Noto Sans SC Medium" \
  "Noto Sans SC Semi-Light" \
  "Noto Sans SC Semi-Light" \
  "STFangsong" \
  "STKaiti" \
  "STSong" \
  "STXihei" \
  "STXinwei" \
  "STZhongsong" \
  "SimHei" \
  "SimSun" \
  "WenQuanYi Micro Hei" \
  "WenQuanYi Micro Hei Mono" \
  "WenQuanYi Zen Hei Medium" \
  "WenQuanYi Zen Hei Mono Medium" \
  "WenQuanYi Zen Hei Sharp Medium" \
  "YouYuan" \
  --output_dir ../tesstutorial/chieval \
  --overwrite

But when it goes to Phase UP, it generates the following error:

=== Phase UP: Generating unicharset and unichar properties files ===
which: no unicharset_extractor in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no unicharset_extractor in (./api)
[Thu Sep 21 02:23:48 PDT 2017] /root/tesseract-src/tesseract-master/training/unicharset_extractor --output_unicharset /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.unicharset --norm_mode 1 /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.AR_PL_UKai_CN.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.AR_PL_UKai_HK.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.AR_PL_UKai_TW.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.AR_PL_UKai_TW_MBE.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.AR_PL_UMing_CN_Semi-Light.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.AR_PL_UMing_HK_Semi-Light.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.AR_PL_UMing_TW_MBE_Semi-Light.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.AR_PL_UMing_TW_Semi-Light.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.FangSong.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.KaiTi.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.LiSu.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.Microsoft_YaHei_Bold.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.Microsoft_YaHei.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.Noto_Sans_SC_Bold.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.Noto_Sans_SC.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.Noto_Sans_SC_Heavy.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.Noto_Sans_SC_Medium.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.Noto_Sans_SC_Semi-Light.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.NSimSun.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.SimHei.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.SimSun.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.STFangsong.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.STKaiti.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.STSong.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.STXihei.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.STXinwei.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.STZhongsong.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.WenQuanYi_Micro_Hei.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.WenQuanYi_Micro_Hei_Mono.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.WenQuanYi_Zen_Hei_Medium.exp0.box 
/tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.WenQuanYi_Zen_Hei_Mono_Medium.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.WenQuanYi_Zen_Hei_Sharp_Medium.exp0.box /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.YouYuan.exp0.box
Extracting unicharset from box file /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.AR_PL_UKai_CN.exp0.box
Invalid Unicode codepoint: 0xffffffe8
IsValidCodepoint(ch):Error:Assert failed:in file normstrngs.cpp, line 225
ERROR: /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.unicharset does not exist or is not readable

1) In https://github.com/tesseract-ocr/langdata, I can't find chi_sim.unicharset, but I found it in the Tesseract source directory: tesseract-master/testdata/chi_sim.unicharset. The langdata repository does contain two other files, Han.unicharset and Han.xheights, which also look like they are for Chinese. Which unicharset file should I use, and where do I need to place it?

2) For now I just copied tesseract-master/testdata/chi_sim.unicharset to the langdata/chi_sim/ directory, but I still get the following error:

ERROR: /tmp/tmp.Tf9BFjjy6w/chi_sim/chi_sim.unicharset does not exist or is not readable

3) In the temporary directory "/tmp/tmp.Tf9BFjjy6w/chi_sim", I have the following files:

[root@localhost tesseract-master]# ls /tmp/tmp.Tf9BFjjy6w/chi_sim/
chi_sim.AR_PL_UKai_CN.exp0.box                  chi_sim.Microsoft_YaHei_Bold.exp0.tif     chi_sim.STSong.exp0.box
chi_sim.AR_PL_UKai_CN.exp0.tif                  chi_sim.Microsoft_YaHei.exp0.box          chi_sim.STSong.exp0.tif
chi_sim.AR_PL_UKai_HK.exp0.box                  chi_sim.Microsoft_YaHei.exp0.tif          chi_sim.STXihei.exp0.box
chi_sim.AR_PL_UKai_HK.exp0.tif                  chi_sim.Noto_Sans_SC_Bold.exp0.box        chi_sim.STXihei.exp0.tif
chi_sim.AR_PL_UKai_TW.exp0.box                  chi_sim.Noto_Sans_SC_Bold.exp0.tif        chi_sim.STXinwei.exp0.box
chi_sim.AR_PL_UKai_TW.exp0.tif                  chi_sim.Noto_Sans_SC.exp0.box             chi_sim.STXinwei.exp0.tif
chi_sim.AR_PL_UKai_TW_MBE.exp0.box              chi_sim.Noto_Sans_SC.exp0.tif             chi_sim.STZhongsong.exp0.box
chi_sim.AR_PL_UKai_TW_MBE.exp0.tif              chi_sim.Noto_Sans_SC_Heavy.exp0.box       chi_sim.STZhongsong.exp0.tif
chi_sim.AR_PL_UMing_CN_Semi-Light.exp0.box      chi_sim.Noto_Sans_SC_Heavy.exp0.tif       chi_sim.WenQuanYi_Micro_Hei.exp0.box
chi_sim.AR_PL_UMing_CN_Semi-Light.exp0.tif      chi_sim.Noto_Sans_SC_Medium.exp0.box      chi_sim.WenQuanYi_Micro_Hei.exp0.tif
chi_sim.AR_PL_UMing_HK_Semi-Light.exp0.box      chi_sim.Noto_Sans_SC_Medium.exp0.tif      chi_sim.WenQuanYi_Micro_Hei_Mono.exp0.box
chi_sim.AR_PL_UMing_HK_Semi-Light.exp0.tif      chi_sim.Noto_Sans_SC_Semi-Light.exp0.box  chi_sim.WenQuanYi_Micro_Hei_Mono.exp0.tif
chi_sim.AR_PL_UMing_TW_MBE_Semi-Light.exp0.box  chi_sim.Noto_Sans_SC_Semi-Light.exp0.tif  chi_sim.WenQuanYi_Zen_Hei_Medium.exp0.box
chi_sim.AR_PL_UMing_TW_MBE_Semi-Light.exp0.tif  chi_sim.NSimSun.exp0.box                  chi_sim.WenQuanYi_Zen_Hei_Medium.exp0.tif
chi_sim.AR_PL_UMing_TW_Semi-Light.exp0.box      chi_sim.NSimSun.exp0.tif                  chi_sim.WenQuanYi_Zen_Hei_Mono_Medium.exp0.box
chi_sim.AR_PL_UMing_TW_Semi-Light.exp0.tif      chi_sim.SimHei.exp0.box                   chi_sim.WenQuanYi_Zen_Hei_Mono_Medium.exp0.tif
chi_sim.FangSong.exp0.box                       chi_sim.SimHei.exp0.tif                   chi_sim.WenQuanYi_Zen_Hei_Sharp_Medium.exp0.box
chi_sim.FangSong.exp0.tif                       chi_sim.SimSun.exp0.box                   chi_sim.WenQuanYi_Zen_Hei_Sharp_Medium.exp0.tif
chi_sim.KaiTi.exp0.box                          chi_sim.SimSun.exp0.tif                   chi_sim.YouYuan.exp0.box
chi_sim.KaiTi.exp0.tif                          chi_sim.STFangsong.exp0.box               chi_sim.YouYuan.exp0.tif
chi_sim.LiSu.exp0.box                           chi_sim.STFangsong.exp0.tif               tesstrain.log
chi_sim.LiSu.exp0.tif                           chi_sim.STKaiti.exp0.box
chi_sim.Microsoft_YaHei_Bold.exp0.box           chi_sim.STKaiti.exp0.tif

I opened the .tif files with an image tool and found that they contain only part of the content of chi_sim.training_text (from langdata/chi_sim). Why is that, and how do I fix it?

Expected Behavior:

Suggested Fix:

Shreeshrii commented 7 years ago

It is getting errors related to the program unicharset_extractor.

Please see known and still open issue: https://github.com/tesseract-ocr/tesseract/issues/1114

ivanzz1001 commented 7 years ago

@Shreeshrii I think I have found the reason. I have the following chi_sim.training_text (only the first line is shown here):

1996规格器皿 砝2.5、客胫骨发电All 联络 其、鄞州 Education嫉处感谢铁道

I added some print statements to the following function (tesseract_master/training/unicharset_extractor.cpp):

// Helper normalizes and segments the given strings according to norm_mode, and
// adds the segmented parts to unicharset.
static void AddStringsToUnicharset(const GenericVector<STRING>& strings,
                                   int norm_mode, UNICHARSET* unicharset) {
  for (int i = 0; i < strings.size(); ++i) {
    std::vector<string> normalized;
    if (NormalizeCleanAndSegmentUTF8(UnicodeNormMode::kNFC, OCRNorm::kNone,
                                     static_cast<GraphemeNormMode>(norm_mode),
                                     /*report_errors*/ true,
                                     strings[i].string(), &normalized)) {
      tprintf("string: %s\n", strings[i].string());  // added debug output
      for (const string& normed : normalized) {
        tprintf("string2:%s\n", normed.c_str());  // added debug output
        if (normed.empty() || IsWhitespace(normed[0])) continue;
        unicharset->unichar_insert(normed.c_str());
      }
    } else {
      tprintf("Normalization failed for string '%s'\n", strings[i].c_str());
    }
  }
}

It prints the following:

string: 1
string2:1
string: 9
string2:9
string: 9
string2:9
string: 6
string2:6
string: 规
string2:规
Invalid Unicode codepoint: 0xffffffe8
IsValidCodepoint(ch):Error:Assert failed:in file normstrngs.cpp, line 225
ERROR: /tmp/tmp.8gMFI2Gry5/chi_sim/chi_sim.unicharset does not exist or is not readable

Then I write a short test:

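#include <stdio.h>
#include <stdlib.h>

#include <string>
using namespace std;

int main(int argc, char *argv[]) {
        string s = "规";  // three UTF-8 bytes: E8 A7 84
        printf("size:%zu\n", s.size());
        // char is signed on this platform, so each byte is sign-extended
        // to int before printing (0xE8 prints as ffffffe8).
        for (size_t i = 0; i < s.size(); i++)
                printf("%x ", s[i]);
        printf("\n");
        return 0;
}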

Execute the test:

ivan1001@ceph-admin:~/test-src$ ./test
size:3
ffffffe8 ffffffa7 ffffff84

Now I see that "0xffffffe8 0xffffffa7 0xffffff84" are the UTF-8 bytes of "规" (0xE8 0xA7 0x84, sign-extended because char is signed here), not the Unicode (UTF-32) code point that the check in unicharset_extractor.cpp expects. The Unicode code point of "规" is U+89C4, which lies in the valid range [0, 0xD800) or [0xE000, 0x10FFFF].
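As a rough illustration of the relationship between those three bytes and the code point, the sketch below decodes the first UTF-8 sequence of a string into a 32-bit value. DecodeFirstCodepoint is a hypothetical helper written for this note, not a Tesseract function, and it performs only minimal validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Tesseract): decodes the first UTF-8
// sequence in `s` and returns its code point, or 0xFFFFFFFF if the
// sequence is empty, truncated, or malformed.
static uint32_t DecodeFirstCodepoint(const std::string& s) {
  const uint32_t kInvalid = 0xFFFFFFFFu;
  if (s.empty()) return kInvalid;
  // Read bytes as unsigned so that 0xE8 stays 0xE8 instead of becoming -24.
  const unsigned char b0 = static_cast<unsigned char>(s[0]);
  uint32_t cp = 0;
  int extra = 0;  // number of continuation bytes that follow the lead byte
  if (b0 < 0x80) {
    cp = b0;
  } else if ((b0 & 0xE0) == 0xC0) {
    cp = b0 & 0x1F;
    extra = 1;
  } else if ((b0 & 0xF0) == 0xE0) {
    cp = b0 & 0x0F;
    extra = 2;
  } else if ((b0 & 0xF8) == 0xF0) {
    cp = b0 & 0x07;
    extra = 3;
  } else {
    return kInvalid;  // stray continuation byte or invalid lead byte
  }
  if (s.size() < static_cast<std::size_t>(extra) + 1) return kInvalid;
  for (int i = 1; i <= extra; ++i) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return kInvalid;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
  }
  return cp;
}
```

With this sketch, the bytes E8 A7 84 decode to 0x89C4, which is the value a UTF-32 validity check would need to see.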

I think the program's logic has a problem here. First it uses the function NormalizeCleanAndSegmentUTF8() to convert the string to UTF-8 encoding, but afterwards it uses a Unicode (UTF-32) function, IsValidCodepoint(), to check the UTF-8 result.

Please check. How do I report this bug to the developers, or could you help me?

Shreeshrii commented 7 years ago

@ivanzz1001 Thanks for looking into this issue and finding a possible reason. I do not know enough about Tesseract and C++ to comment.

Tagging @theraysmith and @stweil - related issue https://github.com/tesseract-ocr/tesseract/issues/1114

I had also thought that it was related in some way to the conversion of the UTF-8 training text to UTF-32, but did not know how to check for it.

ivanzz1001 commented 7 years ago

@Shreeshrii @stweil The following function:

static void AddStringsToUnicharset(const GenericVector<STRING>& strings,
                                   int norm_mode, UNICHARSET* unicharset) {
  for (int i = 0; i < strings.size(); ++i) {
    std::vector<string> normalized;
    if (NormalizeCleanAndSegmentUTF8(UnicodeNormMode::kNFC, OCRNorm::kNone,
                                     static_cast<GraphemeNormMode>(norm_mode),
                                     /*report_errors*/ true,
                                     strings[i].string(), &normalized)) {
      for (const string& normed : normalized) {
        if (normed.empty() || IsWhitespace(normed[0])) continue;
        unicharset->unichar_insert(normed.c_str());
      }
    } else {
      tprintf("Normalization failed for string '%s'\n", strings[i].c_str());
    }
  }
}

It may need to change to:

if (normed.empty() || IsUTF8Whitespace(normed[0])) continue;
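To illustrate the failure mode reported earlier in this thread: when a signed char holding a UTF-8 lead byte is widened to a 32-bit code point type, it turns into a very large value that fails a range check like the one described above. The sketch below assumes the valid ranges quoted in the thread ([0, 0xD800) and [0xE000, 0x10FFFF]); IsValidCodepointSketch and WidenSignedByte are hypothetical names, not Tesseract's own functions.

```cpp
#include <cassert>

// Hypothetical stand-in for a code point validity check, using the ranges
// quoted in this thread: [0, 0xD800) or [0xE000, 0x10FFFF].
static bool IsValidCodepointSketch(char32_t ch) {
  return ch < 0xD800 || (ch >= 0xE000 && ch <= 0x10FFFF);
}

// Mimics what happens when a single (signed) byte of a UTF-8 string is
// passed to a function that takes a 32-bit code point: the negative value
// wraps around modulo 2^32.
static char32_t WidenSignedByte(signed char byte) {
  return static_cast<char32_t>(byte);
}
```

The byte 0xE8 is -24 as a signed char, so it widens to 0xFFFFFFE8, the same value that appears in the "Invalid Unicode codepoint" message, whereas the decoded code point 0x89C4 passes the check.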

amitdo commented 7 years ago

First it uses the function NormalizeCleanAndSegmentUTF8() to convert the string to UTF-8 encoding

The first thing it does is call NormalizeUTF8ToUTF32().

ivanzz1001 commented 7 years ago

@amitdo At the end, the result is converted back to UTF-8:

bool NormalizeCleanAndSegmentUTF8(UnicodeNormMode u_mode, OCRNorm ocr_normalize,
                                  GraphemeNormMode g_mode, bool report_errors,
                                  const char* str8,
                                  std::vector<string>* graphemes) {
  std::vector<char32> normed32;
  NormalizeUTF8ToUTF32(u_mode, ocr_normalize, str8, &normed32);
  StripJoiners(&normed32);
  std::vector<std::vector<char32>> graphemes32;
  bool success = Validator::ValidateCleanAndSegment(g_mode, report_errors,
                                                    normed32, &graphemes32);
  if (g_mode != GraphemeNormMode::kSingleString && success) {
    // If we modified the string to clean it up, the segmentation may not be
    // correct, so check for changes and do it again.
    std::vector<char32> cleaned32;
    for (const auto& g : graphemes32) {
      cleaned32.insert(cleaned32.end(), g.begin(), g.end());
    }
    if (cleaned32 != normed32) {
      graphemes32.clear();
      success = Validator::ValidateCleanAndSegment(g_mode, report_errors,
                                                   cleaned32, &graphemes32);
    }
  }
  graphemes->clear();
  graphemes->reserve(graphemes32.size());
  for (const auto& grapheme : graphemes32) {
    graphemes->push_back(UNICHAR::UTF32ToUTF8(grapheme));
  }
  return success;
}
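The last loop of the function above turns each UTF-32 grapheme back into a UTF-8 string. As a minimal sketch of what such a conversion involves for a single code point, the hypothetical EncodeUtf8 below (an illustration only, not the implementation of UNICHAR::UTF32ToUTF8) emits one to four bytes depending on the value; it assumes the input is already a valid code point.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration (not Tesseract code): encodes one code point
// as UTF-8. Assumes `cp` is a valid code point (no surrogate/range checks).
static std::string EncodeUtf8(char32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For U+89C4 this yields the three bytes E8 A7 84, so any code that later indexes the resulting string with `[0]` sees only the lead byte, not the code point.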

ivanzz1001 commented 7 years ago

@Shreeshrii I used the following modified source code (tesseract_master/training/unicharset_extractor.cpp):

if (normed.empty() || IsUTF8Whitespace(normed.c_str())) continue;

and generated the .lstmf files. The modification above seems to be OK, but I haven't finished all the training steps, so I can't be absolutely sure that it works.
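A rough sketch of why a whitespace test on the whole UTF-8 string is safer than a test on its first byte: the check can recognize multi-byte whitespace and never has to interpret a lone byte as a code point. LooksLikeUtf8Whitespace is a hypothetical helper for illustration; it is not Tesseract's IsUTF8Whitespace and only knows ASCII space, tab, newline, carriage return, and the ideographic space U+3000.

```cpp
#include <cassert>

// Hypothetical illustration (not Tesseract's IsUTF8Whitespace): returns true
// if `text` is non-empty and consists only of ASCII whitespace and/or the
// ideographic space U+3000 (UTF-8 bytes E3 80 80).
static bool LooksLikeUtf8Whitespace(const char* text) {
  if (text == nullptr || *text == '\0') return false;
  const unsigned char* p = reinterpret_cast<const unsigned char*>(text);
  while (*p != 0) {
    if (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') {
      ++p;
      continue;
    }
    // Short-circuit evaluation stops at the terminator, so p[1] and p[2]
    // are never read past the end of the string.
    if (p[0] == 0xE3 && p[1] == 0x80 && p[2] == 0x80) {
      p += 3;
      continue;
    }
    return false;
  }
  return true;
}
```

Because the bytes are read as unsigned char, a lead byte such as 0xE8 is simply "not whitespace" instead of an out-of-range code point.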

ivanzz1001 commented 7 years ago

@Shreeshrii BTW, when I use training/tesstrain.sh to generate the chi_sim_XXX.lstmf files, why must the "--tessdata_dir ../tessdata" directory contain eng.traineddata? If the directory doesn't contain eng.traineddata, the script keeps looking for the file, but here I just want to train chi_sim.

Shreeshrii commented 7 years ago

tesseract checks for osd and eng.traineddata at start of program. It has been there for many years and hasn't been changed even though now it handles many more languages.

The developers have other higher priority bugs to fix/ features to add, and so the requirement for eng.traineddata remains.

ivanzz1001 commented 7 years ago

@Shreeshrii When I execute "training/tesstrain.sh", I later run into the following:

=== Constructing LSTM training data ===
which: no combine_lang_model in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no combine_lang_model in (./api)
[Tue Sep 26 21:17:16 PDT 2017] /root/tesseract-src/tesseract-master/training/combine_lang_model --input_unicharset /tmp/tmp.LASR8IGnop/chi_sim/chi_sim.unicharset --script_dir ../langdata --words ../langdata/chi_sim/chi_sim.wordlist --numbers ../langdata/chi_sim/chi_sim.numbers --puncs ../langdata/chi_sim/chi_sim.punc --output_dir ../tesstutorial/chieval --lang chi_sim
Loaded unicharset of size 5074 from file /tmp/tmp.LASR8IGnop/chi_sim/chi_sim.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 106 = ，
Config file is optional, continuing...
Null char=2
Invalid format in radical table at line 0: 19886 3 23 6 3
Creation of encoded unicharset failed!!
Error writing recoder!!
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!

I used the chi_sim_vert.traineddata from the tessdata_best directory. Could that be the cause of the problems above? In #842 you said:

modify chi_sim.config file in langdata/chi_sim
and comment out first line related to loading of the vertical sub language

but I didn't modify it, and I used the chi_sim_vert.traineddata from the tessdata_best directory.

ivanzz1001 commented 7 years ago

@Shreeshrii Or should I try my own chi_sim_vert.traineddata first?

Shreeshrii commented 7 years ago

Here #842 you said:

modify chi_sim.config file in langdata/chi_sim

and comment out first line related to loading of the vertical sub language

I think at that time, there was no chi_sim_vert traineddata available.

I will try out the command at my end to see what error I get.

Shreeshrii commented 7 years ago

Invalid format in radical table at line 0: 19886 3 23 6 3

Do you have the latest version of

https://github.com/tesseract-ocr/langdata/blob/master/radical-stroke.txt

ivanzz1001 commented 7 years ago

@Shreeshrii Yes, I have downloaded the latest version

Shreeshrii commented 7 years ago

Looks like you do not have the following program.

which: no combine_lang_model in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no combine_lang_model in (./api)

Please check. This program is required for building the starter traineddata.

ivanzz1001 commented 7 years ago

I don't think that is the reason. Look at the following:

=== Constructing LSTM training data ===
which: no combine_lang_model in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no combine_lang_model in (./api)
[Tue Sep 26 21:17:16 PDT 2017] /root/tesseract-src/tesseract-master/training/combine_lang_model --input_unicharset /tmp/tmp.LASR8IGnop/chi_sim/chi_sim.unicharset --script_dir ../langdata --words ../langdata/chi_sim/chi_sim.wordlist --numbers ../langdata/chi_sim/chi_sim.numbers --puncs ../langdata/chi_sim/chi_sim.punc --output_dir ../tesstutorial/chieval --lang chi_sim

It found the program at /root/tesseract-src/tesseract-master/training/combine_lang_model. I will retry it later anyway.

Shreeshrii commented 7 years ago

works for me

Loaded 1484/1484 pages (1-1484) of document /tmp/tmp.nuCxxWRRbN/chi_sim/chi_sim.STXihei.exp0.lstmf
Page 35
Loaded 1529/1529 pages (1-1529) of document /tmp/tmp.nuCxxWRRbN/chi_sim/chi_sim.STXihei.exp0.lstmf

=== Constructing LSTM training data ===
[Wed Sep 27 15:10:45 DST 2017] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.nuCxxWRRbN/chi_sim/chi_sim.unicharset --script_dir ../langdata --words ../langdata/chi_sim/chi_sim.wordlist --numbers ../langdata/chi_sim/chi_sim.numbers --puncs ../langdata/chi_sim/chi_sim.punc --output_dir ../tesstutorial/chi_sim --lang chi_sim
Loaded unicharset of size 5028 from file /tmp/tmp.nuCxxWRRbN/chi_sim/chi_sim.unicharset
Setting unichar properties
Mirror 〖 of 〗 is not in unicharset
Setting script properties
Warning: properties incomplete for index 333 = ，
Config file is optional, continuing...
Null char=2
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.nuCxxWRRbN/chi_sim/chi_sim.STXihei.exp0.lstmf to ../tesstutorial/chi_sim

Completed training for language 'chi_sim'

Have you changed the training_text?

It could be related to the whitespace code change...

ivanzz1001 commented 7 years ago

I didn't change anything in the langdata directory. My command is:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --exposures "0" \
  --fontlist "AR PL UKai CN" \
  "AR PL UKai HK" \
  "AR PL UKai TW" \
  "AR PL UKai TW MBE" \
  "AR PL UMing CN Semi-Light" \
  "AR PL UMing HK Semi-Light" \
  "AR PL UMing TW MBE Semi-Light" \
  "AR PL UMing TW Semi-Light" \
  "Arial Unicode MS" \
  "FangSong" \
  "KaiTi" \
  "LiSu" \
  "Microsoft YaHei" \
  "Microsoft YaHei Bold" \
  "NSimSun" \
  "Noto Sans SC" \
  "Noto Sans SC Bold" \
  "Noto Sans SC Heavy" \
  "Noto Sans SC Medium" \
  "Noto Sans SC Semi-Light" \
  "STFangsong" \
  "STKaiti" \
  "STSong" \
  "STXihei" \
  "STXinwei" \
  "STZhongsong" \
  "SimHei" \
  "SimSun" \
  "WenQuanYi Micro Hei" \
  "WenQuanYi Micro Hei Mono" \
  "WenQuanYi Zen Hei Medium" \
  "WenQuanYi Zen Hei Mono Medium" \
  "WenQuanYi Zen Hei Sharp Medium" \
  "YouYuan" \
  --output_dir ../tesstutorial/chieval \
  --overwrite

Shreeshrii commented 7 years ago

This is the command I used

training/tesstrain.sh \
  --fonts_dir /mnt/c/Windows/Fonts \
  --lang chi_sim \
  --noextract_font_properties --linedata_only \
  --exposures "0" \
  --langdata_dir ../langdata \
  --tessdata_dir ../tessdata \
  --fontlist \
  "STXihei" \
  --output_dir ../tesstutorial/chi_sim

Please check whether it works with just this one font.

ivanzz1001 commented 7 years ago

I used just one font, but it has the same problem:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --exposures "0" \
  --fontlist "AR PL UKai CN" \
  --output_dir ../tesstutorial/chieval \
  --overwrite

Shreeshrii commented 7 years ago

Invalid format in radical table at line 0: 19886 3 23 6 3
Creation of encoded unicharset failed!!
Error writing recoder!!
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!

Are you still getting the above error?

ivanzz1001 commented 7 years ago

Yes. I used your command:

training/tesstrain.sh \
 --fonts_dir /usr/share/fonts \
 --lang chi_sim \
 --noextract_font_properties  --linedata_only \
 --exposures "0" \
 --langdata_dir ../langdata \
 --tessdata_dir ./tessdata \
 --fontlist \
  "STXihei" \
  --output_dir ../tesstutorial/chi_sim

But I got the same problem.

=== Constructing LSTM training data ===
Creating new directory ../tesstutorial/chi_sim
which: no combine_lang_model in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no combine_lang_model in (./api)
[Wed Sep 27 19:03:32 PDT 2017] /root/tesseract-src/tesseract-master/training/combine_lang_model --input_unicharset /tmp/tmp.mv3dlQnYez/chi_sim/chi_sim.unicharset --script_dir ../langdata --words ../langdata/chi_sim/chi_sim.wordlist --numbers ../langdata/chi_sim/chi_sim.numbers --puncs ../langdata/chi_sim/chi_sim.punc --output_dir ../tesstutorial/chi_sim --lang chi_sim
Loaded unicharset of size 1923 from file /tmp/tmp.mv3dlQnYez/chi_sim/chi_sim.unicharset
Setting unichar properties
Mirror 「 of 」 is not in unicharset
Mirror { of } is not in unicharset
Mirror 〗 of 〖 is not in unicharset
Other case Z of z is not in unicharset
Setting script properties
Warning: properties incomplete for index 106 = ，
Config file is optional, continuing...
Null char=2
Invalid format in radical table at line 0: 19886 3 23 6 3
Creation of encoded unicharset failed!!
Error writing recoder!!
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
Moving /tmp/tmp.mv3dlQnYez/chi_sim/chi_sim.STXihei.exp0.lstmf to ../tesstutorial/chi_sim

Completed training for language 'chi_sim'

Which version of tessdata did you use for "--tessdata_dir ./tessdata"? The latest tessdata_best?

Shreeshrii commented 7 years ago

Something seems wrong with your radical-stroke file from langdata. Please download it again and retry.

ivanzz1001 commented 7 years ago

I have got the latest langdata, but it has the same problem. Could you send me your tesstrain.log?

ivanzz1001 commented 7 years ago

@Shreeshrii I tried training eng, and it has the same problem:

 [root@localhost tesseract-master]# training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --exposures "0" \
>   --fontlist "DejaVu Serif" \
>   --tessdata_dir ../tessdata --output_dir ../tesstutorial/engeval

=== Starting training for language 'eng'
which: no text2image in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no text2image in (./api)
[Wed Sep 27 20:25:03 PDT 2017] /root/tesseract-src/tesseract-master/training/text2image --fonts_dir=/usr/share/fonts --font=DejaVu Serif --outputbase=/tmp/font_tmp.mopgCYHqsF/sample_text.txt --text=/tmp/font_tmp.mopgCYHqsF/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.mopgCYHqsF
Rendered page 0 to file /tmp/font_tmp.mopgCYHqsF/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using DejaVu Serif
which: no text2image in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no text2image in (./api)
[Wed Sep 27 20:25:09 PDT 2017] /root/tesseract-src/tesseract-master/training/text2image --fontconfig_tmpdir=/tmp/font_tmp.mopgCYHqsF --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.nqgcR2lnuC/eng/eng.DejaVu_Serif.exp0 --max_pages=3 --font=DejaVu Serif --text=../langdata/eng/eng.training_text
Rendered page 0 to file /tmp/tmp.nqgcR2lnuC/eng/eng.DejaVu_Serif.exp0.tif
Rendered page 1 to file /tmp/tmp.nqgcR2lnuC/eng/eng.DejaVu_Serif.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
which: no unicharset_extractor in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no unicharset_extractor in (./api)
[Wed Sep 27 20:25:10 PDT 2017] /root/tesseract-src/tesseract-master/training/unicharset_extractor --output_unicharset /tmp/tmp.nqgcR2lnuC/eng/eng.unicharset --norm_mode 1 /tmp/tmp.nqgcR2lnuC/eng/eng.DejaVu_Serif.exp0.box
Extracting unicharset from box file /tmp/tmp.nqgcR2lnuC/eng/eng.DejaVu_Serif.exp0.box
Other case É of é is not in unicharset
Wrote unicharset file /tmp/tmp.nqgcR2lnuC/eng/eng.unicharset
which: no set_unicharset_properties in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no set_unicharset_properties in (./api)
[Wed Sep 27 20:25:10 PDT 2017] /root/tesseract-src/tesseract-master/training/set_unicharset_properties -U /tmp/tmp.nqgcR2lnuC/eng/eng.unicharset -O /tmp/tmp.nqgcR2lnuC/eng/eng.unicharset -X /tmp/tmp.nqgcR2lnuC/eng/eng.xheights --script_dir=../langdata
Loaded unicharset of size 111 from file /tmp/tmp.nqgcR2lnuC/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 25 = ~
Writing unicharset to file /tmp/tmp.nqgcR2lnuC/eng/eng.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata
which: no tesseract in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
[Wed Sep 27 20:25:10 PDT 2017] /root/tesseract-src/tesseract-master/api/tesseract /tmp/tmp.nqgcR2lnuC/eng/eng.DejaVu_Serif.exp0.tif /tmp/tmp.nqgcR2lnuC/eng/eng.DejaVu_Serif.exp0 lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Page 2
Loaded 51/51 pages (1-51) of document /tmp/tmp.nqgcR2lnuC/eng/eng.DejaVu_Serif.exp0.lstmf

=== Constructing LSTM training data ===
which: no combine_lang_model in (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
which: no combine_lang_model in (./api)
[Wed Sep 27 20:25:12 PDT 2017] /root/tesseract-src/tesseract-master/training/combine_lang_model --input_unicharset /tmp/tmp.nqgcR2lnuC/eng/eng.unicharset --script_dir ../langdata --words ../langdata/eng/eng.wordlist --numbers ../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc --output_dir ../tesstutorial/engeval --lang eng
Loaded unicharset of size 111 from file /tmp/tmp.nqgcR2lnuC/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Config file is optional, continuing...
Failed to read data from: ../langdata/eng/eng.config
Null char=2
Invalid format in radical table at line 0: 19886 3 23 6 3
Creation of encoded unicharset failed!!
Error writing recoder!!
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
Moving /tmp/tmp.nqgcR2lnuC/eng/eng.DejaVu_Serif.exp0.lstmf to ../tesstutorial/engeval

Completed training for language 'eng'

Which version of Tesseract do you use?

Shreeshrii commented 7 years ago

The error you are getting is related to the following:

https://github.com/tesseract-ocr/tesseract/blob/a2a72d7ca78a3bb3798a02a2ba5188e255c2a0f7/ccutil/unicharcompress.cpp#L79

https://github.com/tesseract-ocr/langdata/blob/master/radical-stroke.txt

The first line in radical-stroke.txt is 19886 3 23 6 3 and your error line says

Invalid format in radical table at line 0: 19886 3 23 6 3

So, there is a mismatch between the program that you are using and the data.

I am using the latest version of code from github ...

tesseract -v
tesseract 4.00.00alpha
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

 Found AVX
 Found SSE

Please check whether you have multiple versions/old versions of the program.

amitdo commented 7 years ago

Maybe you have two Tesseract installations on your system, one of them older than the other.

ivanzz1001 commented 7 years ago

I checked that I have only one version. I also retried the process on a new system and hit the same problem.

amitdo commented 7 years ago

Strangely, tesseract parses 'radical-stroke.txt' for every language. https://github.com/tesseract-ocr/tesseract/blob/a2a72d7ca78a3bb3798a02a2ba5188e255c2a0f7/ccutil/unicharcompress.cpp#L98

amitdo commented 7 years ago

I checked that I have only one version.

How?

Try this: sudo find / -type f -name "libtesseract.so*"

I wonder why you put the git repo on /root.
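The `which: no ... in (...)` lines in the logs above show that the training tools are not on PATH and are being picked up from the build tree instead. As a complement to the `find` command, here is a minimal sketch that lists every PATH location of each tool, so a second or stale installation stands out. The helper name `list_tool_locations` is made up for illustration; the tool names are the ones that appear in the logs.

```shell
#!/bin/sh
# Report every location on PATH where a given tool exists, instead of
# stopping at the first match the way a plain lookup does.
list_tool_locations() {
    tool=$1
    found=0
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # Executable regular file (not a directory that happens to match).
        if [ -x "$dir/$tool" ] && [ ! -d "$dir/$tool" ]; then
            echo "$tool: $dir/$tool"
            found=1
        fi
    done
    IFS=$old_ifs
    [ "$found" -eq 1 ] || echo "$tool: not found on PATH"
}

# Tool names taken from the training log.
for t in tesseract text2image unicharset_extractor combine_lang_model; do
    list_tool_locations "$t"
done
```

More than one line for the same tool would suggest that two installations are present; no line at all matches the situation in the log, where only the binaries inside the source tree are used.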

ivanzz1001 commented 7 years ago

@Shreeshrii @amitdo I cloned langdata with git directly on Linux, and now it seems OK. The strange problem may have been caused by how I copied the files: I had cloned langdata on Windows and then used WinSCP to transfer it to Linux. (I am sure that copy was the latest langdata, because I re-cloned it when @Shreeshrii asked me to check.)

Shreeshrii commented 7 years ago

OK, it might be related to Windows EOL vs Unix EOL. Maybe some WinSCP setting changes it.
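If line endings are the cause, a quick check is to look for carriage returns in the copied data file. The following is a minimal sketch, assuming the problem is CRLF endings as suggested above; the helper names `count_crlf_lines` and `to_unix_eol` are invented for this example, and the demo runs on a throwaway file that mimics the first line of radical-stroke.txt rather than on the real file.

```shell
#!/bin/sh
# Count lines that end in a carriage return (Windows CRLF endings).
count_crlf_lines() {
    cr=$(printf '\r')
    grep -c "${cr}\$" "$1"
}

# Rewrite a file in place without carriage returns.
# Note: this removes every CR in the file, not only those at line ends.
to_unix_eol() {
    tmp=$(mktemp) || return 1
    if tr -d '\r' < "$1" > "$tmp"; then
        cat "$tmp" > "$1"
    fi
    rm -f "$tmp"
}

# Demo on a temporary file with a CRLF-terminated line.
demo=$(mktemp)
printf '19886 3 23 6 3\r\n' > "$demo"
echo "CRLF lines before: $(count_crlf_lines "$demo")"
to_unix_eol "$demo"
echo "CRLF lines after: $(count_crlf_lines "$demo")"
rm -f "$demo"
```

Running `count_crlf_lines` on the real ../langdata/radical-stroke.txt would show whether the transferred copy was affected; re-cloning on Linux, as was done here, avoids the conversion entirely.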

amitdo commented 7 years ago

https://help.github.com/articles/dealing-with-line-endings/

ivanzz1001 commented 7 years ago

@Shreeshrii When I use the following:

mkdir -p ~/tesstutorial/engoutput
training/lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

It needs both "engtrain" and "engeval". What is the difference between the two? The commands used to generate them look almost the same, except for the --fontlist option:

# engtrain
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

# eval data for the 'Impact' font:
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval

The "engtrain" command will probably train on all the fonts in /usr/share/fonts together with the fonts listed in training/language-specific.sh:

CHI_SIM_FONTS=( \
    "AR PL UKai CN" \
    "AR PL UMing Patched Light" \
    "Arial Unicode MS" \
    "Arial Unicode MS Bold" \
    "WenQuanYi Zen Hei Medium" \
    )

Some of the CHI_SIM_FONTS have not been installed here, which causes errors. Do I have to install all of the CHI_SIM_FONTS? I also want to know why we need both "engtrain" and "engeval" ("chi_simtrain" and "chi_simeval").

Shreeshrii commented 7 years ago

engtrain - lists all LSTMF files that will be used for doing LSTM training
engeval - lists any LSTMF files that will be used for doing OCR evaluation while training

Usually, your fontlist for training will be larger than the one used for eval.
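To make the distinction concrete, here is a small sketch that inspects the two list files passed to --train_listfile and --eval_listfile. It counts the .lstmf entries in each and reports any file present in both. Checking for overlap is an extra sanity check added for illustration, not something the thread requires; the helper names and the demo file names are hypothetical.

```shell
#!/bin/sh
# Print how many .lstmf entries a list file contains.
count_lstmf() {
    grep -c '\.lstmf$' "$1"
}

# Print entries of the second list that also appear in the first
# (exact, whole-line matches).
shared_entries() {
    grep -Fx -f "$1" "$2"
}

# Demo with throwaway list files; the paths inside are made up.
work=$(mktemp -d)
printf '%s\n' \
    'engtrain/eng.Arial.exp0.lstmf' \
    'engtrain/eng.DejaVu_Serif.exp0.lstmf' > "$work/train.txt"
printf '%s\n' 'engeval/eng.Impact_Condensed.exp0.lstmf' > "$work/eval.txt"

echo "train entries: $(count_lstmf "$work/train.txt")"
echo "eval entries:  $(count_lstmf "$work/eval.txt")"
if shared_entries "$work/train.txt" "$work/eval.txt" > /dev/null; then
    echo "warning: some files appear in both lists"
else
    echo "no file appears in both lists"
fi
rm -rf "$work"
```

In a real run the two arguments would be the eng.training_files.txt files under the engtrain and engeval output directories.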

ivanzz1001 commented 7 years ago

Do I have to install all the fonts in CHI_SIM_FONTS (training/language-specific.sh)?

CHI_SIM_FONTS=( \
    "AR PL UKai CN" \
    "AR PL UMing Patched Light" \
    "Arial Unicode MS" \
    "Arial Unicode MS Bold" \
    "WenQuanYi Zen Hei Medium" \
    )

Shreeshrii commented 7 years ago

Do I have to install all the fonts in CHI_SIM_FONTS (training/language-specific.sh)?

No, you can use whichever fonts you want to train on. You can give multiple fonts as part of the command with --fontlist.
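Before passing names to --fontlist, it can help to see which of the wanted fonts are actually installed. The sketch below compares two plain text files with one font name per line. How the "available" list is produced is an assumption left to the reader (for example, from fontconfig's fc-list, whose output may need normalizing first); the helper name `missing_fonts` and the demo contents are illustrative only.

```shell
#!/bin/sh
# Print each wanted font (one per line in $1) that is absent from the
# list of available fonts (one per line in $2). Matching is exact.
missing_fonts() {
    while IFS= read -r font; do
        [ -n "$font" ] || continue
        grep -Fxq -- "$font" "$2" || echo "$font"
    done < "$1"
}

# Demo: wanted fonts are the CHI_SIM_FONTS entries quoted in this thread;
# the available list is invented for the example.
work=$(mktemp -d)
printf '%s\n' \
    'AR PL UKai CN' \
    'AR PL UMing Patched Light' \
    'Arial Unicode MS' \
    'Arial Unicode MS Bold' \
    'WenQuanYi Zen Hei Medium' > "$work/wanted.txt"
printf '%s\n' \
    'AR PL UKai CN' \
    'DejaVu Serif' \
    'WenQuanYi Zen Hei Medium' > "$work/available.txt"

echo "Fonts to install or drop from --fontlist:"
missing_fonts "$work/wanted.txt" "$work/available.txt"
rm -rf "$work"
```

Any font reported here can either be installed or simply left out of the --fontlist argument.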

ivanzz1001 commented 7 years ago

@Shreeshrii @amitdo When I train chi_sim, how should I set the net_spec flag?

--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'

What does it mean?

I also ran the following command:

training/combine_tessdata -d tessdata/chi_sim.traineddata 
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
17:lstm:size=12152851, offset=2158
18:lstm-punc-dawg:size=282, offset=12155009
19:lstm-word-dawg:size=590634, offset=12155291
20:lstm-number-dawg:size=82, offset=12745925
21:lstm-unicharset:size=258834, offset=12746007
22:lstm-recoder:size=72494, offset=13004841
23:version:size=84, offset=13077335

It seems that the net_spec is [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]. Here the last value is "O1c1" (not "O1c111")? It seems that the last two "1"s represent the network depth, so if I set it to O1c111, the depth is 11. Is that too deep?

Is it OK to ask this question here, or is there a more appropriate place?

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs

Is it OK to ask this question here, or is there a more appropriate place?

The right place to ask this kind of question is our forum: https://groups.google.com/d/forum/tesseract-ocr

Shreeshrii commented 7 years ago

It seems that the net_spec is [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]. Here the last value is "O1c1" (not "O1c111")?

That last number is a dummy. It is overridden by the number of characters in the unicharset.
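To look at this value without reading the whole listing, the output layer can be pulled from the "Version string:" line that combine_tessdata -d prints. This is a minimal sketch using sed. The helper names `extract_net_spec` and `output_layer` are made up here, and the pattern for the output layer assumes the O(2|1|0)(l|s|c)n form described on the VGSLSpecs wiki page.

```shell
#!/bin/sh
# Read a "Version string:" line on stdin and print the bracketed net spec.
extract_net_spec() {
    sed -n 's/^Version string:.*\(\[.*]\)$/\1/p'
}

# Read a net spec on stdin and print its final output layer, e.g. O1c1.
output_layer() {
    sed -n 's/.*\(O[0-2][lsc][0-9]*\)]$/\1/p'
}

# The version line quoted earlier in this thread.
line='Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]'
spec=$(printf '%s\n' "$line" | extract_net_spec)
echo "net spec:     $spec"
echo "output layer: $(printf '%s\n' "$spec" | output_layer)"
```

The digits after "O1c" are the class count, which, per the answer above, is a placeholder that gets replaced from the unicharset, so O1c1 and O1c111 describe the same layer type.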

Shreeshrii commented 6 years ago

@zdenop This issue can be closed.