jcallinan / tesseractdotnet

Automatically exported from code.google.com/p/tesseractdotnet

Problem with french characters #7

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Use the application with French text.

What is the expected output? What do you see instead?
Special characters such as é, è and à are not recognized correctly.

What version of the product are you using? On what operating system?
Latest version, on Windows.

Please provide any additional information below.
The bug can be fixed in tesseractenginewrapper.cpp as follows:

static wchar_t *make_unicode_string(const char *utf8)
{
  int size = 0, out_index = 0;
  wchar_t *out;

  /* first calculate the size of the target string */
  int used = 0;
  int utf8_len = strlen(utf8);
  while (used < utf8_len) {
    int step = UNICHAR::utf8_step(utf8 + used);
    if (step == 0)
      break;
    used += step;
    ++size;
  }

  out = (wchar_t *) malloc((size + 1) * sizeof(wchar_t));
  if (out == NULL)
      return NULL;

  /* now convert to Unicode */
  used = 0;
  while (used < utf8_len) {
    int step = UNICHAR::utf8_step(utf8 + used);
    if (step == 0)
      break;
    UNICHAR ch(utf8 + used, step);
    out[out_index++] = ch.first_uni();
    used += step;
  }
  out[out_index] = 0;

  return out;
}

System::Collections::Generic::List<Word*>* 
TesseractProcessor::RetriveResultDetail()
{
    if (!_doMonitor || _monitorInstance == null)
        return null;

    System::Collections::Generic::List<Word*>* wordList = null;

    ETEXT_DESC* monitor = null;
    ETEXT_DESC* head = null;
    Word* currentWord = null;

    try
    {
        monitor = (ETEXT_DESC*)_monitorInstance.ToPointer();
        head = &monitor[1];

        int lineIdx = 0;
        int nChars = head->count;
        int i = 0;
        int j;
        while (i < nChars)
        {
            EANYCODE_CHAR* ch = &(head + i)->text[0];

            if (ch->blanks > 0)
            {   /* new-word condition met */
                if (currentWord != null)
                    wordList = currentWord->UpdateConfidenceAndInsertTo(wordList);

                currentWord = null; // reset current word
            }

            if (currentWord != null && 
                (ch->left <= currentWord->Left || ch->top >= currentWord->Bottom))              
            {   /* new-line condition met */
                wordList = currentWord->UpdateConfidenceAndInsertTo(wordList);

                lineIdx++;

                currentWord = null; // reset current word
            }

            if (currentWord == null)
            {   /*create new word*/
                currentWord = new Word();

                currentWord->LineIndex = lineIdx;

                currentWord->FontIndex = ch->font_index;
                currentWord->PointSize = ch->point_size;
                currentWord->Formating = ch->formatting;
            }

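            unsigned char unistr[24];

            /* collect the UTF-8 bytes that share this character's bounding box,
               leaving room for the terminating '\0' */
            for (j = i; j < nChars && (j - i) < (int)sizeof(unistr) - 1; j++)
            {
                const EANYCODE_CHAR* unich = &(head + j)->text[0];
                if (ch->left != unich->left || ch->right != unich->right ||
                    ch->top != unich->top || ch->bottom != unich->bottom)
                    break;
                unistr[j - i] = static_cast<unsigned char>(unich->char_code);
            }
            unistr[j - i] = '\0';
            wchar_t *utf16ch=make_unicode_string(reinterpret_cast<const char*>(unistr));
            if (utf16ch == NULL)
                throw new System::OutOfMemoryException();

            /* note: the cast to char keeps only the low byte of the code point */
            Character* c = new Character(
                static_cast<char>(*utf16ch),
                ch->confidence,
                ch->left, ch->top, ch->right, ch->bottom);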

            /* update current word */
            currentWord->CharList->Add(c);

            System::String* sc = new String(*utf16ch, 1);
            currentWord->Text = System::String::Format(
                "{0}{1}", currentWord->Text->ToString(), sc);

            free(utf16ch);

            currentWord->Left = Math::Min(currentWord->Left, (int)ch->left);
            currentWord->Top = Math::Min(currentWord->Top, (int)ch->top);
            currentWord->Right = Math::Max(currentWord->Right, (int)ch->right);
            currentWord->Bottom = Math::Max(currentWord->Bottom, (int)ch->bottom);

            currentWord->Confidence += ch->confidence;

            i=j; /*go to next char*/
        } /* end while */

        if (currentWord != null)
            wordList = currentWord->UpdateConfidenceAndInsertTo(wordList);
    }
    catch (System::Exception*)
    {
        throw; /* rethrow without resetting the stack trace */
    }
    __finally
    {
        currentWord = null;
        head = null;
        monitor = null;
    }

    return wordList;
}
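
For readers without the Tesseract sources at hand, here is a minimal standalone sketch of the same two ideas used by make_unicode_string: find the length of each UTF-8 sequence from its lead byte, then combine the payload bits into one code point. It is plain standard C++, not the wrapper's code. The names utf8_sequence_length and decode_utf8 are hypothetical, and the sketch does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence (continuation or invalid byte).
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Decode UTF-8 into Unicode code points. Decoding stops at the first
// malformed or truncated sequence and returns what was decoded so far.
std::vector<char32_t> decode_utf8(const std::string& utf8)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const int len = utf8_sequence_length(lead);
        if (len == 0 || i + static_cast<std::size_t>(len) > utf8.size())
            break;
        assert(len >= 1 && len <= 4);

        // The lead byte carries 7, 5, 4 or 3 payload bits depending on len.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        bool ok = true;
        for (int k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80) { ok = false; break; }  // must be 10xxxxxx
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (!ok)
            break;

        out.push_back(cp);
        i += static_cast<std::size_t>(len);
    }
    return out;
}
```

With this sketch, the two bytes 0xC3 0xA9 decode to the single code point U+00E9 (é), which is why the bytes of one character have to be gathered before conversion instead of being converted one at a time.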

Original issue reported on code.google.com by Domdo...@gmail.com on 25 May 2011 at 4:11

GoogleCodeExporter commented 8 years ago
It's better to use API conversion functions, if possible, as follows:

/**
 *  Converts UTF-8 to Unicode.
 *
 * @param str     Source string in UTF-8 encoding
 * @return        Unicode string
 */
public String^ ConvertUTF8(String^ str)
{ 
    array<Byte>^ aBytes = Encoding::Default->GetBytes(str);
    return Encoding::UTF8->GetString(aBytes);
}
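
To illustrate what the round trip above relies on, here is a small standalone sketch in standard C++ (not .NET). It assumes the original string was built by widening each UTF-8 byte into its own character, Latin-1 style, so that the byte value equals the code point. Under that assumption the original bytes can be recovered and then decoded as UTF-8. The names widen_bytes and narrow_to_bytes are hypothetical, and a real system code page may not map every byte reversibly.

```cpp
#include <cassert>
#include <string>

// Treat each byte as its own character (byte value == code point).
// This is the step that turns UTF-8 "é" (0xC3 0xA9) into the two
// characters U+00C3 U+00A9.
std::u16string widen_bytes(const std::string& bytes)
{
    std::u16string out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<char16_t>(b));
    return out;
}

// Reverse of widen_bytes. Precondition: every unit is below 0x100,
// i.e. the text really was produced by byte-wise widening.
std::string narrow_to_bytes(const std::u16string& text)
{
    std::string out;
    for (char16_t u : text) {
        assert(u <= 0xFF);
        out.push_back(static_cast<char>(static_cast<unsigned char>(u)));
    }
    return out;
}
```

If the string was not produced this way, for example because the bytes were already truncated or converted by another path, the recovered bytes are not valid UTF-8 and the round trip cannot restore the accented characters.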

Original comment by nguyen...@gmail.com on 29 May 2011 at 3:25

GoogleCodeExporter commented 8 years ago
Hello,
I tried that, but it did not return the correct values.
That is why I took the conversion code from other code related to
the Tesseract project.

Original comment by Domdo...@gmail.com on 29 May 2011 at 3:29

GoogleCodeExporter commented 8 years ago
It's probably because of the terminating null character. A .NET string does not 
need one, I think.

Original comment by nguyen...@gmail.com on 29 May 2011 at 6:21

GoogleCodeExporter commented 8 years ago
I used tesseract3.dll in my web service application. When I publish the website, 
an error appears: "The specified module could not be found (Exception from HRESULT 
0x8007007E)"
Please help me resolve this problem.

Original comment by phamphih...@gmail.com on 2 Jun 2011 at 3:17

GoogleCodeExporter commented 8 years ago
Make sure that the Leptonica DLLs (lept*.dll) are in your directory.

Original comment by Domdo...@gmail.com on 2 Jun 2011 at 8:06