General issue with multi-byte character handling

GoogleCodeExporter commented 9 years ago

I hit an isspace assertion while OCRing a test file, and traced it to the 
following code in the InitializeRowInfo() function in ccmain/paragraphs.cpp:

  info->text = "";
  char *text = it.GetUTF8Text(RIL_TEXTLINE);
  int trailing_ws_idx = strlen(text);  // strip trailing space
  while (trailing_ws_idx > 0 &&
         text[trailing_ws_idx - 1] < 128 &&   // isspace() only takes ASCII
         isspace(text[trailing_ws_idx - 1]))
    trailing_ws_idx--;

This might only be noticeable on Windows when running a debug build, since 
that's when isspace throws an assert error. The problem is that it's entirely 
possible (I'd even say common) for a line to end with the bytes 
0xe2 0x80 0x9d 0x0a. According to [1], that is the UTF-8 encoding of U+201D 
(RIGHT DOUBLE QUOTATION MARK), followed by a linefeed character.
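To make the failure mode concrete, here is a minimal sketch of a trailing-whitespace strip that stays safe on those bytes. TrimmedLength is a hypothetical helper name, not anything in Tesseract, and the sketch only assumes that the text is a NUL-terminated UTF-8 buffer like the one used above. The one change that matters is converting each byte to unsigned char before comparing it or passing it to isspace:

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of Tesseract): returns the length of `text`
// once trailing ASCII whitespace has been stripped.
std::size_t TrimmedLength(const char *text) {
  std::size_t len = std::strlen(text);
  while (len > 0) {
    // Convert to unsigned char first, so bytes >= 0x80 become 128..255
    // instead of negative values on platforms where char is signed.
    const unsigned char c = static_cast<unsigned char>(text[len - 1]);
    // Stop at the first byte that is non-ASCII or not whitespace.
    if (c >= 128 || !std::isspace(c)) break;
    --len;
  }
  return len;
}
```

With this version the linefeed is stripped and the scan stops at 0x9d, the last byte of the quotation mark, without ever handing isspace a negative value.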

I believe it's *impossible* to correctly walk a UTF-8 encoded string 
*backwards* like this?

As a general technique, when reading text from "outside" you should decode it 
to Unicode as early as possible and do all your string manipulations on 
Unicode strings. You should only encode to UTF-8 right before you write the 
text back out to the outside world.
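As a rough illustration of that approach, the sketch below decodes UTF-8 into code points and trims whitespace on the decoded string. DecodeUtf8 and TrimTrailingSpace are hypothetical names I made up for this example. The decoder replaces structurally broken sequences with U+FFFD but does not reject overlong encodings or surrogates, so treat it as a sketch rather than a validating decoder:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Malformed or truncated sequences are replaced with U+FFFD.
std::u32string DecodeUtf8(const std::string &in) {
  const char32_t kReplacement = 0xFFFD;
  std::u32string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    char32_t cp = 0;
    std::size_t extra = 0;  // number of continuation bytes expected
    if (lead < 0x80) { cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else { out.push_back(kReplacement); ++i; continue; }  // stray byte
    ++i;
    bool ok = true;
    for (std::size_t k = 0; k < extra; ++k) {
      // Every continuation byte must look like 10xxxxxx.
      if (i >= in.size() ||
          (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
        ok = false;
        break;
      }
      cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
      ++i;
    }
    out.push_back(ok ? cp : kReplacement);
  }
  return out;
}

// Hypothetical helper: strips trailing ASCII whitespace from decoded text.
// Other Unicode space characters would need a proper property lookup.
std::u32string TrimTrailingSpace(std::u32string s) {
  while (!s.empty() && (s.back() == U' ' || s.back() == U'\t' ||
                        s.back() == U'\n' || s.back() == U'\r')) {
    s.pop_back();
  }
  return s;
}
```

Once the text is a sequence of code points, stripping from the end is a plain per-element loop, and the quotation mark is a single element that can never be mistaken for whitespace.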

If this code is symptomatic of how tesseract handles strings internally, then a 
lot of work still needs to be done :(

[1] http://www.fileformat.info/info/unicode/char/201d/index.htm

Original issue reported on code.google.com by tomp2...@gmail.com on 8 Mar 2012 at 9:26

GoogleCodeExporter commented 9 years ago

Images which fail to OCR because of this issue:  leptonica's sample images: 
feyn.tif, keystone.png, lucasta-47.png, pageseg1.tif, pageseg2.tif, 
pageseg3.tif, pageseg4.tif, patent.png, rabi.png, and witten.tif (all in the 
progs directory).

Original comment by tomp2...@gmail.com on 17 Mar 2012 at 10:00

GoogleCodeExporter commented 9 years ago

Fixed in r706.

The code skips non-ASCII characters -- but I'd never tested it with your 
compiler, sorry.  Ours treats char as unsigned.
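For anyone wondering why the compiler matters here, this is a small sketch of the difference. The function names are made up for illustration. The `< 128` test only filters out non-ASCII bytes when char is unsigned, whereas a cast to unsigned char behaves the same either way:

```cpp
#include <limits>

// Hypothetical illustration of the original check. With unsigned char the
// byte 0x9d is 157 and fails the test; with signed char it is negative and
// passes, so it would go on to reach isspace().
bool PassesNaiveAsciiCheck(char c) { return c < 128; }

// The same check after an explicit cast, independent of char signedness.
bool PassesCastAsciiCheck(char c) {
  return static_cast<unsigned char>(c) < 128;
}

// Reports how the current compiler treats plain char.
bool CharIsSigned() { return std::numeric_limits<char>::is_signed; }
```

On a compiler where CharIsSigned() returns true, PassesNaiveAsciiCheck('\x9d') also returns true, which is the case the original code did not anticipate.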

Original comment by david.e...@gmail.com on 20 Mar 2012 at 8:20

Changed state: Fixed

mithilesh1125 / tesseract-ocr

General issue with multi-byte character handling #645