tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Method WordFontAttributes does not work #1074

Closed zikcheng closed 2 years ago

zikcheng commented 7 years ago

Environment

Current Behavior:

Method WordFontAttributes returns null when using tesseract 4.00.00alpha with 4.00 tessdata, but it returns the font name when using tesseract 4.00.00alpha with 3.04.00 tessdata. The test image is eurotext.tif. I first ran into this problem when using tesserocr (tesserocr#68: https://github.com/sirfz/tesserocr/issues/68).

Expected Behavior:

With method WordFontAttributes we can get correct font attributes of recognized words.

amitdo commented 7 years ago

The new LSTM engine does not support this feature and probably won't support it any time soon.

phildrip commented 7 years ago

Is there an alternative way to get font sizing etc? Do you mean that just this method won't be supported, or the feature in general?

amitdo commented 7 years ago

Is there an alternative way to get font sizing etc?

You can still use --oem 0 with traineddata from here: https://github.com/tesseract-ocr/tessdata. Note that the traineddata in the 'best' folder won't work with --oem 0.

amitdo commented 7 years ago

Do you mean that just this method won't be supported, or the feature in general?

I have reasons to believe that the new LSTM engine is unlikely to have a feature that includes font identification (name and properties like is_bold) in the near future.

Important note: I'm a contributor from the community, and the main developer does not always share all his plans for upcoming release(s) with the community.

phildrip commented 7 years ago

Thanks for the reply! It looks like the old ocr engine is going to be removed, though (issue #707)... And does using OcrEngineMode 0 mean the behaviour is the same as v3?

What I'm getting to is:

  1. I need to be able to extract font size information (font names aren't so useful) - is there any way at all of doing so with LSTM/v4?
  2. If I use OcrEngineMode 0 to be able to get this info, will that be removed from v4 at a later date?
  3. Is there any advantage to using v4 with OcrEngineMode 0 vs v3.05?

Thanks again for the help!

amitdo commented 7 years ago

It looks like the old ocr engine is going to be removed, though (issue #707)...

It's not known when exactly it will be removed. Until then you can still use it.

And does using OcrEngineMode 0 mean the behaviour is the same as v3?

It's basically the same as 3.05.01.

I need to be able to extract font size information (font names aren't so useful) - is there any way at all of doing so with LSTM/v4?

There is no method in the API to get font sizes for the lstm engine.

If I use OcrEngineMode 0 to be able to get this info, will that be removed from v4 at a later date?

Probably yes.

Is there any advantage to using v4 with OcrEngineMode 0 vs v3.05?

The accuracy should be the same.

amitdo commented 7 years ago

The relative font size of a textline can be estimated by calculating the xheight of the line and comparing it to the median xheight of the other textlines on the page.
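
A minimal sketch of that estimate, assuming the per-line x-heights (in pixels) have already been collected by other means. `MedianXHeight` and `RelativeFontSizes` are hypothetical helpers, not part of the Tesseract API, and for simplicity the median here is taken over all lines on the page, including the line being measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty list of x-heights (pixels). Takes the vector by value
// because std::nth_element reorders its input.
double MedianXHeight(std::vector<double> x_heights) {
  assert(!x_heights.empty());
  const std::size_t mid = x_heights.size() / 2;
  std::nth_element(x_heights.begin(), x_heights.begin() + mid, x_heights.end());
  const double upper = x_heights[mid];
  if (x_heights.size() % 2 == 1) return upper;
  // Even count: average the two middle values.
  const double lower =
      *std::max_element(x_heights.begin(), x_heights.begin() + mid);
  return (lower + upper) / 2.0;
}

// Hypothetical helper: ratio of each line's x-height to the page median.
// Returns an empty vector if there are no lines or the median is not positive.
std::vector<double> RelativeFontSizes(const std::vector<double>& x_heights) {
  std::vector<double> ratios;
  if (x_heights.empty()) return ratios;
  const double median = MedianXHeight(x_heights);
  if (median <= 0.0) return ratios;  // Avoid dividing by zero.
  ratios.reserve(x_heights.size());
  for (double h : x_heights) ratios.push_back(h / median);
  return ratios;
}
```

A ratio near 1.0 would suggest body text and a noticeably larger ratio a heading, though any threshold is a judgment call for the caller.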

phildrip commented 7 years ago

Ok, thanks for the info :+1:

amitdo commented 7 years ago

@phildrip,

I looked at the relevant code again, and I think the font size functionality (but not font name and properties like is_bold) can be restored when using the lstm engine.

I will provide further details (and probably send a PR) in the upcoming days.

phildrip commented 7 years ago

That's great news, thanks!

amitdo commented 7 years ago

// Returns the font attributes of the current word. If iterating at a higher
// level object than words, eg textlines, then this will return the
// attributes of the first word in that textline.
// The actual return value is a string representing a font name. It points
// to an internal table and SHOULD NOT BE DELETED. Lifespan is the same as
// the iterator itself, ie rendered invalid by various members of
// TessBaseAPI, including Init, SetImage, End or deleting the TessBaseAPI.
// Pointsize is returned in printers points (1/72 inch.)
const char* LTRResultIterator::WordFontAttributes(bool* is_bold,
                                                  bool* is_italic,
                                                  bool* is_underlined,
                                                  bool* is_monospace,
                                                  bool* is_serif,
                                                  bool* is_smallcaps,
                                                  int* pointsize,
                                                  int* font_id) const {
  if (it_->word() == NULL) return NULL;  // Already at the end!
  if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
  }
  const FontInfo& font_info = *it_->word()->fontinfo;
  *font_id = font_info.universal_id;
  *is_bold = font_info.is_bold();
  *is_italic = font_info.is_italic();
  *is_underlined = false;  // TODO(rays) fix this!
  *is_monospace = font_info.is_fixed_pitch();
  *is_serif = font_info.is_serif();
  *is_smallcaps = it_->word()->small_caps;
  float row_height = it_->row()->row->x_height() +
      it_->row()->row->ascenders() - it_->row()->row->descenders();
  // Convert from pixels to printers points.
  *pointsize = scaled_yres_ > 0
      ? static_cast<int>(row_height * kPointsPerInch / scaled_yres_ + 0.5)
      : 0;

  return font_info.name;
}

The problem:

if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
}

With the LSTM engine the it_->word()->fontinfo will always be NULL. So pointsize has no chance to be calculated.

pointsize is calculated based on row (=line) height. pointsize is the font size in points of the line, so it should not be in WordFontAttributes().

There is another function where you can get row height.

void LTRResultIterator::RowAttributes(float* row_height, float* descenders,
                                      float* ascenders) const {
  *row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() -
                it_->row()->row->descenders();
  *descenders = it_->row()->row->descenders();
  *ascenders = it_->row()->row->ascenders();
}

I think pointsize calculation should be moved into this function.
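
The conversion depends only on the row metrics and the vertical resolution, not on fontinfo. A standalone sketch of the same arithmetic, assuming kPointsPerInch is 72 (as the "1/72 inch" comment in the snippet implies); `RowPointSize` is a hypothetical name, not an existing Tesseract function.

```cpp
// Assumed value, matching the "printers points (1/72 inch)" comment.
constexpr int kPointsPerInch = 72;

// Hypothetical free function mirroring the expression in WordFontAttributes:
// row height in pixels -> printers points, rounded to the nearest integer.
// Returns 0 when the resolution is unknown (yres <= 0), as the original does.
constexpr int RowPointSize(float x_height, float ascenders, float descenders,
                           int yres) {
  return yres > 0
             ? static_cast<int>((x_height + ascenders - descenders) *
                                    kPointsPerInch / yres +
                                0.5)
             : 0;
}
```

For example, a row height of 50 pixels at 300 dpi gives 50 * 72 / 300 = 12 points.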

amitdo commented 7 years ago

@zdenop, @stweil Do you have any comment?

zdenop commented 7 years ago

At the moment I have limited internet access. If you make a pull request I can merge it ;-)

stweil commented 7 years ago

Although my current main focus is getting the text from images, there are also important use cases where text attributes matter. As I understand your comments, the new LSTM recognizer currently does not support the method WordFontAttributes, so it is not possible to get text attributes with that recognizer. Adding support for font size recognition with LSTM seems feasible, but other text attributes, such as bold or italic, are desirable too.

theraysmith commented 7 years ago

It would be feasible to add bold and italic attributes by making them a separate output from the model. Underline would also be possible. All these attributes would require changes to the rendering pipeline, and datapath for the ground truth. Fixed-pitch(monospace), serif and smallcaps would be much more difficult, due to lack of reliable data available for the fonts. It could be possible to re-use the existing fontinfo table for that. I wouldn't rule it out as impossible, but I will add this request to my list of stoppers for obsoleting the old engine. I have a bunch of updates to push, which I didn't quite get to before my office move...

stweil commented 7 years ago

Thank you for this clarification, Ray.

amitdo commented 7 years ago

Thank you for this clarification, Ray.

+1

Ray, In the meantime, can I fix the font size issue? https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-326762911

theraysmith commented 7 years ago

Yes of course. Just re-order the code in WordFontAttributes.

amitdo commented 7 years ago

Yes of course. Just re-order the code in WordFontAttributes.

That was my first thought, but it seems to give you the font size at the line level, while the name of the method implies otherwise (WordFontAttributes), so I suggested moving pointsize to the RowAttributes() method.
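
For comparison, the re-ordering option could look roughly like the sketch below. It uses stand-in types (`FontInfoStub`, `WordStub`) and a hypothetical function name so that it compiles on its own; it is not the real iterator code and not necessarily what was eventually merged. The only change from the original flow is that pointsize is computed before the fontinfo check.

```cpp
#include <cassert>

// Stand-in types for illustration only. The real method reads these values
// from it_->word() and it_->row().
struct FontInfoStub {
  const char* name;
  int universal_id;
  bool bold;
  bool italic;
};

struct WordStub {
  const FontInfoStub* fontinfo;  // Null when the engine has no font data.
  float row_height;              // x_height + ascenders - descenders, pixels.
};

// Hypothetical re-ordered variant: pointsize no longer depends on fontinfo.
const char* WordFontAttributesSketch(const WordStub& word, int yres,
                                     bool* is_bold, bool* is_italic,
                                     int* pointsize, int* font_id) {
  assert(is_bold && is_italic && pointsize && font_id);
  // Point size first: it needs only row geometry and resolution
  // (72 printers points per inch).
  *pointsize =
      yres > 0 ? static_cast<int>(word.row_height * 72 / yres + 0.5) : 0;
  if (word.fontinfo == nullptr) {
    *font_id = -1;
    return nullptr;  // No font information; bold/italic are left untouched.
  }
  *font_id = word.fontinfo->universal_id;
  *is_bold = word.fontinfo->bold;
  *is_italic = word.fontinfo->italic;
  return word.fontinfo->name;
}
```

With this ordering a caller still gets a null return and font_id of -1 when no font data exists, but pointsize is filled in either way.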

Shreeshrii commented 7 years ago

It would be feasible to add bold and italic attributes by making them a separate output from the model. Underline would also be possible.

You could also take bold/italic into account when people use multiple languages for recognition, because words in the additional language are often emphasized with bold or italics.

For an example, see the image in https://github.com/tesseract-ocr/langdata/pull/4#issuecomment-327760269 where Roman transliteration of Hindi is italicized with English text.

Shreeshrii commented 7 years ago

it seems to give you font size in the line level

While that would work in most cases, what of an extreme case of text of different size being on the same line - eg. http://www.teach-ict.com/programming/html/intro/step17a.jpg

theraysmith commented 7 years ago

That has always been a problem. The old code would often output garbage. The LSTM engine will split the line at such words and recognize them separately, pasting the results back together. It doesn't give an estimate of the x-height, though. The overall accuracy on such images is better, however.

Shreeshrii commented 7 years ago

@theraysmith Please see related issue https://github.com/tesseract-ocr/tesseract/issues/538

regarding recognition problems when an image has many different font sizes in it.

vtigranv commented 7 years ago

+1

troplin commented 6 years ago

IMO the current state of this method is not very satisfying. In version 3, it was clear that no information was available if the method returned NULL.

Now in version 4 with LSTM, the method returns NULL, but the font size is still computed. The rest of the properties currently seem to be set to true unconditionally. It's not possible to find out, if those are actually correct or just garbage.

At least the method should not change the values, if the information is not available.

amitdo commented 6 years ago

It's not possible to find out, if those are actually correct or just garbage.

What's the value of font_id?

troplin commented 6 years ago

font_id is -1. I realize that I can probably just assume that the font size is always correct and the rest only if the method returns something != NULL or if font_id != -1.

But that's just implicit knowledge and not at all clear from the signature. And going forward, if e.g. the bold property is correctly recognized too in a future version, there's no way to recognize that. I'd very much prefer an API where it is inherently clear which properties are meaningful and which aren't, without relying on implicit knowledge.
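
One way to make that explicit, sketched here with C++17 `std::optional`: each property is either present or absent, so the caller never has to infer validity from font_id. `WordFontAttributesResult` and `MakeResult` are hypothetical names and do not exist in Tesseract; the adapter simply encodes the rule described above (point size is usable when positive, everything else only when a font name was returned).

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical result type, not part of the Tesseract API.
struct WordFontAttributesResult {
  std::optional<int> pointsize;
  std::optional<std::string> font_name;
  std::optional<int> font_id;
  std::optional<bool> is_bold;
  std::optional<bool> is_italic;
};

// Hypothetical adapter over the values the current method reports.
// font_name may be null, which is what the method returns when no font
// information is available.
WordFontAttributesResult MakeResult(const char* font_name, int font_id,
                                    bool is_bold, bool is_italic,
                                    int pointsize) {
  assert(pointsize >= 0);
  WordFontAttributesResult result;
  // The method reports 0 when the resolution is unknown.
  if (pointsize > 0) result.pointsize = pointsize;
  if (font_name != nullptr) {
    result.font_name = font_name;
    result.font_id = font_id;
    result.is_bold = is_bold;
    result.is_italic = is_italic;
  }
  return result;
}
```

A caller would then test `result.is_bold.has_value()` before reading it, and a future engine that learns to detect bold could fill in that one field without changing the signature.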

amitdo commented 6 years ago

See also https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-327831585

hoangaeye commented 4 years ago

Do we have a solution for this?

amitdo commented 4 years ago

As you can see the issue is still open.

It's unknown when font name, bold and italic identification will be supported for the LSTM engine.

hoangaeye commented 4 years ago

is there another method or package that can determine font size?

amitdo commented 4 years ago

font size is supported:

https://github.com/tesseract-ocr/tesseract/pull/1173

amitdo commented 4 years ago

https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/ltrresultiterator.cpp#L164

pdiwadkar commented 3 years ago

Is this issue still open?

amitdo commented 3 years ago

Is this issue still open?

https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-576631778

shubham1206agra commented 3 years ago

Can you provide a partial solution for this, such as access to the font size only? I think that is supported. Please.

amitdo commented 3 years ago

https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-576806220

coco2121 commented 3 years ago

Hello! Is this issue still open? I need to get some font properties from scanned pdf like when text is bold or underlined. WordFontAttribute is returning None, any suggestion on what I can use to get these properties?

Thanks!

kalai2033 commented 2 years ago

@coco2121 Hi, did you manage to find a solution? I am trying to solve exactly the same problem.

amitdo commented 2 years ago

The LSTM engine does not support font attributes other than point size, and as I said 4 years ago, it won't support these attributes any time soon (It is not planned).

However, the legacy engine is still available in versions 4.x and 5.x and it supports these attributes. You need a model that includes data for the legacy engine and you need to use --oem 0 (It might also work with --oem 3, not sure).

amitdo commented 2 years ago

If you still have a question about this topic after reading my previous comment, please use our forum.

I locked this issue because people keep asking the same questions here, and I have answered them multiple times.