Closed zikcheng closed 2 years ago
The new LSTM engine does not support this feature and probably won't support it any time soon.
Is there an alternative way to get font sizing etc? Do you mean that just this method won't be supported, or the feature in general?
Is there an alternative way to get font sizing etc?
You can still use --oem 0 with traineddata from here: https://github.com/tesseract-ocr/tessdata. Note that the traineddata in the 'best' folder won't work with --oem 0.
Do you mean that just this method won't be supported, or the feature in general?
I have reasons to believe that the new LSTM engine is unlikely to have a feature that includes font identification (name and properties like is_bold) in the near future.
Important note: I'm a contributor from the community, and the main developer does not always share all his plans for upcoming release(s) with the community.
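Thanks for the reply! It looks like the old ocr engine is going to be removed, though (issue #707)... And does using OcrEngineMode 0 mean the behaviour is the same as v3?
What I'm getting to is:
I need to be able to extract font size information (font names aren't so useful) - is there any way at all of doing so with LSTM/v4?
If I use OcrEngineMode 0 to be able to get this info, will that be removed from v4 at a later date?
Is there any advantage to using v4 with OcrEngineMode 0 vs v3.05?
Thanks again for the help!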
It looks like the old ocr engine is going to be removed, though (issue #707)...
It's not known when exactly it will be removed. Until then you can still use it.
And does using OcrEngineMode 0 mean the behaviour is the same as v3?
It's basically the same as 3.05.01.
I need to be able to extract font size information (font names aren't so useful) - is there any way at all of doing so with LSTM/v4?
There is no method in the API to get font sizes for the lstm engine.
If I use OcrEngineMode 0 to be able to get this info, will that be removed from v4 at a later date?
Probably yes.
Is there any advantage to using v4 with OcrEngineMode 0 vs v3.05?
The accuracy should be the same.
The relative font size for a textline can be estimated by calculating the xheight of the line and comparing it to the median xheight of the other textlines in the page.
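A minimal sketch of that estimate, assuming the x-height of every textline on the page has already been measured in pixels by some other means. The function names MedianXHeight and RelativeLineSizes are made up for illustration and are not part of the Tesseract API. As a simplification, the median is taken over all lines on the page, including the line being compared.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty list of x-heights (in pixels).
// Takes the vector by value because nth_element reorders it.
double MedianXHeight(std::vector<double> x_heights) {
  assert(!x_heights.empty());
  const std::size_t mid = x_heights.size() / 2;
  std::nth_element(x_heights.begin(), x_heights.begin() + mid,
                   x_heights.end());
  double median = x_heights[mid];
  if (x_heights.size() % 2 == 0) {
    // Even count: average the two middle values.
    const double lower =
        *std::max_element(x_heights.begin(), x_heights.begin() + mid);
    median = (lower + median) / 2.0;
  }
  return median;
}

// Relative size of each line: its x-height divided by the page median.
// Values near 1.0 suggest body text; larger values suggest headings.
std::vector<double> RelativeLineSizes(const std::vector<double>& x_heights) {
  const double median = MedianXHeight(x_heights);
  assert(median > 0.0);
  std::vector<double> relative;
  relative.reserve(x_heights.size());
  for (double h : x_heights) relative.push_back(h / median);
  return relative;
}
```

The result is only a ratio, so it says which lines are larger or smaller than typical text on the page, not what the size is in points.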
Ok, thanks for the info :+1:
@phildrip,
I looked at the relevant code again, and I think the font size functionality (but not font name and properties like is_bold) can be restored when using the lstm engine.
I will provide further details (and probably send a PR) in the upcoming days.
That's great news, thanks!
// Returns the font attributes of the current word. If iterating at a higher
// level object than words, eg textlines, then this will return the
// attributes of the first word in that textline.
// The actual return value is a string representing a font name. It points
// to an internal table and SHOULD NOT BE DELETED. Lifespan is the same as
// the iterator itself, ie rendered invalid by various members of
// TessBaseAPI, including Init, SetImage, End or deleting the TessBaseAPI.
// Pointsize is returned in printers points (1/72 inch.)
const char* LTRResultIterator::WordFontAttributes(bool* is_bold,
                                                  bool* is_italic,
                                                  bool* is_underlined,
                                                  bool* is_monospace,
                                                  bool* is_serif,
                                                  bool* is_smallcaps,
                                                  int* pointsize,
                                                  int* font_id) const {
  if (it_->word() == NULL) return NULL;  // Already at the end!
  if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
  }
  const FontInfo& font_info = *it_->word()->fontinfo;
  *font_id = font_info.universal_id;
  *is_bold = font_info.is_bold();
  *is_italic = font_info.is_italic();
  *is_underlined = false;  // TODO(rays) fix this!
  *is_monospace = font_info.is_fixed_pitch();
  *is_serif = font_info.is_serif();
  *is_smallcaps = it_->word()->small_caps;
  float row_height = it_->row()->row->x_height() +
      it_->row()->row->ascenders() - it_->row()->row->descenders();
  // Convert from pixels to printers points.
  *pointsize = scaled_yres_ > 0
      ? static_cast<int>(row_height * kPointsPerInch / scaled_yres_ + 0.5)
      : 0;
  return font_info.name;
}
The problem:
  if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
  }
With the LSTM engine, it_->word()->fontinfo will always be NULL.
So pointsize has no chance to be calculated.
pointsize is calculated based on row (=line) height. pointsize is the font size in points of the line, so it should not be in WordFontAttributes().
There is another function where you can get row height.
void LTRResultIterator::RowAttributes(float* row_height, float* descenders,
                                      float* ascenders) const {
  *row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() -
      it_->row()->row->descenders();
  *descenders = it_->row()->row->descenders();
  *ascenders = it_->row()->row->ascenders();
}
I think pointsize calculation should be moved into this function.
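A rough, self-contained sketch of what such a row-level function could look like. RowMetrics, PointSizeFromRowHeight and RowAttributesWithPointSize are hypothetical names that stand in for the iterator internals. Only the arithmetic is taken from the code quoted above, and the negative sign of descenders is an assumption inferred from the fact that it is subtracted there.

```cpp
#include <cassert>

// Hypothetical stand-in for the row metrics used above (values in pixels).
struct RowMetrics {
  float x_height;
  float ascenders;
  float descenders;  // Assumed negative: offset below the baseline.
};

constexpr int kPointsPerInch = 72;

// Same arithmetic as the quoted code: pixels to printers points,
// rounded to nearest, or 0 when the resolution is unknown.
int PointSizeFromRowHeight(float row_height, int yres) {
  return yres > 0
             ? static_cast<int>(row_height * kPointsPerInch / yres + 0.5)
             : 0;
}

// Hypothetical row-level variant that also reports the point size.
void RowAttributesWithPointSize(const RowMetrics& row, int yres,
                                float* row_height, float* descenders,
                                float* ascenders, int* pointsize) {
  assert(row_height && descenders && ascenders && pointsize);
  *row_height = row.x_height + row.ascenders - row.descenders;
  *descenders = row.descenders;
  *ascenders = row.ascenders;
  *pointsize = PointSizeFromRowHeight(*row_height, yres);
}
```

For example, a row 40 pixels high in a 300 dpi image gives 40 * 72 / 300 = 9.6, which rounds to 10 points.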
@zdenop, @stweil Do you have any comment?
At the moment I have limited internet access. If you make a pull request I can merge it ;-)
Although my current main focus is getting the text from images, there are also use cases where text attributes are important. As I understand your comments, the new LSTM recognizer currently does not support the method WordFontAttributes, so it is not possible to get text attributes with that recognizer. Adding support for font size recognition with LSTM seems to be feasible, but other text attributes such as bold or italic are desirable, too.
It would be feasible to add bold and italic attributes by making them a separate output from the model. Underline would also be possible. All these attributes would require changes to the rendering pipeline, and datapath for the ground truth. Fixed-pitch(monospace), serif and smallcaps would be much more difficult, due to lack of reliable data available for the fonts. It could be possible to re-use the existing fontinfo table for that. I wouldn't rule it out as impossible, but I will add this request to my list of stoppers for obsoleting the old engine. I have a bunch of updates to push, which I didn't quite get to before my office move...
Thank you for this clarification, Ray.
Thank you for this clarification, Ray.
+1
Ray, In the meantime, can I fix the font size issue? https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-326762911
Yes of course. Just re-order the code in WordFontAttributes.
Yes of course. Just re-order the code in WordFontAttributes.
That was my first thought, but it seems to give you font size in the line level, while the name of the method implies otherwise (WordFontAttributes), so I suggested to move pointsize to the RowAttributes() method.
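To make the re-ordering option concrete, here is a small sketch with stand-in types. WordStub, FontInfoStub and WordFontAttributesSketch are hypothetical and only mimic the shape of the real method. The point size is computed before the missing-fontinfo check, so it survives the early return, but it is still derived from the row height, which is the line-level concern raised here.

```cpp
#include <cassert>

// Hypothetical stand-ins for the recognizer's font and word records.
struct FontInfoStub {
  const char* name;
  int universal_id;
  bool bold;
};

struct WordStub {
  const FontInfoStub* fontinfo;  // Null when the engine has no font data.
  float row_height_px;           // Height of the enclosing row, in pixels.
};

constexpr int kPointsPerInch = 72;

// Re-ordered logic: size first, then bail out on missing font data.
const char* WordFontAttributesSketch(const WordStub& word, int yres,
                                     bool* is_bold, int* pointsize,
                                     int* font_id) {
  assert(is_bold && pointsize && font_id);
  // The size depends only on row geometry, so it is available either way.
  *pointsize =
      yres > 0 ? static_cast<int>(word.row_height_px * kPointsPerInch / yres +
                                  0.5)
               : 0;
  if (word.fontinfo == nullptr) {
    *font_id = -1;
    return nullptr;  // No font information; *is_bold is left untouched.
  }
  *font_id = word.fontinfo->universal_id;
  *is_bold = word.fontinfo->bold;
  return word.fontinfo->name;
}
```

With this ordering a caller sees a null return and font_id of -1 together with a usable pointsize, which is the combination an engine without font data would produce.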
It would be feasible to add bold and italic attributes by making them a separate output from the model. Underline would also be possible.
You could also take bold/italic into account when people use multiple languages for recognition, because many times the words in the additional language may be emphasized with bold or italics.
For an example, see the image in https://github.com/tesseract-ocr/langdata/pull/4#issuecomment-327760269 where Roman transliteration of Hindi is italicized with English text.
it seems to give you font size in the line level
While that would work in most cases, what of an extreme case of text of different size being on the same line - eg. http://www.teach-ict.com/programming/html/intro/step17a.jpg
That has always been a problem. The old code would often output garbage. The LSTM engine will split the line at such words and recognize them separately, pasting the results back together. It doesn't give an estimate of the x-height though. The overall accuracy on such images is better though.
@theraysmith Please see related issue https://github.com/tesseract-ocr/tesseract/issues/538 regarding recognition problems when an image has many different font sizes in it.
+1
IMO the current state of this method is not very satisfying.
In version 3, it was clear that no information was available if the method returned NULL.
Now in version 4 with LSTM, the method returns NULL, but the font size is still computed. The rest of the properties currently seem to be set to true unconditionally.
It's not possible to find out if those are actually correct or just garbage.
At least the method should not change the values if the information is not available.
It's not possible to find out if those are actually correct or just garbage.
What's the value of font_id?
font_id is -1.
I realize that I can probably just assume that the font size is always correct and the rest only if the method returns something != NULL or if font_id != -1.
But that's just implicit knowledge and not at all clear from the signature. And going forward, if e.g. the bold property is correctly recognized too in a future version, there's no way to recognize that. I'd very much prefer an API where it is inherently clear which properties are meaningful and which aren't, without relying on implicit knowledge.
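One way to picture such an API, purely as an illustration: a result type in which every attribute is optional, so a missing value is visible in the type rather than implied by a null return. Everything below is hypothetical and requires C++17 for std::optional. None of these names exist in Tesseract.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical result type: a field holds a value only when the engine
// actually determined it.
struct WordFontAttributesResult {
  std::optional<std::string> font_name;
  std::optional<int> font_id;
  std::optional<bool> is_bold;
  std::optional<bool> is_italic;
  std::optional<int> pointsize;
};

// What an engine that knows only the size might return under this scheme.
WordFontAttributesResult MakeSizeOnlyResult(int pointsize) {
  assert(pointsize >= 0);
  WordFontAttributesResult result;
  result.pointsize = pointsize;
  return result;
}

// Caller-side helper: report boldness without guessing.
std::string DescribeBold(const WordFontAttributesResult& r) {
  if (!r.is_bold.has_value()) return "unknown";
  return *r.is_bold ? "bold" : "regular";
}
```

If a later engine version started to recognize bold text, it would simply fill in is_bold, and existing callers would pick that up without a change to the signature.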
Do we have a solution for this?
As you can see the issue is still open.
It's unknown when font name, bold and italic identification will be supported for the LSTM engine.
is there another method or package that can determine font size?
font size is supported:
Is this issue still open?
Is this issue still open?
https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-576631778
Can you provide a partial solution to this, such as access to the font size only? I think that is supported. Please.
Hello! Is this issue still open? I need to get some font properties from scanned pdf like when text is bold or underlined. WordFontAttribute is returning None, any suggestion on what I can use to get these properties?
Thanks!
@coco2121 Hi, did you manage to find any solutions? I am also trying to solve exactly the same problem as yours.
The LSTM engine does not support font attributes other than point size, and as I said 4 years ago, it won't support these attributes any time soon (It is not planned).
However, the legacy engine is still available in versions 4.x and 5.x and it supports these attributes. You need a model that includes data for the legacy engine and you need to use --oem 0 (It might also work with --oem 3, not sure).
Environment
Current Behavior:
Method WordFontAttributes returns null if using tesseract 4.00.00alpha with 4.00 tessdata, but it returns font name if using tesseract 4.00.00alpha with 3.04.00 tessdata. The test image is eurotext.tif. I first met this problem when using tesserocr (tesserocr#68: https://github.com/sirfz/tesserocr/issues/68).
Expected Behavior:
With method WordFontAttributes we can get correct font attributes of recognized words.