manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Inconsistent behaviour of QFontDatabase::hasFamily() and QFontDatabase::families(). #630

Closed edward-dauvergne closed 5 months ago

edward-dauvergne commented 1 year ago

While investigating font issues (issue #629), I noticed that there is a difference in the result from QFontDatabase::hasFamily() compared to the list returned by QFontDatabase::families(). Some background might be useful here:

To create a small test case, I generated a blank PNG file with:

convert -page A4 -size 595x842 xc:white empty.png

Then I manually created the following test.html file:

<!DOCTYPE html>
<html>
<head>
 <title>test.html</title>
 <meta charset="utf-8" /> 
 <meta name='ocr-system' content='tesseract 4.1.1' />
 <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
 <div title="bbox 0 0 595 842; image 'empty.png'; ppageno 1; rot 0; scan_res 100" class="ocr_page" id="page_1">
  <div title="bbox 69 121 396 206" class="ocr_carea" id="carea_1_1">
   <p title="bbox 69 121 396 206" class="ocr_par" id="par_1_1">
    <span title="baseline 0 0; bbox 69 121 396 154; x_ascenders 25; x_descenders 25; x_size 100" class="ocr_line" id="line_1_1">
     <span title="bbox 69 121 396 154; x_font Nimbus Sans; x_fsize 26; x_wconf 100" class="ocrx_word" id="word_1_1" lang="en_US">Test &quot;Nimbus Sans&quot;:</span>
    </span>
    <span title="baseline 0 0; bbox 71 176 270 206; x_ascenders 25; x_descenders 25; x_size 100" class="ocr_line" id="line_1_2">
     <span title="bbox 71 176 270 206; x_font Nimbus Sans; x_fsize 26; x_wconf 100" bold="0" class="ocrx_word" id="word_1_2" italic="0" lang="en_US">0123456789</span>
    </span>
   </p>
  </div>
  <div title="bbox 76 286 533 367" class="ocr_carea" id="carea_1_3">
   <p title="bbox 76 286 533 367" class="ocr_par" id="par_1_2">
    <span title="baseline 0 0; bbox 78 286 533 325; x_ascenders 25; x_descenders 25; x_size 100" class="ocr_line" id="line_1_3">
     <span title="bbox 78 286 533 325; x_font Nimbus Mono L; x_fsize 26; x_wconf 100" bold="0" class="ocrx_word" id="word_1_3" italic="0" lang="en_US">Test &quot;Nimbus Mono L&quot;:</span>
    </span>
    <span title="baseline 0 0; bbox 76 344 296 367; x_ascenders 25; x_descenders 25; x_size 100" class="ocr_line" id="line_1_4">
     <span title="bbox 76 344 296 367; x_font Nimbus Mono L; x_fsize 26; x_wconf 100" bold="0" class="ocrx_word" id="word_1_4" italic="0" lang="en_US">0123456789</span>
    </span>
   </p>
  </div>
  <div title="bbox 73 454 558 557" class="ocr_carea" id="carea_1_2">
   <p title="bbox 73 454 558 557" class="ocr_par" id="par_1_3">
    <span title="baseline 0 0; bbox 73 454 558 492; x_ascenders 25; x_descenders 25; x_size 100" class="ocr_line" id="line_1_5">
     <span title="bbox 73 454 558 492; x_font OCRA [PfEd]; x_fsize 26; x_wconf 100" bold="0" class="ocrx_word" id="word_1_5" italic="0" lang="en_US">Test &quot;OCRA [PfEd]&quot;:</span>
    </span>
    <span title="baseline 0 0; bbox 76 526 335 557; x_ascenders 25; x_descenders 25; x_size 100" class="ocr_line" id="line_1_6">
     <span title="bbox 76 526 335 557; x_font OCRA [PfEd]; x_fsize 26; x_wconf 100" bold="0" class="ocrx_word" id="word_1_6" italic="0" lang="en_US">0123456789</span>
    </span>
   </p>
  </div>
 </div>
</body>
</html>

The GUI looks like this: Screenshot_GUI

Finally, I exported to PDF using PoDoFo, in PDF output mode at 300 dpi in grayscale. The result is test.pdf.

Playing with the code in qt/src/hocr/HOCRPdfExporter.cc, I could see that the OCRA [PfEd] font was being replaced by the default font Nimbus Roman, as QFontDatabase::hasFamily() was returning false for this font. The Nimbus Mono L font was also switched for the default font, but for an unrelated FreeType reason:

CRITICAL: FreeType returned the error 35 when calling FT_Load_Sfnt_Table for font /usr/share/fonts/default/Type1/n022003l.pfb.

However, checking QFontDatabase::families() shows that OCRA [PfEd] is in the QFontDatabase! I therefore tried the following change:

 qt/src/hocr/HOCRPdfExporter.cc | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/qt/src/hocr/HOCRPdfExporter.cc b/qt/src/hocr/HOCRPdfExporter.cc
index 1785b649..10ba00ed 100644
--- a/qt/src/hocr/HOCRPdfExporter.cc
+++ b/qt/src/hocr/HOCRPdfExporter.cc
@@ -242,7 +242,14 @@ HOCRQPainterPdfPrinter::HOCRQPainterPdfPrinter(QPainter* painter, const QFont& d

 void HOCRQPainterPdfPrinter::setFontFamily(const QString& family, bool bold, bool italic) {
    float curSize = m_curFont.pointSize();
-   if(m_fontDatabase.hasFamily(family)) {
+    const QStringList fontFamilies = m_fontDatabase.families();
+    bool hasFamily=false;
+    for (const QString &ifamily : fontFamilies) {
+        if (ifamily == family) {
+            hasFamily = true;
+        }
+    }
+    if(hasFamily) {
        m_curFont.setFamily(family);
    }  else {
        m_curFont = m_defaultFont;
@@ -561,7 +568,16 @@ PoDoFo::PdfFont* HOCRPoDoFoPdfPrinter::getFont(QString family, bool bold, bool i
    QString key = family + (bold ? "@bold" : "") + (italic ? "@italic" : "");
    auto it = m_fontCache.find(key);
    if(it == m_fontCache.end()) {
-       if(family.isEmpty() || !m_fontDatabase.hasFamily(family)) {
+        const QStringList fontFamilies = m_fontDatabase.families();
+        bool found=false;
+        for (const QString &ifamily : fontFamilies) {
+            std::cout << "Family: '" << ifamily.toStdString() << "'." << std::endl;
+            if (ifamily == family) {
+                std::cout << "Match!" << std::endl;
+                found = true;
+            }
+        }
+       if(family.isEmpty() || !found) {
            family = m_defaultFontFamily;
             std::cout << _("WARNING: Cannot find the font '%1' in the QFontDatabase, switching to the font '%2'.").arg(key).arg(family).toStdString() << std::endl;
             //QMessageBox::warning(MAIN, _("Missing Font"), _("WARNING: Cannot find the font '%1' in the QFontDatabase, switching to the font '%2'.").arg(key).arg(family));

This resulted in the OCRA [PfEd] font being recognised and embedded in the PDF output, although there are still issues with the resulting PDF, test_modified.pdf:

test_modified

According to the recent Qt docs for QFontDatabase, the hasFamily() function is not documented at all. Is it deprecated? Is it suffering from bit-rot? The families() function, however, is documented. Should the code therefore be switched to this function?

Cheers,

Edward

edward-dauvergne commented 1 year ago

For the record, the Nimbus Mono L error:

CRITICAL: FreeType returned the error 35 when calling FT_Load_Sfnt_Table for font /usr/share/fonts/default/Type1/n022003l.pfb.

can be worked around by switching to the Nimbus Mono PS font instead.

manisandro commented 1 year ago

I believe it should be safe to change

m_fontDatabase.hasFamily(family)

to

m_fontDatabase.families().contains(family)

Can you test and submit a PR?
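
For illustration only, here is a minimal Qt-free sketch of the lookup logic this change amounts to: an exact, case-sensitive membership test against the list of families, with a fallback to the default font when the requested family is empty or missing. The names FamilyList, hasExactFamily and resolveFamily are hypothetical and do not exist in gImageReader or Qt; a std::vector<std::string> stands in for the QStringList returned by families().

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for the QStringList returned by QFontDatabase::families().
using FamilyList = std::vector<std::string>;

// Exact, case-sensitive membership test. This mirrors what
// families().contains(family) does with its default case sensitivity,
// and stops at the first match instead of scanning the whole list.
bool hasExactFamily(const FamilyList& families, const std::string& family) {
    return std::find(families.begin(), families.end(), family) != families.end();
}

// Returns the requested family if it is listed, otherwise the fallback.
// An empty request also resolves to the fallback, as in the getFont() hunk above.
std::string resolveFamily(const FamilyList& families,
                          const std::string& requested,
                          const std::string& fallback) {
    if (requested.empty() || !hasExactFamily(families, requested)) {
        return fallback;
    }
    return requested;
}
```

With a list containing "Nimbus Roman", "Nimbus Sans" and "OCRA [PfEd]", resolveFamily(list, "OCRA [PfEd]", "Nimbus Roman") keeps the requested family, while an empty or unlisted name falls back to "Nimbus Roman". Compared with the hand-written loops in the diff, the membership test is a single call and carries no debug output.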

edward-dauvergne commented 1 year ago

Sure! I'm on holidays for the next 2 weeks, but after that I can give it a go.

bobhairgrove commented 6 months ago

This was correct in Qt 5:

m_fontDatabase.families().contains(family)

Please note that QFontDatabase::families() is now a static function in Qt 6 (as it should be, IMHO):

QFontDatabase::families().contains(family)

would work in Qt 6.

manisandro commented 5 months ago

Changed in https://github.com/manisandro/gImageReader/commit/7755979dea503f29aad8f9284f27858aab4af159