manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Possibility of using a newer Tesseract version (5.x) on (Ubuntu) Linux? #576

Closed FlyingFathead closed 2 years ago

FlyingFathead commented 2 years ago

Hi there, and first of all, thanks for the highly useful software you've created! :)

gImageReader is to this day my go-to tool for Tesseract OCR, but sadly, even when I build it from source on Ubuntu 21.10, it uses Tesseract version 4.1.1 according to the software's "About" window.

Since I've noticed far better OCR accuracy with Tesseract 5.x, which I have compiled from source and installed as my system's default Tesseract, I would like to ask whether it is possible to use this newer Tesseract 5.x (git) version with gImageReader.

My tesseract --version shows the latest git version (5.0.1-43 as of this writing) and the latest leptonica libraries are installed, but gImageReader still sticks to Tesseract 4.1.1 regardless.

Thanks once again, and all info on this is highly appreciated.

manisandro commented 2 years ago

Hi, you need to recompile gImageReader against tesseract 5 - gImageReader will always display the tesseract version it was compiled against.

FlyingFathead commented 2 years ago

Thanks for the quick reply & clarification! However, I have now run into another problem when recompiling on Ubuntu 21.10: the gtk build progresses with occasional deprecation warnings, but when it reaches 100%, this happens:

[100%] Linking CXX executable gimagereader-gtk
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Config.cc.o: in function `Config::getAvailableLanguages()':
Config.cc:(.text+0x2eaa): undefined reference to `tesseract::TessBaseAPI::GetAvailableLanguagesAsVector(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*) const'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Config.cc.o: in function `tesseract::TessBaseAPI::Init(char const*, char const*)':
Config.cc:(.text._ZN9tesseract11TessBaseAPI4InitEPKcS2_[_ZN9tesseract11TessBaseAPI4InitEPKcS2_]+0x43): undefined reference to `tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, bool)'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Recognizer.cc.o: in function `Recognizer::recognize(std::vector<int, std::allocator<int> > const&, bool)::{lambda()#1}::operator()() const':
Recognizer.cc:(.text+0x2616): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Recognizer.cc.o: in function `Recognizer::recognizeImage(Cairo::RefPtr<Cairo::ImageSurface> const&, Recognizer::OutputDestination)::{lambda()#1}::operator()() const':
Recognizer.cc:(.text+0x316c): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Recognizer.cc.o: in function `Recognizer::recognizeImage(Cairo::RefPtr<Cairo::ImageSurface> const&, Recognizer::OutputDestination)::{lambda()#2}::operator()() const':
Recognizer.cc:(.text+0x3219): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Recognizer.cc.o: in function `Recognizer::recognizeBatch()::{lambda()#1}::operator()() const':
Recognizer.cc:(.text+0x40ea): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/gimagereader.dir/build.make:718: gimagereader-gtk] Error 1
make[1]: *** [CMakeFiles/Makefile2:182: CMakeFiles/gimagereader.dir/all] Error 2
make: *** [Makefile:149: all] Error 2

I'm pretty sure I've successfully compiled gImageReader on my other Linux desktop machines running Ubuntu 64-bit (amd64). I tried googling the error messages but found little to no information on how to resolve them. Tesseract is compiled from git source and works without any issues. All help is kindly appreciated. Thanks!

manisandro commented 2 years ago

Probably -ltesseract is missing from your link command. Try make VERBOSE=1 to see the full commands and check whether tesseract appears as a link library.
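A minimal sketch of that check, assuming a POSIX shell. The helper name `has_tesseract_lib` is hypothetical and not part of gImageReader or CMake; it only inspects a link command passed in as text, and the example link line is shortened and made up.

```shell
#!/bin/sh
# Hypothetical helper (not part of gImageReader): succeeds when a link
# command, passed as one string, names the tesseract library.
has_tesseract_lib() {
    # Put each whitespace-separated token on its own line, then look for
    # an exact -ltesseract flag or a full path to libtesseract.so / .a.
    printf '%s\n' "$1" | tr -s ' \t' '\n\n' |
        grep -Eq '^(-ltesseract|.*/libtesseract\.(so|a)[.0-9]*)$'
}

# Example with a shortened, made-up link line:
if has_tesseract_lib "c++ main.o -o gimagereader-gtk -ltesseract -larchive"; then
    echo "tesseract is on the link line"
else
    echo "tesseract is missing from the link line"
fi
```

On a real build, the string would be the link command printed by make VERBOSE=1. A positive result only shows that the flag is present; it does not show which libtesseract the linker ends up resolving.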

FlyingFathead commented 2 years ago

OK - I tried: make VERBOSE=1

End result was this:

[100%] Linking CXX executable gimagereader-gtk
/usr/bin/cmake -E cmake_link_script CMakeFiles/gimagereader.dir/link.txt --verbose=1
/usr/bin/c++  -fopenmp CMakeFiles/gimagereader.dir/common/CCITTFax4Encoder.cc.o CMakeFiles/gimagereader.dir/common/PaperSize.cc.o CMakeFiles/gimagereader.dir/gtk/src/Acquirer.cc.o CMakeFiles/gimagereader.dir/gtk/src/Config.cc.o CMakeFiles/gimagereader.dir/gtk/src/ConfigSettings.cc.o CMakeFiles/gimagereader.dir/gtk/src/CrashHandler.cc.o CMakeFiles/gimagereader.dir/gtk/src/DisplayRenderer.cc.o CMakeFiles/gimagereader.dir/gtk/src/Displayer.cc.o CMakeFiles/gimagereader.dir/gtk/src/DisplayerToolSelect.cc.o CMakeFiles/gimagereader.dir/gtk/src/DjVuDocument.cc.o CMakeFiles/gimagereader.dir/gtk/src/FileDialogs.cc.o CMakeFiles/gimagereader.dir/gtk/src/FileTreeModel.cc.o CMakeFiles/gimagereader.dir/gtk/src/FontComboBox.cc.o CMakeFiles/gimagereader.dir/gtk/src/Image.cc.o CMakeFiles/gimagereader.dir/gtk/src/MainWindow.cc.o CMakeFiles/gimagereader.dir/gtk/src/OutputBuffer.cc.o CMakeFiles/gimagereader.dir/gtk/src/OutputEditorText.cc.o CMakeFiles/gimagereader.dir/gtk/src/RecognitionMenu.cc.o CMakeFiles/gimagereader.dir/gtk/src/Recognizer.cc.o CMakeFiles/gimagereader.dir/gtk/src/SearchReplaceFrame.cc.o CMakeFiles/gimagereader.dir/gtk/src/SourceManager.cc.o CMakeFiles/gimagereader.dir/gtk/src/SubstitutionsManager.cc.o CMakeFiles/gimagereader.dir/gtk/src/TessdataManager.cc.o CMakeFiles/gimagereader.dir/gtk/src/Utils.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/DisplayerToolHOCR.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/HOCRBatchExportDialog.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/HOCRDocument.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/HOCROdtExporter.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/HOCRPdfExportWidget.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/HOCRPdfExporter.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/HOCRSpellChecker.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/HOCRTextExporter.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/OutputEditorHOCR.cc.o CMakeFiles/gimagereader.dir/gtk/src/hocr/XmlUtils.cc.o CMakeFiles/gimagereader.dir/gtk/src/main.cc.o 
CMakeFiles/gimagereader.dir/gtk/src/scanner/ScannerSane.cc.o CMakeFiles/gimagereader.dir/gimagereader.gresource.c.o -o gimagereader-gtk  -ltesseract -larchive -lgtkmm-3.0 -latkmm-1.6 -lgdkmm-3.0 -lgiomm-2.4 -lgtk-3 -lgdk-3 -latk-1.0 -lcairo-gobject -lgio-2.0 -lpangomm-1.4 -lglibmm-2.4 -lcairomm-1.0 -lsigc-2.0 -lpangocairo-1.0 -lpango-1.0 -lharfbuzz -lcairo -lgdk_pixbuf-2.0 -lgobject-2.0 -lglib-2.0 -lgtksourceviewmm-3.0 -lgtkmm-3.0 -latkmm-1.6 -lgdkmm-3.0 -lgiomm-2.4 -lpangomm-1.4 -lglibmm-2.4 -lcairomm-1.0 -lsigc-2.0 -lgtksourceview-3.0 -lgtk-3 -lgdk-3 -lpangocairo-1.0 -lpango-1.0 -lharfbuzz -latk-1.0 -lcairo-gobject -lcairo -lgdk_pixbuf-2.0 -lgio-2.0 -lgobject-2.0 -lglib-2.0 -lgtkspellmm-3.0 -lgtkspell3-3 -lenchant-2 -lgtkmm-3.0 -latkmm-1.6 -lgdkmm-3.0 -lgiomm-2.4 -lgtk-3 -lgdk-3 -latk-1.0 -lcairo-gobject -lgio-2.0 -lpangomm-1.4 -lglibmm-2.4 -lcairomm-1.0 -lsigc-2.0 -lpangocairo-1.0 -lpango-1.0 -lharfbuzz -lcairo -lgdk_pixbuf-2.0 -lgobject-2.0 -lglib-2.0 -lcairomm-1.0 -lcairo -lsigc-2.0 -lpangomm-1.4 -lglibmm-2.4 -lcairomm-1.0 -lsigc-2.0 -lpangocairo-1.0 -lpango-1.0 -lgobject-2.0 -lglib-2.0 -lharfbuzz -lcairo -lpoppler-glib -lgobject-2.0 -lglib-2.0 -lcairo -ljson-glib-1.0 -lgio-2.0 -lgobject-2.0 -lglib-2.0 -lxml++-2.6 -lxml2 -lglibmm-2.4 -lgobject-2.0 -lglib-2.0 -lsigc-2.0 /usr/lib/x86_64-linux-gnu/libjpeg.so -lfontconfig -lfreetype -lzip -lsane -ldjvulibre -lenchant-2 -lpodofo -ldl -lgtkmm-3.0 -latkmm-1.6 -lgdkmm-3.0 -lgiomm-2.4 -lgtk-3 -lgdk-3 -latk-1.0 -lcairo-gobject -lgio-2.0 -lpangomm-1.4 -lglibmm-2.4 -lcairomm-1.0 -lsigc-2.0 -lpangocairo-1.0 -lpango-1.0 -lharfbuzz -lcairo -lgdk_pixbuf-2.0 -lgobject-2.0 -lglib-2.0 -lgtksourceviewmm-3.0 -lgtksourceview-3.0 -lgtkspellmm-3.0 -lgtkspell3-3 -lpoppler-glib -ljson-glib-1.0 -lxml++-2.6 -lxml2 /usr/lib/x86_64-linux-gnu/libjpeg.so -lfontconfig -lfreetype -lzip -lsane -ldjvulibre -lpodofo -ldl
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Config.cc.o: in function `Config::getAvailableLanguages()':
Config.cc:(.text+0x2eaa): undefined reference to `tesseract::TessBaseAPI::GetAvailableLanguagesAsVector(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*) const'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Config.cc.o: in function `tesseract::TessBaseAPI::Init(char const*, char const*)':
Config.cc:(.text._ZN9tesseract11TessBaseAPI4InitEPKcS2_[_ZN9tesseract11TessBaseAPI4InitEPKcS2_]+0x43): undefined reference to `tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, bool)'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Recognizer.cc.o: in function `Recognizer::recognize(std::vector<int, std::allocator<int> > const&, bool)::{lambda()#1}::operator()() const':
Recognizer.cc:(.text+0x2616): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Recognizer.cc.o: in function `Recognizer::recognizeImage(Cairo::RefPtr<Cairo::ImageSurface> const&, Recognizer::OutputDestination)::{lambda()#1}::operator()() const':
Recognizer.cc:(.text+0x316c): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Recognizer.cc.o: in function `Recognizer::recognizeImage(Cairo::RefPtr<Cairo::ImageSurface> const&, Recognizer::OutputDestination)::{lambda()#2}::operator()() const':
Recognizer.cc:(.text+0x3219): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
/usr/bin/ld: CMakeFiles/gimagereader.dir/gtk/src/Recognizer.cc.o: in function `Recognizer::recognizeBatch()::{lambda()#1}::operator()() const':
Recognizer.cc:(.text+0x40ea): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
collect2: error: ld returned 1 exit status

So, -ltesseract is visibly listed there, but other than that I have no clue what's causing the error. Could something be missing because I built the Tesseract and Leptonica libraries from source as well? Thanks so much for your help so far.

manisandro commented 2 years ago

Are you linking against the correct tesseract library? In particular, the one matching the included headers?
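One way to act on that question is to reduce the linker output to the distinct missing symbols and then look for them in the library being linked. A minimal sketch, assuming a POSIX shell with `sed` and `sort`; `undefined_symbols` is a hypothetical helper name, and the sample input is abbreviated from the errors posted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read linker output on stdin and print each distinct
# symbol that ld reported as an undefined reference.
undefined_symbols() {
    # ld wraps the symbol in a backtick and an apostrophe; keep only the
    # text between them, then drop repeats.
    sed -n "s/.*undefined reference to \`\(.*\)'\$/\1/p" | sort -u
}

# Abbreviated sample of the errors reported in this thread:
undefined_symbols <<'EOF'
Recognizer.cc:(.text+0x2616): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
Recognizer.cc:(.text+0x316c): undefined reference to `tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*)'
collect2: error: ld returned 1 exit status
EOF
```

Each printed name can then be searched for in the library the build resolves, for instance with `nm -D --defined-only <path to libtesseract.so> | c++filt | grep Recognize`. If the symbol is absent or shows a different signature, the headers and the library likely come from different Tesseract versions.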

FlyingFathead commented 2 years ago

Hmm, good question!

At least for the tessdata (assuming that's what you meant by Tesseract libraries?), they seem to be in two places:

/usr/share/tessdata
/usr/share/tesseract-ocr/4.00/tessdata

I noticed that the issue had popped up elsewhere, especially in this thread: https://github.com/manisandro/gImageReader/issues/407#issuecomment-496127368

I tried setting the TESSDATA_PREFIX environment variable, but that didn't help, and neither did passing -DTESSDATA_PREFIX=/usr/share/tessdata to cmake. Has that option been removed?

CMake Warning:
  Manually-specified variables were not used by the project:

    TESSDATA_PREFIX

All further tips on what I might be missing here would be highly appreciated. Thanks.

manisandro commented 2 years ago

TESSDATA_PREFIX is a runtime environment variable that defines where the tessdata files are located; it is not related to any build-time setting. gImageReader detects tesseract via pkg-config, see https://github.com/manisandro/gImageReader/blob/master/CMakeLists.txt#L58. Check that pkg-config returns the desired tesseract includes and libs. If not, either tweak the gImageReader CMakeLists.txt or set PKG_CONFIG_LIBDIR to the directory where the tesseract.pc of your desired installation is located.
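To see which installation pkg-config would hand to the build, the `-L` directories in its output are the part that matters. A minimal sketch, assuming a POSIX shell; `lib_dirs_from_flags` is a hypothetical helper, and the flag string below is made up for illustration rather than taken from a real system.

```shell
#!/bin/sh
# Hypothetical helper: print every library directory (-L<dir>) found in a
# string of linker flags such as the output of `pkg-config --libs`.
lib_dirs_from_flags() {
    # Intentionally unquoted so the shell splits the string into flags.
    for flag in $1; do
        case "$flag" in
            -L?*) printf '%s\n' "${flag#-L}" ;;
        esac
    done
}

# Made-up example: a tesseract installed under /usr/local.
lib_dirs_from_flags "-L/usr/local/lib -ltesseract"
```

On a real system the input would be `$(pkg-config --libs tesseract)`, and `pkg-config --modversion tesseract` reports the version recorded in the `.pc` file that was found. An empty result means no `-L` flag was emitted, so the linker falls back to its default search path, as with a bare `-ltesseract`.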

FlyingFathead commented 2 years ago

Okay, thanks once more for the clarification!

pkg-config --list-all does show Tesseract in the list, and /usr/include/tesseract in the CMakeLists.txt that you linked to is (and has been) the correct location for -ltesseract ...

The tesseract.pc file also seems to be in place in /usr/lib/pkgconfig. Nevertheless, since PKG_CONFIG_PATH had not been set as an environment variable, I tried once more after running export PKG_CONFIG_PATH=/usr/lib/pkgconfig, just in case. No luck: the build still fails at the same spot as before.

I'm beginning to wonder if building Tesseract from source has left something crucial (for compiling gImageReader) uncompiled?

FlyingFathead commented 2 years ago

Solved it!

I purged every leftover libleptonica-dev related package from dpkg and then manually hunted down the stray files (i.e. cleared /usr/include/leptonica), recompiled both leptonica and tesseract from source, and now gImageReader compiles! 👍
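Leftovers of this kind can be spotted before rebuilding by listing every copy of a header or library under the usual prefixes. A minimal sketch, assuming a POSIX shell with `find` and `mktemp`; `find_copies` is a hypothetical helper, and the demonstration uses fake files in a temporary directory rather than a real installation.

```shell
#!/bin/sh
# Hypothetical helper: list every file whose name matches a pattern under
# the given directories. Usage: find_copies PATTERN DIR...
find_copies() {
    pattern=$1
    shift
    # Errors for missing or unreadable directories are silenced.
    find "$@" -name "$pattern" 2>/dev/null | sort
}

# Self-contained demonstration with two fake installs in a temp directory.
demo=$(mktemp -d)
mkdir -p "$demo/usr/lib" "$demo/usr/local/lib"
touch "$demo/usr/lib/libtesseract.so.4" "$demo/usr/local/lib/libtesseract.so.5"
find_copies 'libtesseract.so*' "$demo/usr" | sed "s|^$demo||"
rm -rf "$demo"
```

On a real system, something like `find_copies 'libtesseract.so*' /usr/lib /usr/local/lib` or `find_copies 'allheaders.h' /usr/include /usr/local/include` would show whether a distribution copy still sits next to the self-built one. The directories named here are common defaults and may differ per system.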

Thanks so much for your help, have a nice day!

manisandro commented 2 years ago

Glad you solved it, cheers