tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0
60.94k stars 9.37k forks source link

Test suite depends on submodule instead of using gtest from distribution #2838

Open kloczek opened 4 years ago

kloczek commented 4 years ago

Looks like some files are missing in dist tar ball.

[tkloczko@barrel tesseract-4.1.1]$ make check
Making check in src/arch
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/arch'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/arch'
Making check in src/ccutil
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/ccutil'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/ccutil'
Making check in src/viewer
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/viewer'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/viewer'
Making check in src/cutil
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/cutil'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/cutil'
Making check in src/opencl
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/opencl'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/opencl'
Making check in src/ccstruct
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/ccstruct'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/ccstruct'
Making check in src/dict
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/dict'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/dict'
Making check in src/classify
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/classify'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/classify'
Making check in src/wordrec
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/wordrec'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/wordrec'
Making check in src/textord
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/textord'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/textord'
Making check in src/lstm
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/lstm'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/lstm'
Making check in src/ccmain
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/ccmain'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/ccmain'
Making check in src/api
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/api'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/src/api'
Making check in .
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1'
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1'
Making check in tessdata
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/tessdata'
Making check in configs
make[2]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/tessdata/configs'
make[2]: Nothing to be done for 'check'.
make[2]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/tessdata/configs'
Making check in tessconfigs
make[2]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/tessdata/tessconfigs'
make[2]: Nothing to be done for 'check'.
make[2]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/tessdata/tessconfigs'
make[2]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/tessdata'
make[2]: Nothing to be done for 'check-am'.
make[2]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/tessdata'
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/tessdata'
Making check in doc
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/doc'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/doc'
Making check in unittest
make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/unittest'
make  apiexample_test applybox_test baseapi_test  bitvector_test cleanapi_test colpartition_test commandlineflags_test dawg_test denorm_test  fileio_test heap_test imagedata_test indexmapbidi_test intfeaturemap_test intsimdmatrix_test lang_model_test layout_test ligature_table_test linlsq_test lstm_recode_test lstm_squashed_test lstm_test lstmtrainer_test loadlang_test mastertrainer_test matrix_test  normstrngs_test nthitem_test osd_test pagesegmode_test  paragraphs_test params_model_test progress_test qrsequence_test recodebeam_test rect_test resultiterator_test scanutils_test shapetable_test stats_test  stringrenderer_test tablefind_test tablerecog_test tabvector_test  textlineprojection_test tfile_test unichar_test unicharcompress_test unicharset_test validate_grapheme_test validate_indic_test validate_khmer_test validate_myanmar_test validator_test libgtest.la libgtest_main.la libgmock.la libgmock_main.la libabseil.la
make[2]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/unittest'
g++ -DHAVE_CONFIG_H -I. -I..  -O2 -DNDEBUG -DTESSBIN_DIR="\"/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1\"" -DLANGDATA_DIR="\"/home/tkloczko/rpmbuild/BUILD/langdata_lstm\"" -DTESSDATA_DIR="\"/home/tkloczko/rpmbuild/BUILD/tessdata\"" -DTESSDATA_BEST_DIR="\"/home/tkloczko/rpmbuild/BUILD/tessdata_best\"" -DTESTING_DIR="\"/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/test/testing\"" -DTESTDATA_DIR="\"/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/test/testdata\"" -DPANGO_ENABLE_ENGINE  -I../src/api -I../src/api -I../src/arch -I../src/ccmain -I../src/ccstruct -I../src/ccutil -I../src/classify -I../src/cutil -I../src/dict -I../src/display -I../src/lstm -I../src/textord -I../unittest/base -I../unittest/util -I../src/training -I../src/viewer -I../src/wordrec -I../abseil  -isystem ../googletest/googletest/include -isystem ../googletest/googlemock/include -I/usr/include/leptonica     -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/harfbuzz  -I/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16   -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -flto=auto -flto-partition=none -std=c++17 -c -o apiexample_test.o apiexample_test.cc
make[2]: *** No rule to make target '../googletest/googletest/src/gtest-all.cc', needed by '../googletest/googletest/src/libgtest_la-gtest-all.lo'.  Stop.
make[2]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/unittest'
make[1]: *** [Makefile:2656: check-am] Error 2
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/unittest'
make: *** [Makefile:502: check-recursive] Error 1
stweil commented 4 years ago

The dist tar ball does not include third party code which is needed for make check. Use the code from git if you want to run the checks.

kloczek commented 4 years ago

So it would be good to add those files to EXTRA_DIST variable.

Issue is that for now it is not possible to execute test suite using only dist tar balls.

zdenop commented 4 years ago

Test suite requires more external files, which are not (and never will be) distributed by tesseract. See https://github.com/tesseract-ocr/tesseract/blob/4.1/unittest/README.md

kloczek commented 4 years ago

Nevertheless it would be good to add something. Sometimes it is about "it is better than nothing".

stweil commented 4 years ago

I don't think that it would be reasonable to add files needed for make check to the distribution. Most people don't need them, so for those people those files would be simply unnecessary and cost bandwidth and disk space.

As @zdenop already wrote make check requires not only external code but also more files which cannot be added, for example because of license issues.

A more realistic solution would be enhancing the build rules. make check should either fail with a user friendly message if required files are missing, or it could try to add missing files.

@kloczek, a patch for that would be welcome. I'm afraid the few volunteers who currently try to improve Tesseract have more important priorities. That's at least true for me.

stweil commented 4 years ago

So it would be good to add those files to EXTRA_DIST variable. Issue is that for now it is not possible to execute test suite using only dist tar balls.

@kloczek, just to clarify: Are you talking about dist files made by Makefile? Then EXTRA_DIST could be used, but Tesseract does not offer such files. Or are you talking about dist files made by GitHub and provided at https://github.com/tesseract-ocr/tesseract/releases? As far as I know those files don't use EXTRA_DIST.

kloczek commented 3 years ago

I don't think that it would be reasonable to add files needed for make check to the distribution.

Almost all source dist tar balls are distributed with all necessary files to be able execute test suite.

Execution of that part is very important part of the packaging processes because it allow more or less confirm that just build binaries are OK. Usually that test suite as well are stressing many components of the build env. In other words remove some parts of the test suite which are causing that test suite is useless sooner or later will backfire project.

Really please understand that this is important.

kloczek commented 3 years ago

I see two separated issues: 1) using local copy of gtest and gmock 2) missing files in /test

For first one I just made quick patch only to prove that there is no reasons to use copy of the gtest and gmock and test unit binaries can be linked against system libgtest and libgmock shared libraries. Below patch is not finished because it is still necessary to add proper pkgconfig parts about detecting those libraries in configure.ac and CMakeLists.txt. Nevertheless there is no any technical reasons to prop[agate googletest files in tesseract

--- a/unittest/Makefile.am~     2019-12-26 14:21:51.000000000 +0000
+++ b/unittest/Makefile.am      2021-04-19 10:06:35.912844153 +0100
@@ -51,14 +51,10 @@
 endif # TENSORFLOW

 # Build googletest:
-check_LTLIBRARIES = libgtest.la libgtest_main.la libgmock.la libgmock_main.la
-libgtest_la_SOURCES = ../googletest/googletest/src/gtest-all.cc
-libgtest_la_CPPFLAGS = -I$(top_srcdir)/googletest/googletest/include -I$(top_srcdir)/googletest/googletest -pthread
-libgtest_main_la_SOURCES = ../googletest/googletest/src/gtest_main.cc
 ## libgtest_main_la_LIBADD = libgtest.la

 # Build Abseil (needed for some unit tests).
-check_LTLIBRARIES += libabseil.la
+check_LTLIBRARIES = libabseil.la
 libabseil_la_SOURCES =
 libabseil_la_SOURCES += ../abseil/absl/base/internal/cycleclock.cc
 libabseil_la_SOURCES += ../abseil/absl/base/internal/raw_logging.cc
@@ -87,30 +83,16 @@
 libabseil_la_SOURCES += ../abseil/absl/time/clock.cc
 libabseil_la_SOURCES += ../abseil/absl/time/duration.cc

-GMOCK_INCLUDES = -I$(top_srcdir)/googletest/googlemock/include \
-                 -I$(top_srcdir)/googletest/googlemock \
-                 -I$(top_srcdir)/googletest/googletest/include \
-                 -I$(top_srcdir)/googletest/googletest
-
-libgmock_la_SOURCES = ../googletest/googlemock/src/gmock-all.cc
-libgmock_la_CPPFLAGS = $(GMOCK_INCLUDES) \
-                       -pthread
-libgmock_main_la_SOURCES = ../googletest/googlemock/src/gmock_main.cc
-libgmock_main_la_CPPFLAGS = $(GMOCK_INCLUDES) \
-                            -pthread
-
 # Build unittests
 ABSEIL_LIBS = libabseil.la
-GTEST_LIBS =  libgtest.la libgtest_main.la -lpthread
-GMOCK_LIBS =  libgmock.la libgmock_main.la
+GTEST_LIBS =  -lgtest -lgtest_main -lpthread
+GMOCK_LIBS =  -lgmock
 TESS_LIBS = $(GTEST_LIBS)
 TESS_LIBS += $(top_builddir)/src/api/libtesseract.la $(libarchive_LIBS)
 TESS_LIBS += $(TENSORFLOW_LIBS)
 TRAINING_LIBS = $(top_builddir)/src/training/libtesseract_training.la
 TRAINING_LIBS += $(top_builddir)/src/training/libtesseract_tessopt.la
 TRAINING_LIBS += $(TESS_LIBS)
-AM_CPPFLAGS +=   -isystem $(top_srcdir)/googletest/googletest/include \
-                 -isystem $(top_srcdir)/googletest/googlemock/include

 check_PROGRAMS = apiexample_test
 if ENABLE_TRAINING

If are happy about above I can try to finish configure.ac and CMakeLists.txt and submit that as PR.

Second thing is merge tesseract repo with test files subrepo because even if with above patch I was able to compile everything test suite is failing because missing content of the test/

make[1]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/unittest'
make  apiexample_test applybox_test baseapi_test  bitvector_test cleanapi_test colpartition_test commandlineflags_test dawg_test denorm_test  fileio_test heap_test imagedata_test indexmapbidi_test intfeaturemap_test intsimdmatrix_test lang_model_test layout_test ligature_table_test linlsq_test lstm_recode_test lstm_squashed_test lstm_test lstmtrainer_test loadlang_test mastertrainer_test matrix_test  normstrngs_test nthitem_test osd_test pagesegmode_test  paragraphs_test params_model_test progress_test qrsequence_test recodebeam_test rect_test resultiterator_test scanutils_test shapetable_test stats_test  stringrenderer_test tablefind_test tablerecog_test tabvector_test  textlineprojection_test tfile_test unichar_test unicharcompress_test unicharset_test validate_grapheme_test validate_indic_test validate_khmer_test validate_myanmar_test validator_test libabseil.la
make[2]: Entering directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/unittest'
make[2]: stat: /home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/test/testing/phototest.tif: Too many levels of symbolic links
mkdir -p ../test/testing
ln -s /home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/test/testing/phototest.tif ../test/testing/phototest.tif
ln: failed to create symbolic link '../test/testing/phototest.tif': File exists
make[2]: *** [Makefile:2878: /home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/test/testing/phototest.tif] Error 1
make[2]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/unittest'
make[1]: *** [Makefile:2548: check-am] Error 2
make[1]: Leaving directory '/home/tkloczko/rpmbuild/BUILD/tesseract-4.1.1/unittest'
make: *** [Makefile:505: check-recursive] Error 1
stweil commented 3 years ago

Almost all source dist tar balls are distributed with all necessary files to be able execute test suite.

The more complex software packages which I know don't include all files which are required for the test suite in their source dist tar ball, simply because that would make that tar ball too large.

I know that recent Linux distributions can at least provide googletest, but they still don't include abseil. In addition to those sources, a lot of other files are required for make check. Replacing the well defined revision of googletest by a distribution package does not solve the problem, but creates a new one because there might be compatiblity issues and for other platforms like Windows we don't have such packages.

The better way to go is fixing the build rules to get all missing required files for unit tests from online sources. The Git submodules can be initialized automatically if they are missing, font files can be fetched, model files can be fetched, too.

stweil commented 3 years ago

Issue is that for now it is not possible to execute test suite using only dist tar balls.

That is possible. It only requires some git clone and wget or curl commands. Feel free to add the necessary documentation and/or a script which does the necessary steps and send a pull request for 4.1. Git master already contains more build code to do that automatically, but even there some steps are still missing.

kloczek commented 3 years ago

Issue is that many build envs (including mine) are deliberately) are cut of from public network. Just checked all my src.rpm packages I found that many packages are using 0.5-0.8 GB of compressed source tar balls. Really .. as "big" is term of so called fuzzy logic and exact value of that term depends only on some reference value :) Usually separation of some test data is only because that those files are using different license.

For example OpenCV is using separated tar ball with test data however those data are kept in separated repo which is always synchronously tagged with the same git tag on release as main OpenCV repo and all what you need is just unpack test data in OpenCV source tree.

kloczek commented 3 years ago

I know that recent Linux distributions can at least provide googletest, but they still don't include abseil

Where is that project? (cannot find it). If it provides .pc file all what you need is just star requiring that in ac/cmake. People will find it and package as everything else :)

stweil commented 3 years ago

Abseil does not provide a .pc file, but you can download a source tar ball (like for Googletest, too).

stweil commented 3 months ago

Comment from issue #3673:

5.4.1 has one issue. It uses bundled googletest included in source tree as submodules. Why not use system installed gtest? 🤔 All distros provides gtest ..

At least for macOS (Homebrew) and for Windows I don't know gtest packages which could be used. Therefore we must support a local solution like the current one with a Git submodule. Additional support for gtest from a Linux distribution is possible, but needs someone who implements it and sends a pull request.