opendocument-app / OpenDocument.core

C++ library that translates office documents to HTML
GNU General Public License v3.0
24 stars 9 forks source link

Build with pdf2htmlEX #380

Closed ViliusSutkus89 closed 1 month ago

ViliusSutkus89 commented 3 months ago

WIP. Currently just linking. Need to add actual usage

ViliusSutkus89 commented 3 months ago

I'm gonna go with the conclusion that conan with msvc-1939 is broken:

freetype/2.13.2: Meson configure cmd: meson setup --native-file "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\build-release\conan\conan_meson_native.ini" "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\build-release" "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\src" --prefix=/
freetype/2.13.2: RUN: meson setup --native-file "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\build-release\conan\conan_meson_native.ini" "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\build-release" "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\src" --prefix=/
conanvcvars.bat: Activating environment Visual Studio 17 - amd64 - winsdk_version=None - vcvars_ver=14.3
[ERROR:vcvars.bat] Toolset directory for version '14.3' was not found.
[ERROR:VsDevCmd.bat] *** VsDevCmd.bat encountered errors. Environment may be incomplete and/or incorrect. ***
[ERROR:VsDevCmd.bat] In an uninitialized command prompt, please 'set VSCMD_DEBUG=[value]' and then re-run
[ERROR:VsDevCmd.bat] vsdevcmd.bat [args] for additional details.
[ERROR:VsDevCmd.bat] Where [value] is:
[ERROR:VsDevCmd.bat]    1 : basic debug logging
[ERROR:VsDevCmd.bat]    2 : detailed debug logging
[ERROR:VsDevCmd.bat]    3 : trace level logging. Redirection of output to a file when using this level is recommended.
[ERROR:VsDevCmd.bat] Example: set VSCMD_DEBUG=3
[ERROR:VsDevCmd.bat]          vsdevcmd.bat > vsdevcmd.trace.txt 2>&1

Do we need msvc-1939 support for odr.core? Freetype is not our recipe, so this may eventually be solved without our intervention. We could set up a nice test case and report the issue to conan though

ViliusSutkus89 commented 3 months ago

Fontforge fails to compile with msvc-19.40. I've checked Fontforge and they do actually ship Windows binaries, although I'm unsure if they compile with msvc. Not sure if this is worth pursuing, conan-odr-index should be the first place where we add extra compilers or windows support. Disabling msvc compiler here.

andiwand commented 3 months ago

We don't really use the msvc build but I like to keep it because it picks on stuff the other compilers don't. Generally I think pdf2htmlex should be a feature we can toggle at compile time so I would just turn it off for windows.

ViliusSutkus89 commented 3 months ago

Brought back the msvc compilers, 19.39 and 19.40, just for good measure.

Anyway, I would need some help integrating on the c++ side. With the current code on this PR, #include <pdf2htmlEX.h> works.

I assume the integration should happen somewhere in src/odr/html.cpp:

Html html::translate(const PdfFile &pdf_file, const std::string &output_path,
                     const HtmlConfig &config) {
  fs::create_directories(output_path);
#if WITH_PDF2HTMLEX
  return pdf2htmlex_wrapper_which_returns_html_object(pdf_file, output_path, config);
#else
  return internal::html::translate_pdf_file(pdf_file, output_path, config);
#endif
}

Any other ideas?

andiwand commented 2 months ago

I think this makes a lot of sense yes. I don't think we need this runtime switchable. Even if we can change this in the future.

ViliusSutkus89 commented 2 months ago

Resource files are ugly, I gather them in odr.droid/app/CMakeLists.txt and then tell gradle that asset merging has a dependency on CMake - https://github.com/opendocument-app/OpenDocument.droid/blob/d26811deef04dbba6d70b97c01967ec72557d641/app/build.gradle#L120 . I doubt it's possible to make the gathering somehow nicer.

Runtime side could be nicer though. Currently we extract them right before calling pdf2htmlEX. Would be possible to move that to app init, but it would make app init extra slow, so I don't really want to go that route.

Proper implementation would be to read assets in C++ directly from APK, using AssetManager, but I can't recall right now what was wrong with that route. Maybe it requires a high min API level. It definitely requires having access to the JNI Env object and AssetManager, obtained from Android's Context. These objects are bound to the specific Java thread, which does JNI, pdf2htmlEX doesn't do threading, I'm unsure about poppler. Anyway, this implementation requires some code.

ViliusSutkus89 commented 2 months ago

This is more or less usable. APK available at https://github.com/opendocument-app/OpenDocument.droid/actions/runs/10419751442

We still need to wire encrypted pdf's, no idea how to do that. Will need @andiwand to look at it.

Btw, it would be possible to keep both pdf converters, pdf2htmlEX and the one at odr.core, but with different entry points in open_document_reader.hpp. This way would we could have fallback mode, if pdf2htmlEX fails

andiwand commented 2 months ago

Keeping both would be nice. I think the odr.core internal one is not very usable yet but having a runtime switch would be nice. I don't think we have a mechanism for that yet because it would be the same file format and two different open and translate mechanisms. I need to think a bit on how to best model that.

Encrypted files are also supported by the core. We currently handle that in the DecodedFile class.

andiwand commented 2 months ago

@ViliusSutkus89 there seem to be some compile errors in the index repository https://github.com/opendocument-app/conan-odr-index/actions/runs/10423033599

ViliusSutkus89 commented 2 months ago

That previous compile error was because of some borked package that wasn't rebuilt properly. Current compile error was because I've messed it up when switching profiles RelWithDebInfo->Release. Sorry for merging straight to main, I thought it would just work the first time, lol :D

ViliusSutkus89 commented 2 months ago

We have a working version with both pdf2htmlEX and wvWare compiled as conan deps into libodr-core.so. Latest build available on this actions run in odr.droid.

Check out the space savings from de-duplication. With conan deps:

$ du -hs lib/*
23M lib/arm64-v8a
18M lib/armeabi-v7a
23M lib/x86
24M lib/x86_64

With regular gradle deps:

$ du -hs lib/*
33M lib/arm64-v8a
26M lib/armeabi-v7a
32M lib/x86
34M lib/x86_64

odr.droid#350 mentioned size more than doubling. Release apk of odr.droid-3.19 was 56M, odr.droid-3.26 is 125M. Current release apk is 98M.

I'm not yet completely sure why it's so big. In odr.droid-3.19 extracted lib folder was 109M, assets were 21M, Resources were 7.3M+1.1M, classes were 2M. In current conan build extracted lib folder is 87M, assets are 21M, Resources are 2.7M+1.1M, classes are 8.2M.

The total size was supposed to go down even smaller than that of 3.19, unless my numbers are wrong, because I'm comparing f-droid release to my build, when f-droid release could have been passed through some better compressor. Will try to investigate further. Possibly the end result will be less than 3.19's 56M.

@andiwand , I need your opinion on the current interface between DocLoader/PdfLoader and C++.

ViliusSutkus89 commented 2 months ago

Correction, properly working APK is in this actions run - https://github.com/opendocument-app/OpenDocument.droid/actions/runs/10549275544

ViliusSutkus89 commented 2 months ago

src/wvWare.c could be placed somewhere else

ViliusSutkus89 commented 2 months ago

Hmmm... these tests are passing on my machine. I wonder why CI is throwing up

2024-08-28T23:40:26.5479540Z [ RUN      ] all_test_files/pdf2htmlEXWrapperTests.html/odr_private_pdf_onepage_pdf
2024-08-28T23:40:26.5481694Z /home/runner/work/OpenDocument.core/OpenDocument.core/test/data/input/odr-private/pdf/onepage.pdf to output/odr-private/output/pdf2htmlEX/pdf/onepage.pdf
2024-08-28T23:40:26.5506970Z Preprocessing: 0/1
2024-08-28T23:40:26.5507446Z Preprocessing: 1/1
2024-08-28T23:40:27.2307481Z Working: 0/1
2024-08-28T23:40:27.2307952Z Working: 1/1
2024-08-28T23:40:27.2318653Z unknown file: Failure
2024-08-28T23:40:27.2319323Z Unknown C++ exception thrown in the test body.
2024-08-28T23:40:27.2319822Z 
2024-08-28T23:40:27.2320824Z [  FAILED  ] all_test_files/pdf2htmlEXWrapperTests.html/odr_private_pdf_onepage_pdf, where GetParam() = "odr-private/pdf/onepage.pdf" (684 ms)
ViliusSutkus89 commented 2 months ago

I've finally figured out why the tests were failing, turns out I've forgot to feed test VMs with ~/.conan2/p, which contains assets. Reference outputs are now generated and committed into appropriate repos. Naturally, I've had to run into another problem.

Why is actions/checkout getting old submodule revisions? Do I need to manually update git revisions somewhere?

Checkout with submodules should just work, right?

- uses: actions/checkout@v4
  with:
    submodules: true

.gitmodules does not list any revisions:

[submodule "test/data/input/odr-public"]
    path = test/data/input/odr-public
    url = https://github.com/opendocument-app/OpenDocument.test.git
[submodule "test/data/reference-output/odr-public"]
    path = test/data/reference-output/odr-public
    url = https://github.com/opendocument-app/OpenDocument.test.output.git
[submodule "test/data/input/odr-private"]
    path = test/data/input/odr-private
    url = git@github.com:opendocument-app/OpenDocument.test-private.git
[submodule "test/data/reference-output/odr-private"]
    path = test/data/reference-output/odr-private
    url = git@github.com:opendocument-app/OpenDocument.test-private.output.git

OpenDocument.test-private

From https://github.com/opendocument-app/OpenDocument.test-private
 * branch            508.... -> FETCH_HEAD
Submodule path 'test/data/input/odr-private': checked out '508...'

508.... is the third most recent commit in that repo. Commited on Dec 30, 2021. There are two more recents commits, made in summer of 2022.

OpenDocument.test

opendocument-app/OpenDocument.test checks the correct revision, c2cc81ba91b6145ff51801644169f4f01878556b is the HEAD

Submodule path 'test/data/input/odr-public': checked out 'c2cc81ba91b6145ff51801644169f4f01878556b'

OpenDocument.test-private.output

From https://github.com/opendocument-app/OpenDocument.test-private.output
 * branch            1b5... -> FETCH_HEAD
Submodule path 'test/data/reference-output/odr-private': checked out '1b5...'

1b5... is second most recent commit, checkout skipped my recently added commit.

OpenDocument.test.output

From https://github.com/opendocument-app/OpenDocument.test.output
 * branch            6138deea822cc17e940181fe3d99b6b6aef64551 -> FETCH_HEAD
Submodule path 'test/data/reference-output/odr-public': checked out '6138deea822cc17e940181fe3d99b6b6aef64551'

6138deea822cc17e940181fe3d99b6b6aef64551 is third most recent commit.

andiwand commented 2 months ago

I think for now it would be fine to smoke test the two libraries. Just picking one input file for each of them and translate it to a random directly checking that it does not fail. For the ref comparison I would try to integrate it into the existing translation mechanism first

ViliusSutkus89 commented 2 months ago

We are way past the smoke test now. Ref testing is already implemented. It's just that a fresh git clone with recursive submodules clones wrong reference outputs and that's why the CI fails.

Just checked and realized that a local git clone also has this behavior, so it's not GitHub actions issue, more of a git issue

ViliusSutkus89 commented 2 months ago

Turns out I have to update git submodules in the root project too

ViliusSutkus89 commented 2 months ago

Another issue:

 ├── pdf2htmlEX
│   ├── pdf
│   │   ├── empty.pdf
│   │   │   ├── document.html ✓
│   │   ├── style-various-1.pdf
Traceback (most recent call last):
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 233, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 223, in main
    result = print_results(results, args.a, args.b)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 184, in print_results
    sub_result = print_results(sub_results,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 184, in print_results
    sub_result = print_results(sub_results,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 184, in print_results
    sub_result = print_results(sub_results,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 160, in print_results
    cmp = future.result()
          ^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 85, in compare
    return compare_files(path_a,
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 49, in compare_files
    return compare_html(a, b, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 33, in compare_html
    diff, (image_a, image_b) = html_render_diff(a, b, browser=browser)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/html_render_diff.py", line 43, in html_render_diff
    image_a = screenshot(browser, to_url(a))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/html_render_diff.py", line 23, in screenshot
    png = body.screenshot_as_png
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 325, in screenshot_as_png
    return b64decode(self.screenshot_as_base64.encode("ascii"))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 314, in screenshot_as_base64
    return self._execute(Command.ELEMENT_SCREENSHOT)["value"]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 394, in _execute
    return self._parent.execute(command, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 347, in execute
    self.error_handler.check_response(response)
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: [Exception... "Failure"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: chrome://remote/content/shared/Capture.sys.mjs :: capture.canvas :: line 137"  data: no]
Stacktrace:
capture.canvas@chrome://remote/content/shared/Capture.sys.mjs:137:62
takeScreenshot@chrome://remote/content/marionette/actors/MarionetteCommandsParent.sys.mjs:292:37

Still no idea what that means

andiwand commented 2 months ago

Not sure if mixing the test file output between integrated and not integrated parts is a good idea. I see that adding the new API is the easiest solution for now but the new tests couple quite tightly to the new api and the existing test repos.

ViliusSutkus89 commented 2 months ago

I could put pdf2htmlEX/wvWare reference outputs in test/data/reference-output/odr-public/output-pdf2htmlEX instead of test/data/reference-output/odr-public/output/pdf2htmlEX and test outputs in build/test/output/odr-public/output-pdf2htmlEX instead of build/test/output/odr-public/output/pdf2htmlEX.

andiwand commented 1 month ago

Let's get this in! I will try to follow up on some of my suggestions here https://github.com/opendocument-app/OpenDocument.core/pull/387 and will revisit the screenshotting there as well