Test framework for sphinx-simplepdf

kreuzberger commented 1 year ago

Is there a chance to implement a basic test framework? I don't know if I should / could take over the ones from the other useblocks repositories "as is".

The PDF output could be tested with one of the Python pdftotext modules available on PyPI, e.g. to count pages, or to get the text from individual pages and check whether some expected text appears.
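A minimal sketch of such a text check, assuming the page texts were already extracted into a list of strings (one per page) by whichever PDF text module gets chosen. The helper names `page_count` and `pages_containing` are hypothetical, not part of any existing library:

```python
from typing import List


def page_count(pages: List[str]) -> int:
    """Number of pages, given one extracted text string per page."""
    return len(pages)


def pages_containing(pages: List[str], expected: str) -> List[int]:
    """Return the 1-based page numbers whose text contains `expected`.

    Whitespace is collapsed first, because extracted PDF text often
    has line breaks in the middle of a sentence.
    """
    needle = " ".join(expected.split())
    hits = []
    for number, text in enumerate(pages, start=1):
        if needle in " ".join(text.split()):
            hits.append(number)
    return hits


# Hypothetical extraction result for a two-page document.
pages = ["Title page\nsphinx-simplepdf demo", "Chapter 1\nHello\nworld"]
assert page_count(pages) == 2
assert pages_containing(pages, "Hello world") == [2]
```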

Implementing a "basic" test would be good, I feel motivated to add more tests :grinning:

danwos commented 1 year ago

I agree, a test framework would be great. But just checking for certain text is not enough for me. I would also like to be able to check the layout, so that the tests cover for instance:

Does a table fit on the page
Is a page break used correctly
Is the used font-size/family/color correct
Is an image scaled correctly

A quick search hasn't found any promising solution for this.

@ubmarco: As PDF miner expert, do you have an idea how this could be achieved?

danwos commented 1 year ago

Maybe a solution would be a pixel-by-pixel comparison with a golden sample that was checked manually once.

There is a PyMuPDF issue that discusses this: https://github.com/pymupdf/PyMuPDF/issues/584

technical concept (idea)

A test-case contains:

Sphinx project, which gets built by simplepdf
A PDF as golden-sample, which was checked once

Pytest-fixtures to:

Build the PDF from the Sphinx-project
Extract the textual content as JSON, so that it can be used for tests
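A rough sketch of what the two fixtures could wrap, under some assumptions: the project is built by calling Sphinx with the `simplepdf` builder, and the page texts come from a separate extraction step that is not shown here. `build_command` and `pages_to_json` are hypothetical names, and the JSON layout is only one possible choice:

```python
import json
import sys
from pathlib import Path
from typing import List


def build_command(srcdir: Path, outdir: Path) -> List[str]:
    """Command line that builds `srcdir` with the simplepdf builder."""
    return [
        sys.executable, "-m", "sphinx",
        "-b", "simplepdf",
        str(srcdir), str(outdir),
    ]


def pages_to_json(pages: List[str]) -> str:
    """Serialize extracted page texts so that tests can load and query them.

    The layout (page_count plus a list of numbered pages) is an assumption.
    """
    data = {
        "page_count": len(pages),
        "pages": [
            {"number": number, "text": text}
            for number, text in enumerate(pages, start=1)
        ],
    }
    return json.dumps(data, indent=2)
```

A pytest fixture could then run `build_command(...)` with `subprocess.run(..., check=True)` inside pytest's `tmp_path`, so every test case builds into its own directory.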

A helper function like compare_pdf(new_pdf, golden_sample), which compares the two PDFs pixel-by-pixel to check for layout problems.
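The comparison step of such a helper could look roughly like this. It is only a sketch: it assumes both PDFs were already rendered to raw RGB pixel data, one image per page, by some renderer (PyMuPDF would be one candidate, that part is left out). `PageImage` and `compare_rendered_pages` are hypothetical names, and the `tolerance` parameter is an addition to absorb tiny anti-aliasing differences:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageImage:
    width: int
    height: int
    rgb: bytes  # width * height * 3 bytes, row by row


def differing_pixels(new: PageImage, golden: PageImage, tolerance: int = 0) -> int:
    """Count pixels where any channel differs by more than `tolerance`."""
    if (new.width, new.height) != (golden.width, golden.height):
        raise ValueError("page sizes differ")
    count = 0
    for i in range(0, len(new.rgb), 3):
        if any(abs(new.rgb[i + c] - golden.rgb[i + c]) > tolerance for c in range(3)):
            count += 1
    return count


def compare_rendered_pages(
    new_pages: List[PageImage],
    golden_pages: List[PageImage],
    tolerance: int = 0,
) -> List[int]:
    """Return the number of differing pixels for each page."""
    if len(new_pages) != len(golden_pages):
        raise ValueError("page counts differ")
    return [
        differing_pixels(new, golden, tolerance)
        for new, golden in zip(new_pages, golden_pages)
    ]
```

A test would then assert that every entry of the returned list is zero, or below some threshold agreed on for the golden sample.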

So in the end, each test case defines its own little project and therefore its own PDF. There is no single PDF file that contains everything for all test cases (like our demo-pdf).

ubmarco commented 1 year ago

I think we should do both:

Read the PDF back into a text representation. With that we could check:

is the text on the pages where it is planned
is the text at the planned location
do tables have the correct values in their cells
do images exist

We could use libpdf for this (a pdfplumber and pdfminer wrapper). This kind of test points directly at where things went wrong. It can also detect whether tables wrapped. Keep in mind that PDFs have no understanding of words, sentences or tables. They just know letters, letter orientation, font and color. Tables are made of lines. So for proper table detection we need to use tables with borders.
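To illustrate the page and location checks, here is a small sketch that works on an already extracted list of text items. It does not use the actual libpdf API. `TextItem`, `locate` and `is_near` are hypothetical, and the coordinates are assumed to be in PDF points:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextItem:
    page: int   # 1-based page number
    x: float    # left edge, in PDF points
    y: float    # top edge, in PDF points
    text: str


def locate(items: List[TextItem], needle: str) -> Optional[TextItem]:
    """Return the first extracted item whose text contains `needle`."""
    for item in items:
        if needle in item.text:
            return item
    return None


def is_near(item: TextItem, page: int, x: float, y: float, max_offset: float = 2.0) -> bool:
    """True if the item is on `page` and within `max_offset` points of (x, y)."""
    return (
        item.page == page
        and abs(item.x - x) <= max_offset
        and abs(item.y - y) <= max_offset
    )


# Hypothetical extraction result.
items = [TextItem(1, 72.0, 100.0, "Intro"), TextItem(2, 72.0, 90.5, "Table 1")]
caption = locate(items, "Table 1")
assert caption is not None and is_near(caption, page=2, x=72.0, y=90.0)
```

The small `max_offset` is there because positions can shift by fractions of a point between renderer versions, which is an assumption about how strict such a check should be.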

Then we'll also need an image comparison to be sure the overall layout is still valid, colors match and to test theme updates. A quick search: perceptualdiff or a home-grown solution.

Getting all needed programs (e.g. pillow) installed on the GitHub node that runs the tests might be a problem.

kreuzberger commented 1 year ago

The text solution would handle most of the test cases I have in mind. Maybe this handling could be used not only for the sphinx-simplepdf internal tests, but also for tests of the real documents produced during a build.

One PDF per test is also OK, but I am not sure if this a) is easy to maintain and b) does not rely too much on the weasyprint version.

Here is the question: the tests should not only run against different Sphinx versions, they should maybe also run against different weasyprint versions. This might also be tricky to handle.

danwos commented 1 year ago

The last point can easily be done with matrix tests, which are supported by GitHub Actions. Sphinx-Needs does this by creating different test-envs based on Python, Sphinx and docutils versions.
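The idea behind a matrix is just the cross product of all version axes. A sketch in Python to make that concrete (in GitHub Actions the same thing would be declared under `strategy.matrix` in the workflow file). `expand_matrix` and `pip_requirements` are hypothetical helpers, and the version numbers are placeholders, not a tested support matrix:

```python
from itertools import product
from typing import Dict, List


def expand_matrix(axes: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return one dict per combination of the given version axes."""
    names = list(axes)
    return [
        dict(zip(names, combo))
        for combo in product(*(axes[name] for name in names))
    ]


def pip_requirements(env: Dict[str, str]) -> List[str]:
    """Turn one combination into pinned pip requirement strings."""
    return [f"{package}=={version}" for package, version in env.items()]


# Placeholder versions, only to show the shape of the matrix.
axes = {"sphinx": ["5.3.0", "6.2.1"], "weasyprint": ["57.2", "59.0"]}
for env in expand_matrix(axes):
    print(pip_requirements(env))
```

Each printed list corresponds to one test environment, so two Sphinx versions and two weasyprint versions give four test runs.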

One PDF per test has the advantage that the tests are isolated from each other and therefore normally easier to maintain.