simonhaenisch / md-to-pdf

Hackable CLI tool for converting Markdown files to PDF using Node.js and headless Chrome.
https://www.npmjs.com/md-to-pdf
MIT License
1.11k stars · 108 forks

feature: Test generated PDF files #68

Closed · jooola closed this issue 2 years ago

jooola commented 4 years ago

Problem:

Currently the library does not test the content of the generated PDF files (unless I have missed it); it only tests whether the file exists.

Solution:

In order to harden the PDF generation testing, I propose to check for patterns within the generated PDF file, such as the file magic number, or even whether certain objects, such as pictures or links, are present in the generated file.

I created some tests in another tool that try to find a pattern in a PDF file. They still need some improvements, such as support for multi-line patterns, but feel free to look at them to get an idea:

https://github.com/jooola/markdown-to-pdf/blob/master/spec/helpers.js#L18-L25
https://github.com/jooola/markdown-to-pdf/blob/master/spec/helpers.js#L27-L40
https://github.com/jooola/markdown-to-pdf/blob/master/spec/markdown-to-pdf-spec.js#L49-L55
https://github.com/jooola/markdown-to-pdf/blob/master/spec/markdown-to-pdf-spec.js#L76-L89
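The proposed checks can be sketched in a few lines of Node. This is a minimal illustration, not code from md-to-pdf or the linked repo; the helper names are made up, and the naive byte search only finds patterns in uncompressed objects (Chromium compresses content streams with FlateDecode):

```javascript
// Hypothetical helpers sketching the proposed checks (not part of md-to-pdf).

// A valid PDF starts with the magic bytes "%PDF-".
const isPdf = (buffer) => buffer.slice(0, 5).toString('latin1') === '%PDF-';

// Naive content check: look for a byte pattern anywhere in the file.
// Only sees text in uncompressed objects, not inside compressed streams.
const pdfContains = (buffer, pattern) =>
  buffer.includes(Buffer.from(pattern, 'latin1'));
```

A test could then read the generated file with `fs.readFileSync` and assert, for example, `pdfContains(buf, '/URI')` to check that at least one link annotation made it into the PDF.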

simonhaenisch commented 4 years ago

Yup, you're right that it's just testing whether the files have been generated, but not whether their contents are correct. Maybe an md5 checksum check would already be sufficient, to make sure that after a code change the resulting PDFs are still the same (unless a change is supposed to change the result, in which case the checksum would need to be updated after manually verifying that the PDF is correct).

Another approach could be visual diff testing with screenshots... it would be possible to use page.screenshot for that, but that would happen before the page becomes a PDF. I'm pretty sure generating PNGs from the PDF is also possible, though.

The markdown file I use for the basic tests also only exercises a handful of the available options, so there are a lot of tests that could be added (e.g. for the different PDF options, css/stylesheets, body class, etc.).

Anyway, I'm kind of relying on the upstream tools to have good test coverage, and for the most part the current tests have been sufficient, so I've also just been relying on people to report bugs. Quite a few bugs were related to cross-platform issues which I didn't run into myself because I'm too lazy to set up a Windows VM, but I've been able to deal with them since I'm using the GitHub Actions matrix testing.

TBH I don't think adding a file magic number test is very useful because that's something that the puppeteer tests should take care of (it's not my responsibility that the file content generated by page.pdf() is actually a pdf file). I like the idea of checking for existence/correctness of certain objects within the PDF though. However I don't have much capacity to work on md-to-pdf in the next 2 months or so. Feel free to open a PR if this is important to you.

jooola commented 4 years ago

> Maybe a md5 checksum check would already be sufficient, to make sure that after a code change the resulting pdfs are still the same (unless a change is supposed to change the result, in which case the checksum would need to be updated after manually making sure that the PDF is correct).

Sadly that's not possible: the PDF includes the generation date.

Try inspecting the start of the PDF file and you will see what I mean:

```
cat file.pdf | head -n 15
```

Here is a sample:

```
$ cat README.pdf | head -n 15
%PDF-1.4
%����
1 0 obj
<</Creator (Chromium)
/Producer (Skia/PDF m80)
/CreationDate (D:20200311143133+00'00')
/ModDate (D:20200311143133+00'00')>>
endobj
3 0 obj
<</ca 1
/BM /Normal>>
endobj
7 0 obj
<</Filter /FlateDecode
/Length 1534>> stream
...
```

> Another approach could be visual diff testing with screenshots... it would be possible to use page.screenshot for that, but that would be before the page becomes a PDF. Generating PNGs from the PDF is also possible though, I'm pretty sure.

I feel this one would generate too many false positives. Also, the tests should be able to run in CI, so comparing two images pixel-for-pixel is really strict; we would need software that compares two pictures and allows a percentage of similarity.
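A similarity-percentage check can be sketched in plain Node over two decoded images of the same size (raw byte buffers here; in practice one would decode the PNGs first and likely use a dedicated library such as pixelmatch, which supports a threshold option):

```javascript
// Fraction of identical bytes between two equal-length raw image
// buffers; 1.0 means identical. A sketch, not code from md-to-pdf.
const similarity = (a, b) => {
  if (a.length !== b.length) return 0;
  let same = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) same++;
  }
  return same / a.length;
};
```

A CI test could then assert `similarity(expected, actual) > 0.99` instead of exact equality, tolerating small antialiasing differences between platforms.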

> Anyway, I'm kind of relying on the upstream tools to have good test coverage, and for the most part the current tests have been sufficient, so I've also just been relying on people to report bugs. Quite a few bugs were regarding cross-platform issues which I didn't run into myself because I'm too lazy to set up a Windows VM, but I've been able to deal with this since I'm using the Github Actions matrix testing.

That's what I feel is the right thing to do: we should create generic tests and let CI handle the environment.

> TBH I don't think adding a file magic number test is very useful because that's something that the puppeteer tests should take care of (it's not my responsibility that the file content generated by page.pdf() is actually a pdf file). I like the idea of checking for existence/correctness of certain objects within the PDF though. However I don't have much capacity to work on md-to-pdf in the next 2 months or so. Feel free to open a PR if this is important to you.

I think testing for links and picture sizes is already enough. I trust puppeteer and marked to work as expected, but we need to test what this package handles, mostly file paths and styles.

Also, having extra e2e tests is always useful, as long as they don't generate false positives.

I might open a PR once I've replaced my tool's underlying library with yours.

simonhaenisch commented 4 years ago

> find a software that allow comparing 2 picture and allow a percentage of similarity.

Yup, that's what I was referring to; see e.g. https://stenciljs.com/docs/screenshot-visual-diff, which Ionic uses heavily (I think it's built on top of Jest's snapshot testing, but I'm not sure).

simonhaenisch commented 2 years ago

Closing this because there are now some basic tests that partially check the PDF content.