jorisschellekens / borb-examples


Markdown to PDF for long document #24

Closed mpat654 closed 1 year ago

mpat654 commented 1 year ago

Hello

Thank you for this great library. I am using the example from the Markdown section in the documentation.

I am trying to convert a markdown article to PDF but am getting a:

AssertionError: BlockFlow is too tall to fit inside column / page.

I assume this is because the layout element is too large for the page. What would you suggest for long Markdown documents? My thought was to add each new paragraph as a new layout element, but I am unsure how to do this. Many thanks

jorisschellekens commented 1 year ago

The problem is (probably, since I am not near a computer to debug it) that BlockFlow is not a splittable LayoutElement. So this thing will always try to perform layout on a single Page.

Since it is too big to fit on a Page, it fails.

You could override my MarkdownToPDF class, and hijack some methods to estimate how tall the element needs to be. Then, when you're at a page boundary, call the methods in PageLayout to switch to a new Page.

mpat654 commented 1 year ago

Thank you for the reply. Overriding classes is at the limit of my Python knowledge, so it is something I will look at in more detail. For now I have split the markdown string by number of lines and combined the resulting documents generated by your existing method. I am sure this is not efficient at all, but I will try to switch to your suggestion in the future.

Many thanks again
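
For readers who want to try the same workaround, here is a minimal sketch of the splitting step only. It is not the code used in this thread: the function name `split_markdown_into_chunks` and the `max_lines` parameter are made up for illustration. Unlike a plain split by line count, it cuts only at blank lines and never inside a fenced code block, so a paragraph or code listing is not torn in half. Each chunk would then be converted on its own with the existing Markdown example.

```python
import typing

# the three-backtick fence marker, built this way so it can sit inside a fenced example
FENCE = "`" * 3


def split_markdown_into_chunks(text: str, max_lines: int) -> typing.List[str]:
    """
    Hypothetical helper: group blank-line-separated blocks of markdown
    into chunks of roughly max_lines lines each.
    A single block longer than max_lines is kept whole.
    """
    # step 1: cut the text into blocks at blank lines, but not inside a code fence
    blocks: typing.List[typing.List[str]] = []
    current: typing.List[str] = []
    inside_fence: bool = False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            inside_fence = not inside_fence
        if line.strip() == "" and not inside_fence:
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(line)
    if current:
        blocks.append(current)

    # step 2: greedily pack blocks into chunks, separated by one blank line
    chunks: typing.List[str] = []
    chunk_lines: typing.List[str] = []
    for block in blocks:
        needed: int = len(block) + (1 if chunk_lines else 0)
        if chunk_lines and len(chunk_lines) + needed > max_lines:
            chunks.append("\n".join(chunk_lines))
            chunk_lines = []
        if chunk_lines:
            chunk_lines.append("")
        chunk_lines.extend(block)
    if chunk_lines:
        chunks.append("\n".join(chunk_lines))
    return chunks
```

The value of `max_lines` has to be found by trial, since the number of source lines says little about the rendered height. That is the main weakness of this workaround compared to letting the layout split the content.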

jorisschellekens commented 1 year ago

I'll have a look later when the holiday season has come to a close.

Can you provide me with an example markdown document that triggers the error?

Kind regards, Joris Schellekens

mpat654 commented 1 year ago

That's greatly appreciated.

Example markdown (generated here https://jaspervdj.be/lorem-markdownum/):

`

Indomitae adspice

Sopita scires

Lorem markdownum! In erat reverti sed: sonent fertque ad Niobe cum o sequitur. Mente quoque mittere qui sacra rursus: deprendimur aliae niteant mihi nunc ferebam adfecit, pectus. Cantus Austro potest Polydori arsit ora atras capillis ad famulus. Multorum neque, protinus, praebentque erexit inque, iusserat.

  1. Excivere volentem dixit sues Achilles Auroram stagnare
  2. Turbaeque novaeque fila
  3. Vidi conlapsamque arvis mirata Erectheus Ismenos diva
  4. Doleret convertunt cecidere obliqua remittant inpiger felicem
  5. Inexpugnabile absit nolle quo ex se referre
  6. Montes genibus tauro quisque potens

Iacentes quid; ante est adspergine finiat: o vetitos ministro Cereris fortibus Haud! Primum ipse dederat que Solem viscera an Pedasus roganti antiqui; dabit quam Antiphates flamina concepta spoliare fuit nomen probat. Remis domumque si quodque percipit virus incoluit invenit quorum, quam inmunibus me habuit nomina at fovet tuo: sinistrae. Hesternos alium maius.

ruby.rfid = flashMetadataFile;
tunneling.device = remoteLeakScroll * 1 + simplex_boot_quad +
        python_vlog_compatible;
if (reciprocal + barTerminalIeee + rootQuery + pciSoft) {
    technologyMegahertz.ipodHard.memoryFi(21, 3,
            sector_macro.editor_infotainment_partition(mouse_page));
    page(install, regular(1, favorites));
}
superscalar -= cssWormWin + rjClient * character_recursive +
        jumperWpaGoogle(pointKindleBezel, ring_isa_lan + -2);
if (web_drive_compression.wepCacheBoot(icio_rdf_media * domain) >
        dual_iphone_file - domain) {
    smtp_card_laser.systrayCrop = clusterCopyrightRay;
    checksumSpyware += documentWinIde.mountFormatDisk(browser_box,
            spyware_nui_sector, checksumHeuristicCard(font));
}

Illis si libat

Finita manibus, trahens flectere pisce Alba Ityn caespite stricto Lelex vento. Rus poscunt dicenti: audit et concubitusque Ausonio fertur limina regionis, ne!

ringWiredNull = nativeRegistryWysiwyg;
ocr_network_bookmark.minimize = ping_cron_express + cd_ivr;
mountain -= error_petaflops / icf.fileGnutellaGuid(broadband, rwIcio, apache
        + systemBoot);

Ponit narrat: non vox propter refugit reddebant dubie te! Et lucemque conlocat labitur, est, ibi vidi, digitoque strepitus patrias verum lacrimas. Lucifer nymphis, et latum neque; lateri precor poteram adusque marmore, Solis roganti et murmure sibi freto numina. Indignere edidicitque poenam regni; cum sive?

Prohibent unus. In et, versis superatus melioribus edita temptatae testatus: dum optat sponsa ab falsaque Echecli.

Ad pulsavere inritata, quoque sim caecos? Pro est questaque regia. Celasse Eurynomus ramis, rex Lycaei hanc manus tumefactum caecisque sibi frustra nullos.`

mpat654 commented 1 year ago

Hello. Just FYI - the same issue happens with long HTML content.

jorisschellekens commented 1 year ago

That doesn't surprise me. Markdown is first converted to HTML. Borb only renders HTML to PDF.

jorisschellekens commented 1 year ago

Fixed!

You will find the fix for this in the next release, but I'll include the code here as well. First we need to make some changes to SingleColumnLayoutWithOverflow. This is the PageLayout that we will use to automatically add converted HTML elements to a Page. Normally this PageLayout can only handle splitting Table objects. But here we'll explicitly add support for splitting BlockFlow objects.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
This implementation of PageLayout adds left/right/top/bottom margins to a Page
and lays out the content on the Page as if there were a single column to flow text, images, etc into.
Once this column is full, the next page is automatically created.
"""

import copy
import typing
from decimal import Decimal

from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf.canvas.layout.layout_element import LayoutElement

class SingleColumnLayoutWithOverflow(SingleColumnLayout):
    """
    This implementation of PageLayout adds left/right/top/bottom margins to a Page
    and lays out the content on the Page as if there were a single column to flow text, images, etc into.
    Once this column is full, the next page is automatically created.
    """

    #
    # CONSTRUCTOR
    #

    def __init__(
        self,
        page: "Page",  # type: ignore [name-defined]
        horizontal_margin: typing.Optional[Decimal] = None,
        vertical_margin: typing.Optional[Decimal] = None,
    ):
        super(SingleColumnLayoutWithOverflow, self).__init__(
            page, horizontal_margin, vertical_margin
        )

    #
    # PRIVATE
    #

    @staticmethod
    def _prepare_table_for_relayout(layout_element: LayoutElement):
        from borb.pdf import Table

        assert isinstance(layout_element, Table)
        layout_element._previous_layout_box = None
        layout_element._previous_paint_box = None
        for tc in layout_element._content:
            tc._previous_layout_box = None
            tc._previous_paint_box = None
            tc._forced_layout_box = None
            tc._layout_element._previous_layout_box = None
            tc._layout_element._previous_paint_box = None

    def _split_table(
        self, layout_element: LayoutElement, available_height: Decimal
    ) -> typing.List[LayoutElement]:

        # find out at which row we ought to split the Table
        from borb.pdf import Table

        assert isinstance(layout_element, Table)
        best_row_for_split: typing.Optional[int] = None
        for i in range(0, layout_element._number_of_rows):
            if any([x._row_span != 1 for x in layout_element._get_cells_at_row(i)]):
                continue
            prev_layout_box: typing.Optional[
                Rectangle
            ] = layout_element._get_cells_at_row(i)[0].get_previous_layout_box()
            assert prev_layout_box is not None
            y: Decimal = prev_layout_box.get_y()
            if y < 0:
                continue
            if y < available_height:
                best_row_for_split = i

        # unable to split
        if best_row_for_split is None:
            assert False, (
                "%s is too tall to fit inside column / page."
                % layout_element.__class__.__name__
            )

        # first half of split
        t0 = copy.deepcopy(layout_element)
        t0._number_of_rows = best_row_for_split + 1
        t0._content = [
            x
            for x in t0._content
            if all([y[0] <= best_row_for_split for y in x._table_coordinates])
        ]
        SingleColumnLayoutWithOverflow._prepare_table_for_relayout(t0)

        # second half of split
        t1 = copy.deepcopy(layout_element)
        t1._number_of_rows = layout_element._number_of_rows - best_row_for_split - 1
        t1._content = [
            x
            for x in t1._content
            if all([y[0] > best_row_for_split for y in x._table_coordinates])
        ]
        for tc in t1._content:
            tc._table_coordinates = [
                (y - best_row_for_split - 1, x) for y, x in tc._table_coordinates
            ]
        SingleColumnLayoutWithOverflow._prepare_table_for_relayout(t1)

        # return
        return [t0, t1]

    def _split_blockflow(
        self, layout_element: LayoutElement, available_height: Decimal
    ) -> typing.List[LayoutElement]:
        from borb.pdf import BlockFlow
        assert isinstance(layout_element, BlockFlow)
        return layout_element._content

    #
    # PUBLIC
    #

    def add(self, layout_element: LayoutElement) -> "PageLayout":  # type: ignore [name-defined]
        """
        This method adds a `LayoutElement` to the current `Page`.
        """

        # anything that isn't a Table or a BlockFlow gets added as expected
        if layout_element.__class__.__name__ not in [
            "BlockFlow",
            "FlexibleColumnWidthTable",
            "FixedColumnWidthTable",
        ]:
            return super(SingleColumnLayout, self).add(layout_element)

        # previous element is used to determine the paragraph spacing
        assert self._page_height is not None
        assert self._page_width is not None
        previous_element_margin_bottom: Decimal = Decimal(0)
        previous_element_y = self._page_height - self._vertical_margin_top
        if self._previous_element is not None:
            previous_element_y = (
                self._previous_element.get_previous_layout_box().get_y()
            )
            previous_element_margin_bottom = self._previous_element.get_margin_bottom()

        # calculate next available height
        available_height: Decimal = (
            previous_element_y
            - self._vertical_margin_bottom
            - self._get_margin_between_elements(self._previous_element, layout_element)
            - max(previous_element_margin_bottom, layout_element.get_margin_top())
            - layout_element.get_margin_bottom()
        )

        # switch to new column if needed
        assert self._page_height
        if available_height < 0:
            self.switch_to_next_column()
            return self.add(layout_element)

        # ask LayoutElement to fit
        lbox: Rectangle = layout_element.get_layout_box(
            Rectangle(
                self._horizontal_margin + layout_element.get_margin_left(),
                Decimal(0),
                self._column_width
                - layout_element.get_margin_right()
                - layout_element.get_margin_left(),
                available_height,
            )
        )
        if lbox.get_height() <= available_height:
            return super(SingleColumnLayout, self).add(layout_element)

        # split Table
        if layout_element.__class__.__name__ in [
            "FlexibleColumnWidthTable",
            "FixedColumnWidthTable",
        ]:
            for t in self._split_table(layout_element, available_height):
                super(SingleColumnLayoutWithOverflow, self).add(t)

        # split BlockFlow
        if layout_element.__class__.__name__ in [
            "BlockFlow",
        ]:
            for t in self._split_blockflow(layout_element, available_height):
                super(SingleColumnLayoutWithOverflow, self).add(t)

        # return
        return self
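
The idea behind `_split_blockflow` can be shown without borb at all. The sketch below is a toy model, not library code: `Leaf`, `Flow` and `ToyLayout` are hypothetical stand-ins, heights are plain integers, and margins are ignored. It shows why handing the children of a container to the layout one by one is enough: each child then goes through the ordinary "does it still fit on this page?" check. One simplification to note: the toy always unwraps a container, whereas the code above only does so when the BlockFlow does not fit in the remaining height.

```python
import typing
from dataclasses import dataclass, field


@dataclass
class Leaf:
    """Hypothetical stand-in for an element that can not be split (e.g. a paragraph)."""

    name: str
    height: int


@dataclass
class Flow:
    """Hypothetical stand-in for a container of elements, such as BlockFlow."""

    children: typing.List[Leaf] = field(default_factory=list)


class ToyLayout:
    """Toy single-column layout that starts a new page when a Leaf does not fit."""

    def __init__(self, page_height: int):
        self.page_height: int = page_height
        self.remaining: int = page_height
        self.pages: typing.List[typing.List[str]] = [[]]

    def _add_leaf(self, leaf: Leaf) -> None:
        # a Leaf taller than an empty page can never be placed
        if leaf.height > self.page_height:
            raise ValueError("%s is too tall to fit inside column / page." % leaf.name)
        # switch to a new page if needed
        if leaf.height > self.remaining:
            self.pages.append([])
            self.remaining = self.page_height
        self.pages[-1].append(leaf.name)
        self.remaining -= leaf.height

    def add(self, element: typing.Union[Leaf, Flow]) -> "ToyLayout":
        if isinstance(element, Flow):
            # the container is replaced by its children
            for child in element.children:
                self._add_leaf(child)
        else:
            self._add_leaf(element)
        return self


layout = ToyLayout(page_height=100)
layout.add(Flow([Leaf("a", 60), Leaf("b", 60), Leaf("c", 30)]))
```

Here the three children add up to 150, more than one page of 100, yet each child fits on a page by itself, so the layout ends up with two pages instead of raising the error.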

Once that's done we can use this PageLayout in html_to_pdf.py (line 1022):

        # PageLayout
        layout: PageLayout = SingleColumnLayoutWithOverflow(page)

And then we can update the test code test_export_markdown_to_pdf.py by adding the following lines:

    def test_document_012(self):
        self._test_document("example-markdown-input-012.md")

I just copy/pasted the example you added here in the issue as example-markdown-input-012.md.

We also need to change the _test_document method:

    def _test_document(self, file_to_convert: str):

        # create output directory if it does not exist yet
        if not self.output_dir.exists():
            self.output_dir.mkdir()

        # read the markdown input
        txt: str = ""
        path_to_markdown = Path(__file__).parent / file_to_convert
        with open(path_to_markdown, "r") as markdown_file_handle:
            txt = markdown_file_handle.read()

        # convert
        document: Document = Document()
        page: Page = Page(
            width=PageSize.A4_PORTRAIT.value[0], height=PageSize.A4_PORTRAIT.value[1]
        )
        document.add_page(page)
        layout: PageLayout = SingleColumnLayoutWithOverflow(
            page, vertical_margin=Decimal(0), horizontal_margin=Decimal(12)
        )
        layout.add(MarkdownToPDF.convert_markdown_to_layout_element(txt))

        # store
        out_file = self.output_dir / (file_to_convert.replace(".md", ".pdf"))
        with open(out_file, "wb") as pdf_file_handle:
            PDF.dumps(pdf_file_handle, document)

        check_pdf_using_validator(out_file)
        compare_visually_to_ground_truth(out_file)

This produces the following output:

example-markdown-input-012.pdf

Kind regards, Joris Schellekens