Closed mpat654 closed 1 year ago
The problem is (probably, since I am not near a computer to debug it) that BlockFlow is not a splittable LayoutElement. So this thing will always try to perform layout on a single Page. Since it is too big to fit on a Page, it fails.
You could subclass my MarkdownToPDF class and override some methods to estimate how tall the element needs to be. Then, when you're at a page boundary, call the methods in PageLayout to switch to a new Page.
Thank you for the reply. Overriding classes is at the limit of my Python knowledge, so it is something I will look at in more detail. For now I have split the markdown string by number of lines, and I combine the resulting documents generated by your existing method. I am sure this is not efficient at all, but I will try to switch to your suggestion in the future.
Many thanks again
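For anyone who wants to try the same workaround, here is a minimal sketch of the splitting step only. It is plain Python and does not touch borb; the function name chunk_markdown and the max_lines parameter are made up for illustration. It assumes paragraphs are separated by blank lines and groups them greedily, so a paragraph is never cut in half.

```python
import typing


def chunk_markdown(text: str, max_lines: int) -> typing.List[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_lines lines."""
    # split on blank lines so that a paragraph is never cut in half
    paragraphs: typing.List[str] = [p for p in text.split("\n\n") if p.strip()]
    chunks: typing.List[str] = []
    current: typing.List[str] = []
    current_lines: int = 0
    for p in paragraphs:
        n: int = p.count("\n") + 1
        # start a new chunk when this paragraph would exceed the line budget
        if current and current_lines + n > max_lines:
            chunks.append("\n\n".join(current))
            current, current_lines = [], 0
        current.append(p)
        current_lines += n
    # flush whatever is left
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be converted on its own. Note that a single paragraph longer than max_lines still ends up in one (oversized) chunk, and a code block containing blank lines may be divided between two chunks.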
I'll have a look later when the holiday season has come to a close.
Can you provide me with an example markdown document that triggers the error?
Kind regards, Joris Schellekens
That's greatly appreciated.
Example markdown (generated here https://jaspervdj.be/lorem-markdownum/):
`
Lorem markdownum! In erat reverti sed: sonent fertque ad Niobe cum o sequitur. Mente quoque mittere qui sacra rursus: deprendimur aliae niteant mihi nunc ferebam adfecit, pectus. Cantus Austro potest Polydori arsit ora atras capillis ad famulus. Multorum neque, protinus, praebentque erexit inque, iusserat.
Iacentes quid; ante est adspergine finiat: o vetitos ministro Cereris fortibus Haud! Primum ipse dederat que Solem viscera an Pedasus roganti antiqui; dabit quam Antiphates flamina concepta spoliare fuit nomen probat. Remis domumque si quodque percipit virus incoluit invenit quorum, quam inmunibus me habuit nomina at fovet tuo: sinistrae. Hesternos alium maius.
ruby.rfid = flashMetadataFile;
tunneling.device = remoteLeakScroll * 1 + simplex_boot_quad +
python_vlog_compatible;
if (reciprocal + barTerminalIeee + rootQuery + pciSoft) {
technologyMegahertz.ipodHard.memoryFi(21, 3,
sector_macro.editor_infotainment_partition(mouse_page));
page(install, regular(1, favorites));
}
superscalar -= cssWormWin + rjClient * character_recursive +
jumperWpaGoogle(pointKindleBezel, ring_isa_lan + -2);
if (web_drive_compression.wepCacheBoot(icio_rdf_media * domain) >
dual_iphone_file - domain) {
smtp_card_laser.systrayCrop = clusterCopyrightRay;
checksumSpyware += documentWinIde.mountFormatDisk(browser_box,
spyware_nui_sector, checksumHeuristicCard(font));
}
Finita manibus, trahens flectere pisce Alba Ityn caespite stricto Lelex vento. Rus poscunt dicenti: audit et concubitusque Ausonio fertur limina regionis, ne!
ringWiredNull = nativeRegistryWysiwyg;
ocr_network_bookmark.minimize = ping_cron_express + cd_ivr;
mountain -= error_petaflops / icf.fileGnutellaGuid(broadband, rwIcio, apache
+ systemBoot);
Ponit narrat: non vox propter refugit reddebant dubie te! Et lucemque conlocat labitur, est, ibi vidi, digitoque strepitus patrias verum lacrimas. Lucifer nymphis, et latum neque; lateri precor poteram adusque marmore, Solis roganti et murmure sibi freto numina. Indignere edidicitque poenam regni; cum sive?
Prohibent unus. In et, versis superatus melioribus edita temptatae testatus: dum optat sponsa ab falsaque Echecli.
Ad pulsavere inritata, quoque sim caecos? Pro est questaque regia. Celasse Eurynomus ramis, rex Lycaei hanc manus tumefactum caecisque sibi frustra nullos.`
Hello. Just FYI - the same issue happens with long HTML content.
That doesn't surprise me. Markdown is first converted to HTML. Borb only renders HTML to PDF.
Fixed!
You will find the fix for this in the next release, but I'll include the code here as well.
First we need to make some changes to SingleColumnLayoutWithOverflow. This is the PageLayout that we will use to automatically add converted HTML elements to a Page. Normally this PageLayout can only handle splitting Table objects. But here we'll explicitly add support for splitting BlockFlow objects.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
This implementation of PageLayout adds left/right/top/bottom margins to a Page
and lays out the content on the Page as if there was a single column to flow text, images, etc into.
Once this column is full, the next page is automatically created.
"""
import copy
import typing
from decimal import Decimal

from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf.canvas.layout.layout_element import LayoutElement


class SingleColumnLayoutWithOverflow(SingleColumnLayout):
    """
    This implementation of PageLayout adds left/right/top/bottom margins to a Page
    and lays out the content on the Page as if there was a single column to flow text, images, etc into.
    Once this column is full, the next page is automatically created.
    """

    #
    # CONSTRUCTOR
    #

    def __init__(
        self,
        page: "Page",  # type: ignore [name-defined]
        horizontal_margin: typing.Optional[Decimal] = None,
        vertical_margin: typing.Optional[Decimal] = None,
    ):
        super(SingleColumnLayoutWithOverflow, self).__init__(
            page, horizontal_margin, vertical_margin
        )

    #
    # PRIVATE
    #

    @staticmethod
    def _prepare_table_for_relayout(layout_element: LayoutElement):
        from borb.pdf import Table

        assert isinstance(layout_element, Table)
        layout_element._previous_layout_box = None
        layout_element._previous_paint_box = None
        for tc in layout_element._content:
            tc._previous_layout_box = None
            tc._previous_paint_box = None
            tc._forced_layout_box = None
            tc._layout_element._previous_layout_box = None
            tc._layout_element._previous_paint_box = None

    def _split_table(
        self, layout_element: LayoutElement, available_height: Decimal
    ) -> typing.List[LayoutElement]:

        # find out at which row we ought to split the Table
        from borb.pdf import Table

        assert isinstance(layout_element, Table)
        best_row_for_split: typing.Optional[int] = None
        for i in range(0, layout_element._number_of_rows):
            if any([x._row_span != 1 for x in layout_element._get_cells_at_row(i)]):
                continue
            prev_layout_box: typing.Optional[
                Rectangle
            ] = layout_element._get_cells_at_row(i)[0].get_previous_layout_box()
            assert prev_layout_box is not None
            y: Decimal = prev_layout_box.get_y()
            if y < 0:
                continue
            if y < available_height:
                best_row_for_split = i

        # unable to split
        if best_row_for_split is None:
            assert False, (
                "%s is too tall to fit inside column / page."
                % layout_element.__class__.__name__
            )

        # first half of split
        t0 = copy.deepcopy(layout_element)
        t0._number_of_rows = best_row_for_split + 1
        t0._content = [
            x
            for x in t0._content
            if all([y[0] <= best_row_for_split for y in x._table_coordinates])
        ]
        SingleColumnLayoutWithOverflow._prepare_table_for_relayout(t0)

        # second half of split
        t1 = copy.deepcopy(layout_element)
        t1._number_of_rows = layout_element._number_of_rows - best_row_for_split - 1
        t1._content = [
            x
            for x in t1._content
            if all([y[0] > best_row_for_split for y in x._table_coordinates])
        ]
        for tc in t1._content:
            tc._table_coordinates = [
                (y - best_row_for_split - 1, x) for y, x in tc._table_coordinates
            ]
        SingleColumnLayoutWithOverflow._prepare_table_for_relayout(t1)

        # return
        return [t0, t1]

    def _split_blockflow(
        self, layout_element: LayoutElement, available_height: Decimal
    ) -> typing.List[LayoutElement]:
        from borb.pdf import BlockFlow

        assert isinstance(layout_element, BlockFlow)
        # a BlockFlow is split into its child elements, which are then laid out one by one
        return layout_element._content

    #
    # PUBLIC
    #

    def add(self, layout_element: LayoutElement) -> "PageLayout":  # type: ignore [name-defined]
        """
        This method adds a `LayoutElement` to the current `Page`.
        """

        # anything that isn't a Table or a BlockFlow gets added as expected
        if layout_element.__class__.__name__ not in [
            "BlockFlow",
            "FlexibleColumnWidthTable",
            "FixedColumnWidthTable",
        ]:
            return super(SingleColumnLayout, self).add(layout_element)

        # previous element is used to determine the paragraph spacing
        assert self._page_height is not None
        assert self._page_width is not None
        previous_element_margin_bottom: Decimal = Decimal(0)
        previous_element_y = self._page_height - self._vertical_margin_top
        if self._previous_element is not None:
            previous_element_y = (
                self._previous_element.get_previous_layout_box().get_y()
            )
            previous_element_margin_bottom = self._previous_element.get_margin_bottom()

        # calculate next available height
        available_height: Decimal = (
            previous_element_y
            - self._vertical_margin_bottom
            - self._get_margin_between_elements(self._previous_element, layout_element)
            - max(previous_element_margin_bottom, layout_element.get_margin_top())
            - layout_element.get_margin_bottom()
        )

        # switch to new column if needed
        assert self._page_height
        if available_height < 0:
            self.switch_to_next_column()
            return self.add(layout_element)

        # ask LayoutElement to fit
        lbox: Rectangle = layout_element.get_layout_box(
            Rectangle(
                self._horizontal_margin + layout_element.get_margin_left(),
                Decimal(0),
                self._column_width
                - layout_element.get_margin_right()
                - layout_element.get_margin_left(),
                available_height,
            )
        )
        if lbox.get_height() <= available_height:
            return super(SingleColumnLayout, self).add(layout_element)

        # split Table
        if layout_element.__class__.__name__ in [
            "FlexibleColumnWidthTable",
            "FixedColumnWidthTable",
        ]:
            for t in self._split_table(layout_element, available_height):
                super(SingleColumnLayoutWithOverflow, self).add(t)

        # split BlockFlow
        if layout_element.__class__.__name__ in [
            "BlockFlow",
        ]:
            for t in self._split_blockflow(layout_element, available_height):
                super(SingleColumnLayoutWithOverflow, self).add(t)

        # return
        return self
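To make the idea easier to follow without a borb install, here is a toy model of the same strategy. It is only a sketch under simplifying assumptions, not borb code: Box and layout are hypothetical names, heights are plain integers, and margins are ignored. A container that does not fit in the remaining space is replaced by its children (which is what returning the BlockFlow's _content amounts to), and a leaf that does not fit is pushed to a fresh page.

```python
import typing
from dataclasses import dataclass, field


@dataclass
class Box:
    """Hypothetical stand-in for a LayoutElement: a name, a height, optional children."""

    name: str
    height: int = 0
    children: typing.List["Box"] = field(default_factory=list)

    def total_height(self) -> int:
        # a container is as tall as the sum of its children
        if self.children:
            return sum(c.total_height() for c in self.children)
        return self.height


def layout(boxes: typing.List[Box], page_height: int) -> typing.List[typing.List[str]]:
    """Place boxes top to bottom and return the names placed on each page."""
    pages: typing.List[typing.List[str]] = [[]]
    remaining: int = page_height
    queue: typing.List[Box] = list(boxes)
    while queue:
        box = queue.pop(0)
        h = box.total_height()
        if h <= remaining:
            # fits in the space left on the current page
            pages[-1].append(box.name)
            remaining -= h
        elif box.children:
            # container does not fit: lay out its children one by one instead
            queue = box.children + queue
        elif h > page_height:
            # a leaf taller than an empty page can never be placed
            raise ValueError("%s is too tall to fit inside a page" % box.name)
        else:
            # leaf does not fit here: open a new page and retry
            pages.append([])
            remaining = page_height
            queue.insert(0, box)
    return pages
```

For example, a container holding three paragraphs of height 40 on pages of height 100 ends up with two paragraphs on the first page and one on the second, instead of failing because the container as a whole is 120 tall.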
Once that's done we can use this PageLayout in html_to_pdf.py (line 1022):
# PageLayout
layout: PageLayout = SingleColumnLayoutWithOverflow(page)
And then we can update the test code test_export_markdown_to_pdf.py by adding the following lines:
    def test_document_012(self):
        self._test_document("example-markdown-input-012.md")
I just copy/pasted the example you added here in the issue as example-markdown-input-012.md.
We also need to change the _test_document method:
    def _test_document(self, file_to_convert: str):

        # create output directory if it does not exist yet
        if not self.output_dir.exists():
            self.output_dir.mkdir()

        # read the markdown input
        txt: str = ""
        path_to_json = Path(__file__).parent / file_to_convert
        with open(path_to_json, "r") as json_file_handle:
            txt = json_file_handle.read()

        # convert
        document: Document = Document()
        page: Page = Page(
            width=PageSize.A4_PORTRAIT.value[0], height=PageSize.A4_PORTRAIT.value[1]
        )
        document.add_page(page)
        layout: PageLayout = SingleColumnLayoutWithOverflow(
            page, vertical_margin=Decimal(0), horizontal_margin=Decimal(12)
        )
        layout.add(MarkdownToPDF.convert_markdown_to_layout_element(txt))

        # store
        out_file = self.output_dir / (file_to_convert.replace(".md", ".pdf"))
        with open(out_file, "wb") as pdf_file_handle:
            PDF.dumps(pdf_file_handle, document)
        check_pdf_using_validator(out_file)
        compare_visually_to_ground_truth(out_file)
This produces the following output:
example-markdown-input-012.pdf
Kind regards, Joris Schellekens
Hello
Thank you for this great library. I am using the example from the Markdown section in the documentation.
I am trying to convert a markdown article to PDF but am getting an:
AssertionError: BlockFlow is too tall to fit inside column / page.
I assume this is because the layout element is too large for the page. What would you suggest for long Markdown documents? My thought was to add each new paragraph as a new layout element, but I am unsure how to do this. Many thanks