Closed jalan closed 3 years ago
Fixed in version 2.2.0
@jalan I tried to pass the new layout argument like this:
with open(pdf, "rb") as f:
    pdf = pdftotext.PDF(f, layout=True)
Unfortunately, I got TypeError: 'layout' is an invalid keyword argument for this function
That's not the name of the option.
$ python
>>> import pdftotext
>>> help(pdftotext.PDF)
Help on class PDF in module pdftotext:
class PDF(builtins.object)
| PDF(pdf_file, password="", raw=False, physical=False)
|
| Args:
| pdf_file: A file opened for reading in binary mode.
| password: Unlocks the document, if required. Either the owner
| password or the user password works.
| raw: If True, page text is output in the order it appears in the
| content stream.
| physical: If True, page text is output in the order it appears on
| the page, regardless of columns or other layout features.
|
| Usually, the most readable output is achieved by using the default
| mode, rather than raw or physical.
|
| Example:
| with open("doc.pdf", "rb") as f:
| pdf = PDF(f)
| for page in pdf:
| print(page)
|
| Methods defined here:
|
| __getitem__(self, key, /)
| Return self[key].
|
| __init__(self, /, *args, **kwargs)
| Initialize self. See help(type(self)) for accurate signature.
|
| __len__(self, /)
| Return len(self).
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __new__(*args, **kwargs) from builtins.type
| Create and return a new object. See help(type) for accurate signature.
@jalan so is there no way to use a layout argument with this package?
You have three layout options: the default, raw, and physical.
I leave it up to you to read the help text I just posted and decide which one you want.
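For reference, the three modes described in the help text map onto the constructor keywords roughly as sketched below. The helper names layout_kwargs and open_pdf are hypothetical and not part of pdftotext; only the raw and physical keywords come from the help output above.

```python
def layout_kwargs(mode="default"):
    """Translate a mode name into keyword arguments for pdftotext.PDF.

    The mode names are illustrative; the keywords (raw, physical) are the
    ones listed in help(pdftotext.PDF).
    """
    modes = {
        "default": {},                    # neither flag set
        "raw": {"raw": True},             # content-stream order
        "physical": {"physical": True},   # order as laid out on the page
    }
    if mode not in modes:
        raise ValueError(f"unknown layout mode: {mode!r}")
    return modes[mode]


def open_pdf(path, mode="default"):
    """Open a PDF with the chosen mode (hypothetical convenience wrapper)."""
    import pdftotext  # imported here so layout_kwargs works without the package

    with open(path, "rb") as f:
        return pdftotext.PDF(f, **layout_kwargs(mode))
```

With this sketch, open_pdf("doc.pdf", "physical") would be the equivalent of the layout=True call attempted earlier in the thread.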
The correct choice is 'physical', and I appreciate your assistance. One last question: is there a way to restrict the number of pages read to, for example, the first 5 pages instead of reading the whole PDF?
Just... access the pages you want and don't access the ones you don't want
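A minimal sketch of that advice, relying only on the __len__ and __getitem__ methods shown in the help output. The helper name first_pages is hypothetical, and it assumes a page's text is only extracted when that page is indexed, which the reply above implies but the help text does not state.

```python
def first_pages(pdf, count):
    """Return the text of at most `count` leading pages.

    Works with any object that supports len() and integer indexing,
    such as a pdftotext.PDF instance.
    """
    limit = min(count, len(pdf))  # avoid IndexError on short documents
    return [pdf[i] for i in range(limit)]


# Usage with pdftotext (requires the package and a real file):
#
#     import pdftotext
#     with open("doc.pdf", "rb") as f:
#         pdf = pdftotext.PDF(f, physical=True)
#     for text in first_pages(pdf, 5):
#         print(text)
```

Indexing by position is used here instead of slicing because the help output only documents __getitem__ with a single key.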
Since poppler 0.88, the cpp interface provides three layouts. The new layout, non_raw_non_physical_layout, should be the default.
layout kwarg that defaults to False
non_raw_non_physical_layout. On older poppler, default to physical_layout
See https://poppler.freedesktop.org/api/cpp/classpoppler_1_1page.html