jalan / pdftotext

Simple PDF text extraction
MIT License

Allow use of all three layout options #83

Closed by jalan 3 years ago

jalan commented 3 years ago

Since poppler 0.88, the cpp interface provides three layouts. The new layout, non_raw_non_physical_layout, should be the default.

See https://poppler.freedesktop.org/api/cpp/classpoppler_1_1page.html

jalan commented 3 years ago

Fixed in version 2.2.0

ahmed-bhs commented 1 year ago

@jalan I tried to pass the new layout argument like this:

with open(pdf, "rb") as f:
    pdf = pdftotext.PDF(f, layout=True)

Unfortunately, I got this error: TypeError: 'layout' is an invalid keyword argument for this function

jalan commented 1 year ago

That's not the name of the option.

$ python
>>> import pdftotext
>>> help(pdftotext.PDF)

Help on class PDF in module pdftotext:

class PDF(builtins.object)
 |  PDF(pdf_file, password="", raw=False, physical=False)
 |  
 |  Args:
 |      pdf_file: A file opened for reading in binary mode.
 |      password: Unlocks the document, if required. Either the owner
 |          password or the user password works.
 |      raw: If True, page text is output in the order it appears in the
 |          content stream.
 |      physical: If True, page text is output in the order it appears on
 |          the page, regardless of columns or other layout features.
 |  
 |      Usually, the most readable output is achieved by using the default
 |      mode, rather than raw or physical.
 |  
 |  Example:
 |      with open("doc.pdf", "rb") as f:
 |          pdf = PDF(f)
 |      for page in pdf:
 |          print(page)
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __len__(self, /)
 |      Return len(self).
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.

ahmed-bhs commented 1 year ago

@jalan So there is no way to select the layout with this package?

jalan commented 1 year ago

You have three layout options: raw=True, physical=True, or neither (the default).

I leave it up to you to read the help text I just posted and decide which one you want.
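
For later readers, a minimal sketch of how the three layouts map onto the constructor shown in the help text above. Only pdftotext.PDF and its raw, physical, and password arguments come from that help text; the names LAYOUT_KWARGS, layout_kwargs, and open_pdf are hypothetical helpers made up for this illustration.

```python
# Hypothetical helpers; only pdftotext.PDF and its raw/physical/password
# arguments are taken from the help text posted in this thread.
LAYOUT_KWARGS = {
    "default": {},                   # neither raw nor physical
    "raw": {"raw": True},            # content-stream order
    "physical": {"physical": True},  # order of appearance on the page
}


def layout_kwargs(mode="default"):
    """Translate a layout name into keyword arguments for pdftotext.PDF."""
    if mode not in LAYOUT_KWARGS:
        raise ValueError(
            f"unknown layout {mode!r}; expected one of {sorted(LAYOUT_KWARGS)}"
        )
    return dict(LAYOUT_KWARGS[mode])  # copy so callers cannot mutate the table


def open_pdf(path, mode="default", password=""):
    """Open the PDF at `path` with the chosen layout."""
    # Imported here so layout_kwargs() can be used without pdftotext installed.
    import pdftotext

    with open(path, "rb") as f:
        return pdftotext.PDF(f, password=password, **layout_kwargs(mode))
```

With this sketch, open_pdf("doc.pdf", "physical") amounts to pdftotext.PDF(f, physical=True), and a name such as "layout" is rejected with a ValueError instead of reaching the constructor as an unknown keyword.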

ahmed-bhs commented 1 year ago

The correct choice is physical. I appreciate your assistance. One last question: is there a way to restrict the number of pages read to, for example, the first 5 pages instead of reading the whole PDF?

jalan commented 1 year ago

Just... access the pages you want and don't access the ones you don't want
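
A short sketch of what that could look like. It relies only on the __len__ and __getitem__ methods listed in the help text above; first_pages is a hypothetical helper name, and the assumption that a page's text is only extracted when that page is indexed is inferred from the reply above rather than verified here.

```python
def first_pages(pdf, limit=5):
    """Yield the text of at most `limit` pages, indexing one page at a time."""
    # Uses only len() and integer indexing, so pages past the limit are
    # never touched. Works on any sequence, including a pdftotext.PDF.
    for index in range(min(limit, len(pdf))):
        yield pdf[index]


# Usage with pdftotext (assumes the package is installed and doc.pdf exists):
#
#     import pdftotext
#     with open("doc.pdf", "rb") as f:
#         pdf = pdftotext.PDF(f, physical=True)
#     for page in first_pages(pdf, 5):
#         print(page)
```

If the document has fewer than 5 pages, the helper simply stops at the last page instead of raising an IndexError.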