jalan / pdftotext

Simple PDF text extraction
MIT License
867 stars 99 forks

Pass more arguments to pdftotext #66

Open zufj opened 4 years ago

zufj commented 4 years ago

First of all, thanks for the handy module!

I'd be interested in having access to more of the features offered by pdftotext/xpdf to tune the quality of the extracted text.

As far as I know it is not possible to pass arguments freely to pdftotext but there are a few hardcoded parameters (password, raw).

Would that be something you would be open to add?

I'm not fluent in C++, but it seems I could take inspiration from the existing code to try adding my arguments.

The parameters/options I'm most interested in are nodiag, lineprinter, linespacing and fixed. The full list can be found here: http://www.xpdfreader.com/pdftotext-man.html

jalan commented 4 years ago

This library uses the poppler cpp interface, so you would first have to check if it exposes the functionality you desire.

Ekran commented 4 years ago

The parameters/options I'm most interested in are nodiag, lineprinter, linespacing and fixed. The full list can be found here: http://www.xpdfreader.com/pdftotext-man.html

But the underlying pdftotext is from http://poppler.freedesktop.org! It is a different pdftotext engine!

Options from there:

-layout              : maintain original physical layout
-fixed <fp>          : assume fixed-pitch (or tabular) text
-raw                 : keep strings in content stream order

But I tested it with some PDFs from my bank to read old transactions, and the -layout option worked for me. I assume its output would be the same as what -lineprinter produces.

You can of course call XpdfReader from Python, but then you would not need https://pypi.org/project/pdftotext/.

As jalan wrote, we have to look at the poppler interface. There we find: https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/cpp/poppler-page.cpp#L282 This could be a starting point.

jeanmonet commented 3 years ago

@jalan Hi, I'm using this opened issue to suggest a few additions that would considerably widen the usage scope of this library.

pdftotext (poppler) does seem to expose the following parameters, at least via the command line:

Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -r <fp>              : resolution, in DPI (default is 72)
  -x <int>             : x-coordinate of the crop area top left corner
  -y <int>             : y-coordinate of the crop area top left corner
  -W <int>             : width of crop area in pixels (default is 0)
  -H <int>             : height of crop area in pixels (default is 0)
  -layout              : maintain original physical layout
  -fixed <fp>          : assume fixed-pitch (or tabular) text
  -raw                 : keep strings in content stream order
  -nodiag              : discard diagonal text
  -htmlmeta            : generate a simple HTML file, including the meta information
  -enc <string>        : output text encoding name
  -listenc             : list available encodings
  -eol <string>        : output end-of-line convention (unix, dos, or mac)
  -nopgbrk             : don't insert page breaks between pages
  -bbox                : output bounding box for each word and page size to html.  Sets -htmlmeta
  -bbox-layout         : like -bbox but with extra layout bounding box data.  Sets -htmlmeta
  -cropbox             : use the crop box rather than media box
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)

The interesting ones that are missing but would be very helpful are:

  1. non_raw_non_physical_layout

I believe this would be the command-line equivalent of setting neither -raw nor -layout. The current Python wrapper seems to use the layout parameter by default and only deactivates it when raw=True. But it should be possible to deactivate layout even if raw=False. It would be cool to have a layout parameter: layout=False.

  2. bbox and bbox-layout

Here is the bbox-layout output:

...
<page width="595.000000" height="841.000000">
...
    <flow>
      <block xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
        <line xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
          <word xMin="17.281000" yMin="276.220000" xMax="65.249000" yMax="285.156000">Blabla</word>
          <word xMin="67.465000" yMin="276.220000" xMax="76.793000" yMax="285.156000">blaa</word>
          <word xMin="79.009000" yMin="276.220000" xMax="127.441000" yMax="285.156000">balbla</word>
        </line>
      </block>
    </flow>

This could certainly be imported in Python via a list of tuples, something like this:

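from typing import NamedTuple


class WordBox(NamedTuple):
    # Coordinates are floats because pdftotext reports values such as "17.281000"
    x0: float
    y0: float
    x1: float
    y1: float
    word: str
    flow: int   # dunno what flow really is however
    block: int  # would index blocks in the order they appear. Each word belongs to a block
    line: int   # same for lines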

Ideally, a page object in this case could contain some meta-info about the page (such as dimensions and page number) and the possibility to extract the list of words and their bounding box.

I can certainly extract all this info myself by calling pdftotext via the command line and parsing the output file, but it would be neat to have this machinery inside this Python wrapper. (I'm not proficient with C/C++ so can't help there)
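As a rough illustration of that command-line route, here is a minimal sketch that turns `-bbox-layout` output into a flat list of tuples using only the standard library. The names `parse_words`, `_local` and `SAMPLE` are hypothetical, the tuple fields are a variation on the suggestion above, and the sketch assumes the output is well-formed XML; real pdftotext output wraps the `<doc>` element in an HTML document, which is why namespace prefixes are stripped from tag names.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class WordBox(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    word: str
    page: int   # index of the enclosing <page>, in document order
    block: int  # index of the enclosing <block>
    line: int   # index of the enclosing <line>


def _local(tag: str) -> str:
    """Drop an XML namespace prefix of the form '{namespace-uri}'."""
    return tag.rsplit("}", 1)[-1]


def parse_words(xml_text: str) -> list[WordBox]:
    """Collect every <word> element together with its running page/block/line index."""
    words: list[WordBox] = []
    page = block = line = -1
    # iter() walks the tree in document order, so the counters follow reading order.
    for elem in ET.fromstring(xml_text).iter():
        name = _local(elem.tag)
        if name == "page":
            page += 1
        elif name == "block":
            block += 1
        elif name == "line":
            line += 1
        elif name == "word":
            a = elem.attrib
            words.append(WordBox(float(a["xMin"]), float(a["yMin"]),
                                 float(a["xMax"]), float(a["yMax"]),
                                 elem.text or "", page, block, line))
    return words


# Hand-written sample in the shape shown above (not real pdftotext output).
SAMPLE = """<doc><page width="595.000000" height="841.000000"><flow>
<block xMin="17.281000" yMin="276.220000" xMax="76.793000" yMax="285.156000">
<line xMin="17.281000" yMin="276.220000" xMax="76.793000" yMax="285.156000">
<word xMin="17.281000" yMin="276.220000" xMax="65.249000" yMax="285.156000">Blabla</word>
<word xMin="67.465000" yMin="276.220000" xMax="76.793000" yMax="285.156000">blaa</word>
</line></block></flow></page></doc>"""

words = parse_words(SAMPLE)
```

The indices here are global counters rather than per-page ones, which keeps the sketch short; a per-page reset would be a small change if that suits the use case better.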

jalan commented 3 years ago

I have been meaning to fix the layout regarding non_raw_non_physical_layout. When I created this library, poppler-cpp only exposed two different layouts, so that's what I used. But now it has three. I will make the default layout non_raw_non_physical_layout, to match the CLI tool.

jeanmonet commented 3 years ago

Awesome! Thanks for your work on this library.

Just to give an example, here is how I parse the output of the -bbox-layout option. It works well for my use case and, unless I'm mistaken, captures all of the information:

from bs4 import BeautifulSoup  # needs the beautifulsoup4 and lxml packages


def pdftotext_bbox_parse(content_box: str) -> list[PageWordBox]:
    """
    Given output of `pdftotext -bbox-layout`, parse & retrieve positional information.
    Parses the following kind of output from pdftotext:
        <head>
        </head>
        <body>
        <doc>
        <page width="595.000000" height="841.000000">
            <flow>
            <block xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
                <line xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
                <word xMin="277.060000" yMin="1.890400" xMax="361.276000" yMax="17.439040">Blablaa</word>
                <word xMin="365.380000" yMin="1.890400" xMax="392.454400" yMax="17.439040">blaaa</word>
                <word xMin="396.340000" yMin="1.890400" xMax="427.228480" yMax="17.439040">blaa</word>
                <word xMin="431.140000" yMin="1.890400" xMax="520.033120" yMax="17.439040">blaaaahhh</word>
                </line>
            </block>
            <block xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="54.176040">
                <line xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="26.816040">
            ...
    """
    soup = BeautifulSoup(content_box, features="lxml")
    pages: list[PageWordBox] = []
    idx_page, idx_flow, idx_block, idx_line, idx_word = -1, -1, -1, -1, -1
    for cur_page in soup.find_all("page"):
        idx_page += 1
        page = PageWordBox(n=idx_page, dim=cur_page.attrs)
        pages.append(page)
        for cur_flow in cur_page.find_all("flow"):
            idx_flow += 1
            flow = FlowWordBox(n=idx_flow, page=idx_page)
            page.flows.append(flow)
            for cur_block in cur_flow.find_all("block"):
                idx_block += 1
                block = BlockWordBox(n=idx_block,
                                     flow=idx_flow,
                                     box=Rectangle(*(float(n) for n in cur_block.attrs.values())))
                flow.blocks.append(block)
                page.blocks.append(block)
                for cur_line in cur_block.find_all("line"):
                    idx_line += 1
                    line = LineWordBox(n=idx_line,
                                       block=idx_block,
                                       box=Rectangle(*(float(n) for n in cur_line.attrs.values())))
                    block.lines.append(line)
                    flow.lines.append(line)
                    page.lines.append(line)
                    for cur_word in cur_line.find_all("word"):
                        idx_word += 1
                        word = WordBox(
                            *(float(n) for n in cur_word.attrs.values()),
                            s=cur_word.text,
                            flow=idx_flow,
                            block=idx_block,
                            line=idx_line,
                            n=idx_word)
                        line.words.append(word)
                        block.words.append(word)
                        flow.words.append(word)
                        page.words.append(word)
    return pages

PageWordBox, FlowWordBox, BlockWordBox, LineWordBox, WordBox are just some dataclasses I use to conveniently store the data.
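The definitions of those dataclasses are not shown in the thread. A minimal sketch of what they might look like, inferred only from how the parser above constructs and fills them, is given below; the field names of the coordinates and of `Rectangle` are assumptions, and the real classes may differ.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class Rectangle(NamedTuple):
    # Assumed field order: xMin, yMin, xMax, yMax, as the attributes appear in the output
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class WordBox:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    s: str       # the word text
    flow: int
    block: int
    line: int
    n: int       # running index of the word


@dataclass
class LineWordBox:
    n: int
    block: int
    box: Rectangle
    words: list = field(default_factory=list)


@dataclass
class BlockWordBox:
    n: int
    flow: int
    box: Rectangle
    lines: list = field(default_factory=list)
    words: list = field(default_factory=list)


@dataclass
class FlowWordBox:
    n: int
    page: int
    blocks: list = field(default_factory=list)
    lines: list = field(default_factory=list)
    words: list = field(default_factory=list)


@dataclass
class PageWordBox:
    n: int
    dim: dict    # page attributes such as width and height, kept as strings
    flows: list = field(default_factory=list)
    blocks: list = field(default_factory=list)
    lines: list = field(default_factory=list)
    words: list = field(default_factory=list)


# Small usage example mirroring the constructor calls in the parser.
page = PageWordBox(n=0, dim={"width": "595.000000", "height": "841.000000"})
word = WordBox(*(1.0, 2.0, 3.0, 4.0), s="Blabla", flow=0, block=0, line=0, n=0)
page.words.append(word)
```

Using `field(default_factory=list)` gives every instance its own lists, so words appended to one page do not leak into another.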

jalan commented 3 years ago

I have created #83 to track fixing the layout options. This issue can remain to discuss adding any other options. I am not likely to add more options, as I just want a fast and easy way to get all the text from a PDF, such as for text mining or searching. There are plenty of more featureful PDF libs out there for doing fancier things, like pdfminer, pypdf2, pymupdf, pikepdf, and probably more since last I checked.

jeanmonet commented 3 years ago

Understood, thanks. I'm actually already using PyMuPDF and it's great, but it seems to lack the layout-related options in pdftotext, so for me they complement each other. In case anyone needs it, here is how I read pdftotext output when calling it via the command line:

import subprocess
import tempfile
from pathlib import Path


def pdftotext_cli(path: str | Path, page_num: int | None = None, args: list[str] | None = None) -> str:
    """
    Example usage -> read the second page of PDF and return `-bbox-layout` information
        >>> pdftotext_cli(Path("/path/to/file.pdf"), page_num=2, args=["-bbox-layout"])
    """
    path = Path(path)
    if not path.is_file():  # is_file must be called; the bare method object is always truthy
        raise RuntimeError(f"Given path not a (pdf) file: {path!r}")
    # Restrict conversion to one page by setting both the first (-f) and last (-l) page
    page_arg = ["-f", str(page_num), "-l", str(page_num)] if page_num else []
    args = args or []
    with tempfile.NamedTemporaryFile() as temp_file:
        # Options go before the input and output file names, as in the usage line
        subprocess.run(["pdftotext", *page_arg, *args, str(path.absolute()), temp_file.name],
                       check=True)
        content = temp_file.read().decode()
    return content

ReMiOS commented 3 years ago

I would like the "nodiag" option (and "layout", which is already implemented) in the pdftotext library.

Usage: pdftotext [options] <PDF-file> [<text-file>]
  -nodiag              : discard diagonal text

It seems Poppler already provides this feature:

TextOutputDev.h

    bool discardDiag; // Diagonal text, i.e., text that is not close to one of the
                      // 0, 90, 180, or 270 degree axes, is discarded. This is useful
                      // to skip watermarks drawn on top of body text, etc.

TextOutputDev.cc

    // throw away diagonal chars
    if (discardDiag && diagonal) {
        charPos += nBytes;
        return;
    }
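Until such an option exists in the Python module, one workaround is to call the poppler command-line tool directly. The following is only a sketch under that assumption: the helper names `nodiag_command` and `extract_without_diagonal_text` are hypothetical, and it relies on an installed `pdftotext` binary that supports `-nodiag`, as listed in the usage output earlier in this thread.

```python
import subprocess


def nodiag_command(pdf_path: str) -> list[str]:
    """Build the pdftotext call; "-" as the text file sends the output to stdout."""
    return ["pdftotext", "-nodiag", "-layout", pdf_path, "-"]


def extract_without_diagonal_text(pdf_path: str) -> str:
    """Run pdftotext and return the text with diagonal text (e.g. watermarks) dropped."""
    done = subprocess.run(nodiag_command(pdf_path), check=True, capture_output=True)
    return done.stdout.decode("utf-8", errors="replace")
```

Writing to stdout avoids the temporary file, at the cost of holding the whole output in memory.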

MohammedRakib commented 2 years ago

I might be late, but you guys can install pyxpdf (pip install pyxpdf) for all possible arguments provided by xpdf.

stefan6419846 commented 1 year ago

I am not sure whether it actually makes sense to have such a large request lying around here asking about lots of options, while it is not really clear which of them are already available. Wouldn't it make more sense to track the relevant and missing parts in dedicated, smaller issues?

YasminaFr commented 1 year ago

Hi @jalan, is it possible to retrieve only some pages of the PDF? I don't want to retrieve everything and then filter out the pages I want; I would like to optimize that. Can you tell me if there is a way to do this, please? Something like these poppler parameters:

  -f <int>             : first page to convert
  -l <int>             : last page to convert
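The thread does not say whether the Python module can limit extraction to a page range. With the poppler command-line tool, the `-f` and `-l` options quoted above do this; a minimal sketch follows, in which the helper names `page_range_command` and `extract_pages` are hypothetical and an installed `pdftotext` binary is assumed.

```python
import subprocess
from typing import Optional


def page_range_command(pdf_path: str,
                       first: Optional[int] = None,
                       last: Optional[int] = None) -> list[str]:
    """Build a pdftotext call limited to pages first..last (1-based, inclusive)."""
    if first is not None and first < 1:
        raise ValueError("first must be 1 or greater")
    if first is not None and last is not None and last < first:
        raise ValueError("last must not be smaller than first")
    cmd = ["pdftotext"]
    if first is not None:
        cmd += ["-f", str(first)]
    if last is not None:
        cmd += ["-l", str(last)]
    # "-" as the text file sends the extracted text to stdout
    return cmd + [pdf_path, "-"]


def extract_pages(pdf_path: str,
                  first: Optional[int] = None,
                  last: Optional[int] = None) -> str:
    """Run pdftotext on the requested page range and return the text."""
    done = subprocess.run(page_range_command(pdf_path, first, last),
                          check=True, capture_output=True)
    return done.stdout.decode("utf-8", errors="replace")
```

For example, `extract_pages("doc.pdf", first=2, last=3)` would convert only pages 2 and 3, so the remaining pages are never processed.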

stefan6419846 commented 1 year ago

When already using pymupdf, there should be no need to run pdftotext afterwards in theory (and not even using a temporary file), as pymupdf has native support for this itself: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction

YasminaFr commented 1 year ago

Hi,

I compared pymupdf and pdftotext years ago and found that the text extracted by pdftotext was better than that from pymupdf. That's why I have only used pdftotext for PDF text extraction since then. But I will try again.

Thank you for your help.

Yasmina.

SahibYar commented 1 year ago

I am not likely to add more options, as I just want a fast and easy way to get all the text from a PDF, such as for text mining or searching. There are plenty of more featureful PDF libs out there for doing fancier things, like pdfminer, pypdf2, pymupdf, pikepdf, and probably more since last I checked

You could add this comment to the README as well, for future reference.

ahmed-bhs commented 1 year ago

@Ekran did you find a way to set the layout argument to true? I tried this:

    with open(pdf, "rb") as f:
        pdf = pdftotext.PDF(f, layout=True)

Unfortunately, I got TypeError: 'layout' is an invalid keyword argument for this func

benjamin-awd commented 1 year ago

@ahmed-bhs I think you may want:

pdf = pdftotext.PDF(f, physical=True)

https://poppler.freedesktop.org/api/cpp/classpoppler_1_1page.html
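To put the three layouts discussed in this thread side by side, here is a small sketch. The helper names `layout_kwargs` and `read_pages` are hypothetical; the keyword arguments `raw` and `physical` are the ones mentioned in this thread, and the sketch assumes an installed version of the module that accepts `physical`, since other versions reject unknown keywords with a TypeError as shown above.

```python
def layout_kwargs(mode: str = "default") -> dict:
    """Map a layout name to keyword arguments for pdftotext.PDF (hypothetical helper)."""
    options = {
        "default": {},                     # neither raw nor physical
        "physical": {"physical": True},    # keep the physical page layout
        "raw": {"raw": True},              # keep content stream order
    }
    if mode not in options:
        raise ValueError(f"unknown layout mode: {mode!r}")
    return dict(options[mode])


def read_pages(path: str, mode: str = "default") -> list:
    """Return the text of every page, extracted with the chosen layout."""
    import pdftotext  # imported lazily so layout_kwargs works without the module installed

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f, **layout_kwargs(mode))
        return list(pdf)  # iterating a PDF object yields one string per page
```

Calling `read_pages("file.pdf", mode="physical")` would then correspond to the `physical=True` call suggested above.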

ahmed-bhs commented 1 year ago

Yeah exactly, thank you so much @benjamin-awd