Open zufj opened 4 years ago
This library uses the poppler cpp interface, so you would first have to check if it exposes the functionality you desire.
The parameters/options I'm most interested in are nodiag, lineprinter, linespacing and fixed. The full list can be found here: http://www.xpdfreader.com/pdftotext-man.html
But the underlying pdftotext is from http://poppler.freedesktop.org! It is a different pdftotext engine!
options from there:
-layout : maintain original physical layout
-fixed <fp> : assume fixed-pitch (or tabular) text
-raw : keep strings in content stream order
But: I tested it with some pdf from my bank, to read old transactions. The option -layout did work for me. I assume it would output the same as -lineprinter would output.
You can of course call the XpdfReader from Python, but then you would not need https://pypi.org/project/pdftotext/.
As jalan wrote, we have to look at the poppler interface. There we find: https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/cpp/poppler-page.cpp#L282 This could be a starting point.
@jalan Hi, I'm using this opened issue to suggest a few additions that would considerably widen the usage scope of this library.
pdftotext (poppler) does seem to expose the following parameters, at least via command-line:
Usage: pdftotext [options] <PDF-file> [<text-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-r <fp> : resolution, in DPI (default is 72)
-x <int> : x-coordinate of the crop area top left corner
-y <int> : y-coordinate of the crop area top left corner
-W <int> : width of crop area in pixels (default is 0)
-H <int> : height of crop area in pixels (default is 0)
-layout : maintain original physical layout
-fixed <fp> : assume fixed-pitch (or tabular) text
-raw : keep strings in content stream order
-nodiag : discard diagonal text
-htmlmeta : generate a simple HTML file, including the meta information
-enc <string> : output text encoding name
-listenc : list available encodings
-eol <string> : output end-of-line convention (unix, dos, or mac)
-nopgbrk : don't insert page breaks between pages
-bbox : output bounding box for each word and page size to html. Sets -htmlmeta
-bbox-layout : like -bbox but with extra layout bounding box data. Sets -htmlmeta
-cropbox : use the crop box rather than media box
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
The interesting ones that are missing but would be very helpful are:
non_raw_non_physical_layout (see here): I believe this would be the equivalent via command line of NOT setting either -raw or -layout. The current Python wrapper seems to use the layout parameter by default, and only deactivate it when raw=True. But there should be the possibility to deactivate layout even if raw=False. It would be cool to have a layout parameter: layout=False.
bbox and bbox-layout
Here is the bbox-layout output:
...
<page width="595.000000" height="841.000000">
...
<flow>
<block xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
<line xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
<word xMin="17.281000" yMin="276.220000" xMax="65.249000" yMax="285.156000">Blabla</word>
<word xMin="67.465000" yMin="276.220000" xMax="76.793000" yMax="285.156000">blaa</word>
<word xMin="79.009000" yMin="276.220000" xMax="127.441000" yMax="285.156000">balbla</word>
</line>
</block>
</flow>
This could certainly be imported in Python via a list of tuples, something like this:
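from typing import NamedTuple

class WordBox(NamedTuple):
    # pdftotext reports coordinates as decimals (e.g. xMin="17.281000"), hence float
    x0: float
    y0: float
    x1: float
    y1: float
    word: str
    flow: int   # dunno what flow really is however
    block: int  # would index blocks in the order they appear. Each word belongs to a block
    line: int   # same for lines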
Ideally, a page object in this case could contain some meta-info about the page (such as dimensions and page number) and the possibility to extract the list of words and their bounding box.
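To make the idea above concrete, here is a minimal sketch that turns one `<page>` element of `-bbox-layout` output into a flat list of word tuples using only the standard library. The names `words_from_page`, `local_name` and `SAMPLE` are made up for this illustration, the tuple is a trimmed variant of the one proposed above, and the sample is a fragment based on the output quoted earlier, not a full pdftotext document.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class WordBox(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    word: str
    block: int  # index of the enclosing <block>, in order of appearance
    line: int   # index of the enclosing <line>, in order of appearance


def local_name(tag: str) -> str:
    """Drop an XML namespace prefix, e.g. '{http://...}word' -> 'word'."""
    return tag.rsplit("}", 1)[-1]


def words_from_page(page_xml: str) -> list[WordBox]:
    """Collect every <word> under a single <page> element (hypothetical helper)."""
    page = ET.fromstring(page_xml)
    words: list[WordBox] = []
    block_idx = line_idx = -1
    # iter() walks the tree in document order, so a <block> or <line>
    # is always seen before the words it contains.
    for elem in page.iter():
        name = local_name(elem.tag)
        if name == "block":
            block_idx += 1
        elif name == "line":
            line_idx += 1
        elif name == "word":
            words.append(WordBox(
                float(elem.get("xMin")), float(elem.get("yMin")),
                float(elem.get("xMax")), float(elem.get("yMax")),
                elem.text or "", block_idx, line_idx))
    return words


SAMPLE = """<page width="595.000000" height="841.000000">
<flow>
<block xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
<line xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
<word xMin="17.281000" yMin="276.220000" xMax="65.249000" yMax="285.156000">Blabla</word>
<word xMin="67.465000" yMin="276.220000" xMax="76.793000" yMax="285.156000">blaa</word>
</line>
</block>
</flow>
</page>"""

words = words_from_page(SAMPLE)
```

A full pdftotext output file wraps the pages in additional markup, so in practice each `<page>` element would have to be located first; that step is left out here.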
I can certainly extract all this info myself by calling pdftotext via the command line and parsing the output file, but it would be neat to have this machinery inside this Python wrapper. (I'm not proficient with C/C++ so can't help there)
I have been meaning to fix the layout regarding non_raw_non_physical_layout. When I created this library, poppler-cpp only exposed two different layouts, so that's what I used. But now it has three. I will make the default layout non_raw_non_physical_layout, to match the CLI tool.
Awesome! Thanks for your work on this library.
Just to give an example of how I parse the output of the -bbox-layout option, which works well for my use case and, unless I'm mistaken, captures all of the information:
from bs4 import BeautifulSoup


def pdftotext_bbox_parse(content_box: str) -> list[PageWordBox]:
    """
    Given output of `pdftotext -bbox-layout`, parse & retrieve positional information.
    Parses the following kind of output from pdftotext:
    <head>
    </head>
    <body>
    <doc>
      <page width="595.000000" height="841.000000">
        <flow>
          <block xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
            <line xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
              <word xMin="277.060000" yMin="1.890400" xMax="361.276000" yMax="17.439040">Blablaa</word>
              <word xMin="365.380000" yMin="1.890400" xMax="392.454400" yMax="17.439040">blaaa</word>
              <word xMin="396.340000" yMin="1.890400" xMax="427.228480" yMax="17.439040">blaa</word>
              <word xMin="431.140000" yMin="1.890400" xMax="520.033120" yMax="17.439040">blaaaahhh</word>
            </line>
          </block>
          <block xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="54.176040">
            <line xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="26.816040">
    ...
    """
    soup = BeautifulSoup(content_box, features="lxml")
    pages: list[PageWordBox] = []
    # Running indices; they keep counting across pages rather than resetting per page
    idx_page, idx_flow, idx_block, idx_line, idx_word = -1, -1, -1, -1, -1
    for cur_page in soup.find_all("page"):
        idx_page += 1
        page = PageWordBox(n=idx_page, dim=cur_page.attrs)
        pages.append(page)
        for cur_flow in cur_page.find_all("flow"):
            idx_flow += 1
            flow = FlowWordBox(n=idx_flow, page=idx_page)
            page.flows.append(flow)
            for cur_block in cur_flow.find_all("block"):
                idx_block += 1
                block = BlockWordBox(n=idx_block,
                                     flow=idx_flow,
                                     box=Rectangle(*(float(n) for n in cur_block.attrs.values())))
                flow.blocks.append(block)
                page.blocks.append(block)
                for cur_line in cur_block.find_all("line"):
                    idx_line += 1
                    line = LineWordBox(n=idx_line,
                                       block=idx_block,
                                       box=Rectangle(*(float(n) for n in cur_line.attrs.values())))
                    block.lines.append(line)
                    flow.lines.append(line)
                    page.lines.append(line)
                    for cur_word in cur_line.find_all("word"):
                        idx_word += 1
                        word = WordBox(
                            *(float(n) for n in cur_word.attrs.values()),
                            s=cur_word.text,
                            flow=idx_flow,
                            block=idx_block,
                            line=idx_line,
                            n=idx_word)
                        # Register the word at every level of the hierarchy
                        line.words.append(word)
                        block.words.append(word)
                        flow.words.append(word)
                        page.words.append(word)
    return pages
PageWordBox, FlowWordBox, BlockWordBox, LineWordBox, WordBox are just some dataclasses I use to conveniently store the data.
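The dataclasses themselves were not posted, so the following is only a guess at what they might look like, inferred from how the parser above constructs them (keyword names, the four positional floats for a word, and the `.flows`/`.blocks`/`.lines`/`.words` lists it appends to). The field names `x_min`, `y_min`, `x_max` and `y_max` are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Rectangle:
    # Assumed order: xMin, yMin, xMax, yMax, as the attributes appear in the output
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class WordBox:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    s: str      # the word text
    flow: int   # index of the enclosing flow
    block: int  # index of the enclosing block
    line: int   # index of the enclosing line
    n: int      # running index of the word itself


@dataclass
class LineWordBox:
    n: int
    block: int
    box: Rectangle
    words: list[WordBox] = field(default_factory=list)


@dataclass
class BlockWordBox:
    n: int
    flow: int
    box: Rectangle
    lines: list[LineWordBox] = field(default_factory=list)
    words: list[WordBox] = field(default_factory=list)


@dataclass
class FlowWordBox:
    n: int
    page: int
    blocks: list[BlockWordBox] = field(default_factory=list)
    lines: list[LineWordBox] = field(default_factory=list)
    words: list[WordBox] = field(default_factory=list)


@dataclass
class PageWordBox:
    n: int
    dim: dict  # raw attributes of <page>, e.g. width and height as strings
    flows: list[FlowWordBox] = field(default_factory=list)
    blocks: list[BlockWordBox] = field(default_factory=list)
    lines: list[LineWordBox] = field(default_factory=list)
    words: list[WordBox] = field(default_factory=list)
```

Using `field(default_factory=list)` gives every instance its own empty list, which matters here because the parser appends to these lists on each new page, flow, block and line.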
I have created #83 to track fixing the layout options. This issue can remain to discuss adding any other options. I am not likely to add more options, as I just want a fast and easy way to get all the text from a PDF, such as for text mining or searching. There are plenty of more featureful PDF libs out there for doing fancier things, like pdfminer, pypdf2, pymupdf, pikepdf, and probably more since last I checked.
Understood, thanks. I'm actually already using PyMuPDF and it's great, but it seems to lack the layout-related options in pdftotext, so for me they complement each other. In case anyone needs it, here is how I read pdftotext output when handled via command line:
import subprocess
import tempfile
from pathlib import Path


def pdftotext_cli(path: str | Path, page_num: int | None = None, args: list[str] | None = None) -> str:
    """
    Example usage -> read the second page of PDF and return `-bbox-layout` information
    >>> pdftotext_cli(Path("/path/to/file.pdf"), page_num=2, args=["-bbox-layout"])
    """
    if isinstance(path, str):
        path = Path(path)
    # is_file is a method; without the call the check is always truthy
    if not path.is_file():
        raise RuntimeError(f"Given path not a (pdf) file: {path!r}")
    # Restrict output to one page by setting both first (-f) and last (-l) page
    page_arg = ["-f", str(page_num), "-l", str(page_num)] if page_num else []
    args = args or []
    with tempfile.NamedTemporaryFile() as temp_file:
        # pdftotext writes to temp_file.name; the still-open handle reads it back
        _ = subprocess.run(["pdftotext", str(path.absolute()), temp_file.name,
                            *page_arg,
                            *args],
                           check=True,)
        content = temp_file.read().decode()
    return content
I would like the "nodiag" option (and "layout", which is already implemented) in the pdftotext library
Usage: pdftotext [options]
It seems Poppler already provides this feature:
TextOutputDev.h
bool discardDiag; // Diagonal text, i.e., text that is not close to one of the
                  // 0, 90, 180, or 270 degree axes, is discarded. This is useful
                  // to skip watermarks drawn on top of body text, etc.
TextOutputDev.cc
// throw away diagonal chars
if (discardDiag && diagonal) {
    charPos += nBytes;
    return;
}
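As a rough illustration of what "not close to one of the 0, 90, 180, or 270 degree axes" means, the sketch below classifies a text direction vector as diagonal or not. This is not Poppler's actual test: the function name `is_diagonal`, the use of a baseline direction `(a, b)`, and the 1 degree tolerance are all assumptions made for the example.

```python
import math


def is_diagonal(a: float, b: float, tolerance_deg: float = 1.0) -> bool:
    """Return True if the direction (a, b) is not within tolerance of an axis.

    (a, b) is assumed to be the direction of the text baseline, e.g. the
    first two entries of a text transformation matrix.
    """
    # Fold the angle into [0, 90) so all four axes map to 0 (or just below 90)
    angle = math.degrees(math.atan2(b, a)) % 90.0
    # Distance to the nearest multiple of 90 degrees
    return min(angle, 90.0 - angle) > tolerance_deg


horizontal = is_diagonal(1.0, 0.0)   # ordinary left-to-right text
vertical = is_diagonal(0.0, 1.0)     # text rotated by 90 degrees
watermark = is_diagonal(1.0, 1.0)    # text rotated by 45 degrees
```

A watermark drawn at 45 degrees would be flagged and dropped, while body text and text rotated by a quarter turn would be kept.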
I might be late, but you guys can install pyxpdf (pip install pyxpdf) for all possible arguments provided by xpdf.
I am not sure whether it actually makes sense to have such a large request lying around here asking about lots of options, while it is not really clear which are actually available already. Wouldn't it make more sense to track the relevant and missing parts in dedicated, smaller issues?
Hi @jalan, is it possible to retrieve only some pages of the PDF? I don't want to retrieve everything and then filter only the pages that I want. I would like to optimize that. Can you tell me if there is a way to do this, please?
Something like those parameters (poppler):
-f
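For the command-line tool, a page range can be selected with the `-f` and `-l` options from the usage text quoted earlier in this thread. The sketch below only shows that route; it does not use the Python wrapper, and the helper names `page_range_command` and `extract_pages` are made up for this example. It assumes the Poppler `pdftotext` binary is on the PATH and relies on `-` as the output file meaning standard output.

```python
from __future__ import annotations

import subprocess


def page_range_command(pdf_path: str, first: int, last: int,
                       extra: list[str] | None = None) -> list[str]:
    """Build a pdftotext command line for pages first..last (1-based, inclusive)."""
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {first}-{last}")
    # "-" as the text file asks pdftotext to write to stdout
    return ["pdftotext", "-f", str(first), "-l", str(last),
            *(extra or []), str(pdf_path), "-"]


def extract_pages(pdf_path: str, first: int, last: int,
                  extra: list[str] | None = None) -> str:
    """Run pdftotext on the given page range and return the extracted text."""
    result = subprocess.run(page_range_command(pdf_path, first, last, extra),
                            check=True, capture_output=True)
    return result.stdout.decode("utf-8")


cmd = page_range_command("report.pdf", 2, 3, extra=["-layout"])
```

Calling `extract_pages("report.pdf", 2, 3)` would then return only the text of pages 2 and 3, where `report.pdf` is a placeholder file name.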
When already using pymupdf, there should be no need to run pdftotext afterwards in theory (and not even using a temporary file), as pymupdf has native support for this itself: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction
Hi,
I compared pymupdf and pdftotext years ago and found that the text extracted by pdftotext was better than pymupdf's. That’s why I have only used pdftotext for PDF text extraction since then. But I will try again.
Thank you for your help.
Yasmina.
I am not likely to add more options, as I just want a fast and easy way to get all the text from a PDF, such as for text mining or searching. There are plenty of more featureful PDF libs out there for doing fancier things, like pdfminer, pypdf2, pymupdf, pikepdf, and probably more since last I checked
You can add this comment to the README as well, for future reference.
@Ekran did you find a solution to set the layout argument to true? I tried this:
with open(pdf, "rb") as f:
    pdf = pdftotext.PDF(f, layout=True)
Unfortunately, I got TypeError: 'layout' is an invalid keyword argument for this func
@ahmed-bhs I think you may want:
pdf = pdftotext.PDF(f, physical=True)
https://poppler.freedesktop.org/api/cpp/classpoppler_1_1page.html
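To summarize the keyword arguments mentioned in this thread (`raw=True` and `physical=True`, with neither set giving the default layout), here is a small sketch that maps a layout name to the keyword arguments for `pdftotext.PDF`. The helper names `layout_kwargs` and `read_pages` and the mode names are made up for this example. `read_pages` needs the third-party pdftotext package and is not exercised here.

```python
def layout_kwargs(layout: str = "default") -> dict:
    """Translate a layout name into keyword arguments for pdftotext.PDF."""
    modes = {
        "default": {},                   # neither flag set
        "raw": {"raw": True},            # content stream order
        "physical": {"physical": True},  # keep the physical page layout
    }
    if layout not in modes:
        raise ValueError(f"unknown layout {layout!r}; expected one of {sorted(modes)}")
    return dict(modes[layout])


def read_pages(path: str, layout: str = "default") -> list:
    """Return the text of every page, using the chosen layout mode."""
    import pdftotext  # third-party; imported lazily so the helper above works without it

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f, **layout_kwargs(layout))
        return list(pdf)


physical_args = layout_kwargs("physical")
```

With this, `read_pages("file.pdf", layout="physical")` corresponds to the `physical=True` call suggested above, where `file.pdf` is a placeholder path.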
Yeah exactly, thank you so much @benjamin-awd
First of all, thanks for the handy module!
I'd be interested in having access to more of the features offered by pdftotext/xpdf to tune the quality of the extracted text.
As far as I know it is not possible to pass arguments freely to pdftotext but there are a few hardcoded parameters (password, raw).
Would that be something you would be open to adding?
I'm not fluent in C++ but it seems that I could get inspiration from the existing code to try to have my arguments in.
The parameters/options I'm most interested in are nodiag, lineprinter, linespacing and fixed. The full list can be found here: http://www.xpdfreader.com/pdftotext-man.html