Closed: MartinThoma closed this issue 2 years ago
Hi there,
I'll have a look at this as soon as I find time. I recently found out that my implementation of the LZW decode algorithm is not correct, so any PDF that uses it will not be handled correctly.
Maybe 2 of your input documents happen to use LZW? (Just speculation here.)
I am a bit surprised by your results. I run a similar test on borb to check its accuracy on text extraction. The test is in the repo and is called test_extract_text_expect_ground_truth. This test runs over more than 600 documents. borb consistently scores around 90% there.
This corpus of PDF documents is also freely available in one of my GitHub repositories (called "pdf-corpus"). It's around 600 PDF documents gathered by Google queries such as "menu filetype:pdf" (and similar with terms such as invoice, book, etc).
I can at least give some hints why the results might be so bad:
The number of pages in a Document can be queried via DocumentInfo. Document has a method get_document_info(). To get the number of pages, simply call:
document.get_document_info().get_number_of_pages()
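To make that concrete: the page count ultimately comes from the page objects in the file. The following is a toy, stdlib-only scan over a hand-made PDF fragment; it is not borb's implementation, just an illustration of where the number comes from.

```python
# Toy illustration of where a PDF page count comes from: each page is an
# object whose dictionary contains /Type /Page. This is NOT borb's code,
# just a naive scan over a hand-made document body.
import re

toy_pdf = (
    b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n"
    b"2 0 obj << /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >> endobj\n"
    b"3 0 obj << /Type /Page /Parent 2 0 R >> endobj\n"
    b"4 0 obj << /Type /Page /Parent 2 0 R >> endobj\n"
)

# Match /Type /Page but not /Type /Pages (negative lookahead on a letter).
page_count = len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", toy_pdf))
print(page_count)  # 2
```

A real parser would of course follow the /Kids tree from the catalog (or read the /Count entry) rather than grep the raw bytes.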
I took the liberty of running your PDF documents through the veraPDF online checker. This tool is often used in the industry to check adherence. You can find it here.
Some small defects are ok (e.g. an image not being tagged, a colour space being poorly defined, etc.). These mostly affect the ability to faithfully render the document in some faraway future. But some of the larger bugs are more problematic, especially if the parser is very strict (which borb is). These bugs typically mess up the tokenization of the document, and are no doubt the cause of some of the exceptions you are getting.
This is the (partial) output for 2201.00214.pdf
The stream keyword shall be followed either by a CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) character sequence or by a single LINE FEED character. The endstream keyword shall be preceded by an EOL marker
This is a problem. As soon as borb hits this object, it will attempt to parse it as a stream and fail, because the object does not follow the tokenization rules for streams.
The object number and generation number shall be separated by a single white-space character. The generation number and obj keyword shall be separated by a single white-space character. The object number and endobj keyword shall each be preceded by an EOL marker. The obj and endobj keywords shall each be followed by an EOL marker.
Again problematic; it means borb is:
This is the partial output for 2201.00201.pdf
The file header line shall be immediately followed by a comment consisting of a % character followed by at least four characters, each of whose encoded byte values shall have a decimal value greater than 127
The stream keyword shall be followed either by a CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) character sequence or by a single LINE FEED character. The endstream keyword shall be preceded by an EOL marker
The value of the Length key specified in the stream dictionary shall match the number of bytes in the file following the LINE FEED character after the stream keyword and preceding the EOL marker before the endstream keyword
This is going to cause issues as well. borb will attempt to read the number of bytes specified in the Length entry, and it will get too many or too few bytes. Typically the information in streams is compressed, which compounds the problem: suddenly you no longer have a valid font file, or a valid cmap.
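A small sketch of this failure mode (my own illustration, not borb's parser): read exactly the declared /Length bytes after the stream keyword, then demand an EOL marker plus endstream, the way a strict tokenizer would.

```python
# Illustration of why a wrong /Length breaks strict stream parsing.
# We read exactly the declared number of bytes after the "stream" keyword,
# then require an EOL marker followed by "endstream".
import re

def read_stream(buf: bytes, declared_length: int) -> bytes:
    # Per the spec, "stream" must be followed by CRLF or a single LF.
    m = re.search(rb"stream(\r\n|\n)", buf)
    if m is None:
        raise ValueError("no stream keyword")
    start = m.end()
    data = buf[start:start + declared_length]
    tail = buf[start + declared_length:]
    # An EOL marker, then "endstream", must follow immediately.
    if not re.match(rb"(\r\n|\r|\n)endstream", tail):
        raise ValueError("endstream not where /Length says it should be")
    return data

good = b"<< /Length 5 >>\nstream\nHELLO\nendstream"
bad  = b"<< /Length 3 >>\nstream\nHELLO\nendstream"

print(read_stream(good, 5))  # b'HELLO'
try:
    read_stream(bad, 3)
except ValueError as e:
    print(e)
```

A lenient parser would instead scan forward for the endstream keyword and ignore the declared length; a strict one, as described above, fails immediately.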
This is the partial output for 2201.00200.pdf
The stream keyword shall be followed either by a CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) character sequence or by a single LINE FEED character. The endstream keyword shall be preceded by an EOL marker
The value of the Length key specified in the stream dictionary shall match the number of bytes in the file following the LINE FEED character after the stream keyword and preceding the EOL marker before the endstream keyword
The file header line shall be immediately followed by a comment consisting of a % character followed by at least four characters, each of whose encoded byte values shall have a decimal value greater than 127
I stopped checking after that. As they say in software engineering: garbage in, garbage out. And it seems your input documents have a hard time actually being PDFs.
Thank you for the feedback!
Those PDFs were generated by pdflatex. 3 of the 13 PDFs cause issues for borb. They are perfectly viewable in all the viewers I have, and all the readers I know can extract something reasonable from them.
While I understand the "garbage in, garbage out" sentiment, the sad truth is that any useful PDF text extraction library needs to deal with non-standard PDFs.
You mentioned that borb is tested against many pdf files. Would you mind giving me a pointer to those?
Hi there,
borb is a very new library. Libraries and viewers such as Adobe's have years and years of experience with opening poorly formatted PDF documents. Adobe in particular is known to be extremely forgiving when opening a PDF.
I've already added some laxness to borb, for instance reading a document with a broken xref. But it's a delicate balance.
You can find my test-corpus here: https://github.com/jorisschellekens/pdf-corpus
Kind regards, Joris Schellekens
I have the impression you don't realize how isolated borb's parsing problem is.
Here is a complete list of viewers / libraries that can display/parse https://arxiv.org/pdf/1601.03642.pdf just fine:
Here is the complete list of viewers that had any issues to display the PDF:
The parsing tool https://demo.verapdf.org/ you mentioned also only does PDF/A validation. PDF/A is a standard built on top of PDF. There are a lot of valid PDF documents which are not valid PDF/A documents.
According to https://www.pdf-online.com/osa/validate.aspx, it is a valid PDF 1.5 file.
I just had a look at a randomly picked example of the corpus and I see several issues in your ground truth:
Whit e Pa pe r on
Int er cult ur al D ialo gue
That is completely broken. If the value from above is expected, my benchmark will (rightfully so) show that borb performs really badly.
I would say for the first part, the ground truth should be:
White Paper on Intercultural Dialogue
Hi,
VeraPDF is meant to check adherence to certain standards within the PDF realm. Including, but not limited to PDF/A.
The errors I presented in your input documents are actually errors against the ISO standard. I already tried explaining that to you in a non-technical way.
I'm quite aware of the fact that borb has limitations when it comes to parsing a pdf document. As mentioned before, I do run a large test-suite.
As far as your particular input documents go, I'll have a look at them when I have time. borb is not my main job, and there aren't any other devs currently working on it.
So congratulations to you, your snark and sarcasm just irritated the only dev capable of helping you.
Or, to put it in terms you may relate to:
List of all people that could have helped you:
List of people that feel even remotely tempted to help you:
Kind regards, Joris
I'm sorry, I didn't want to hurt you / be disrespectful.
You also seem to have gotten the context wrong: I don't use/need borb. I am the maintainer of PyPDF2. I want the Python community to have good tools to interact with PDF. I'm trying to get the PyPDF3 / PyPDF4 projects merged back into PyPDF2, so that users are less confused about what they should use and the software overall just gets better. I'm also trying to wrap my head around whether the Python developers working on PDF files could collaborate more. In that context I created the benchmark and contacted you. I want borb to look as good in the benchmark as possible, and to see if borb does some parts better than other libraries.
I was (and still am) confused by the fact that you closed a bug ticket.
One part I was also thinking about is what results one would actually expect from text extraction of multi-page documents. What is a good ground-truth? Do we have multiple use-cases that require very different extracted texts from the same documents?
The question of how to deal with non-compliant PDFs is also not completely clear. For PyPDF2, we have decided that there are two options:
I currently have other priorities than dealing with non-compliant pdf documents.
For my text-extraction test, I simply build a frequency map of all characters (excluding whitespace, line breaks, etc.) and compare it to a frequency map of the ground truth.
The reasoning behind that is two-fold:
My tests, and their input/output, are completely transparent; their source code can be viewed online. They may not be perfect, but they are a good proxy.
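The comparison described above can be sketched as follows (a reconstruction from the description, not the actual test code; the exact similarity formula is my assumption):

```python
# Sketch of a character-frequency comparison: count all non-whitespace
# characters in both texts, then score the overlap against the ground truth.
from collections import Counter

def char_frequencies(text: str) -> Counter:
    # Whitespace and line breaks are deliberately ignored, so broken
    # word spacing ("Whit e Pa pe r") does not hurt the score.
    return Counter(c for c in text if not c.isspace())

def similarity(extracted: str, ground_truth: str) -> float:
    a, b = char_frequencies(extracted), char_frequencies(ground_truth)
    # Shared character counts divided by the ground-truth total.
    shared = sum(min(a[c], n) for c, n in b.items())
    total = sum(b.values())
    return shared / total if total else 1.0

print(similarity("White Paper", "Whit e Pa pe r"))  # 1.0 (spacing ignored)
print(similarity("White", "Whale"))                 # 0.6
```

Note that such a metric is forgiving about word order and spacing, which is exactly why mis-spaced ground truth like "Whit e Pa pe r" still scores perfectly against "White Paper".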
Interestingly, when I opened your documents by providing a Path rather than a byte array, borb opened all of them without issues (at least the dev branch does).
I closed the ticket because:
Kind regards, Joris
For completeness, this is what I get when I run text extraction against your document:
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def test_against_smaller_corpus(self):
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("/home/joris/Desktop/smaller_corpus/1601.03642.pdf", "rb") as fh:
        PDF.loads(fh, [l])
    print(l.get_text_for_page(0))
output:
/usr/bin/python3.8 /snap/pycharm-community/278/plugins/python-ce/helpers/pycharm/_jb_unittest_runner.py --target test_open_document.TestOpenDocument.test_against_smaller_corpus
Testing started at 15:03 ...
Launching unittests with arguments python -m unittest test_open_document.TestOpenDocument.test_against_smaller_corpus in /home/joris/Code/borb-dev/tests/corpus
Ran 1 test in 4.795s
OK
Process finished with exit code 0
1
Creativity in Machine Learning
Martin Thoma
E-Mail: info@martin-thoma.de
Abstract—Recent machine learning techniques can be modified to produce creative results. Those results did not exist before; it
is not a trivial combination of the data which was fed into the machine learning system. The obtained results come in multiple
forms: As images, as text and as audio.
This paper gives a high level overview of how they are created and gives some examples. It is meant to be a summary of the
current work and give people who are new to machine learning some starting points.
I. I NTRODUCTION
According to [Gad06] creativity is “the ability to use your
imagination to produce new ideas, make things etc.” and
imagination is “the ability to form pictures or ideas in your
mind”.
Recent advances in machine learning produce results which the
author would intuitively call creative. A high-level overview
over several of those algorithms are described in the following.
This paper is structured as follows: Section II introduces the
reader on a very simple and superficial level to machine
learning, Section III gives examples of creativity with images,
Section IV gives examples of machines producing textual
content, and Section V gives examples of machine learning
and music. A discussion follows in Section VI.
II. B ASICS OF MACHINE LEARNING
The traditional approach of solving problems with software
is to program machines to do so. The task is divided in as
simple sub-tasks as possible, the subtasks are analyzed and the
machine is instructed to process the input with human-designed
algorithms to produce the desired output. However, for some
tasks like object recognition this approach is not feasible. There
are way to many different objects, different lighting situations,
variations in rotation and the arrangement of a scene for a
human to think of all of them and model them. But with the
internet, cheap computers, cameras, crowd-sourcing platforms
like Wikipedia and lots of Websites, services like Amazon
Mechanical Turk and several other changes in the past decades
a lot of data has become available. The idea of machine learning
is to make use of this data.
A formal definition of the field of Machine Learning is given
by Tom Mitchel [Mit97]:
A computer program is said to learn from experi-
ence E with respect to some class of tasks T and
performance measure P, if its performance at tasks
in T, as measured by P, improves with experience E.
'
x0
x1
x2
x3
xn
w0
w1
w2
w3
wn . . .
(a) Example of an artificial neuron unit. x
i are the input signals and wi are weights which have to get learned.
Each input signal gets multiplied with its weight, everything gets
summed up and the activation func- tion ' is applied.
(b) A visualization of a simple feed- forward neural network. The 5 in-
put nodes are red, the 2 bias nodes are gray, the 3 hidden units are
green and the single output node is blue.
Fig. 1: Neural networks are based on simple units which get
combined to complex networks.
This means that machine learning programs adjust internal
parameters to fit the data they are given. Those computer
programs are still developed by software developers, but the
developer writes them in a way which makes it possible to
adjust them without having to re-program everything. Machine
learning programs should generally improve when they are fed
with more data.
The field of machine learning is related to statistics. Some
algorithms directly try to find models which are based on well-
known distribution assumptions of the developer, others are
more general.
A common misunderstanding of people who are not related
in this field is that the developers don’t understand what their
machine learning program is doing. It is understood very well
in the sense that the developer, given only a pen, lots of paper
and a calculator could calculate the same result as the machine
does when he gets the same data. And lots of time, of course. It
is not understood in the sense that it is hard to make predictions
how the algorithm behaves without actually trying it. However,
this is similar to expecting from an electrical engineer to
explain how a computer works. The electrical engineer could
probably get the knowledge he needs to do so, but the amount
of time required to understand such a complex system from
basic building blocks is a time-intensive and difficult task.
An important group of machine learning algorithms was
inspired by biological neurons and are thus called artificial
neural networks . Those networks are based on mathematical
functions called artificial neurons which take n 2 N num-
bers x1;:::;x n 2 R as input, multiply them with weights
w1;:::;w n 2 R, add them and apply a so called activation
function ' as visualized in Figure 1(a). One example of such
an activation function is the sigmoid function '(x)= 1 1+e x.
Those functions act as building blocks for more complex
systems as they can be chained and grouped in layers as
visualized in Figure 1(b). The interesting question is how
the parameters wi are learned. This is usually done by an
optimization technique called gradient descent . The gradient
descent algorithm takes a function which has to be derivable,
starts at any point of the surface of this error function and
arXiv:1601.03642v1 [cs.CV] 12 Jan 2016
Describe the bug
borb extracts only Chinese characters from a document that doesn't contain any Chinese characters at all.
To Reproduce
Get this PDF: https://arxiv.org/pdf/1601.03642.pdf and store it as 1601.03642.pdf
Expected behaviour Get this (or something similar): https://github.com/py-pdf/benchmarks/blob/main/read/extraction-ground-truth/1601.03642.txt
Desktop (please complete the following information):
borb==2.0.25
Additional context
I'm running a benchmark for text extraction of various libraries. Currently, the quality and speed of borb are by far the worst: https://github.com/py-pdf/benchmarks#text-extraction-quality