mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Selected text which appears to be normal has spaces occasionally missing. #1883

Closed dwhly closed 8 years ago

dwhly commented 12 years ago

Using the following PDF as reference....

http://www.nature.com/nature/journal/v486/n7402/pdf/nature11234.pdf

Selecting the abstract text and doing a copy paste into a text editor results in a string with spaces occasionally missing on whole lines... for example:

"Studies of the human microbiome have revealed that even healthy individuals differ remarkably in the microbes that occupy habitats such as the gut, skin and vagina. Much of this diversity remains unexplained, although diet, environment, host genetics and early microbial exposure have all been implicated. Accordingly, to characterize the ecology of human-associated microbial communities, the Human Microbiome Project has analysed the largest cohort and set of distinct, clinically relevant body habitats so far. We found the diversity and abundance of each habitat’s signaturemicrobestovarywidelyevenamonghealthysubjects,withstrongnichespecializationbothwithinandamong individuals. The project encountered an estimated 81–99% of the genera, enzyme families and community configurations occupied by the healthy Western microbiome. Metagenomic carriage of metabolic pathways was stable among individuals despite variation in community structure, and ethnic/racial background proved to be one of the strongest associations of both pathways and microbes with clinical metadata. These results thus delineate the range ofstructuralandfunctionalconfigurationsnormalinthemicrobialcommunitiesofahealthypopulation,enablingfuture characterization of the epidemiology, ecology and translational applications of the human microbiome

gigaherz commented 12 years ago

Depending on how the PDF's contents are stored internally, this may be because the internal string contains no space character at all, just two letters with larger spacing than usual. In that case, it may be difficult to decide what is a space and what is normal kerning, other than by using something like a k-means clustering algorithm with two clusters to separate the spacings into those two categories.
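
As a rough illustration of that idea, here is a minimal sketch of one-dimensional k-means with two clusters applied to the gaps between glyphs on a single line. The glyph shape (`char`, `x`, `width`) and the helper names `classifyGaps` and `joinGlyphs` are assumptions made up for this example; this is not how pdf.js represents text or detects spaces.

```javascript
// Hypothetical helpers, not part of pdf.js.
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Split gap widths into "kerning" and "word space" with 1-D k-means, k = 2.
function classifyGaps(gaps, maxIterations = 50) {
  if (gaps.length === 0) return { threshold: NaN, isSpace: [] };
  // Seed the two centroids with the smallest and largest gap.
  let low = Math.min(...gaps);
  let high = Math.max(...gaps);
  if (low === high) {
    // Only one distinct gap size, so there is nothing to separate.
    return { threshold: NaN, isSpace: gaps.map(() => false) };
  }
  for (let i = 0; i < maxIterations; i++) {
    // In one dimension, assigning to the nearest centroid is a midpoint split.
    const mid = (low + high) / 2;
    const newLow = mean(gaps.filter((g) => g <= mid));
    const newHigh = mean(gaps.filter((g) => g > mid));
    if (newLow === low && newHigh === high) break; // converged
    low = newLow;
    high = newHigh;
  }
  const threshold = (low + high) / 2;
  return { threshold, isSpace: gaps.map((g) => g > threshold) };
}

// glyphs: [{ char, x, width }], ordered left to right on one line.
function joinGlyphs(glyphs) {
  const gaps = [];
  for (let i = 1; i < glyphs.length; i++) {
    const prev = glyphs[i - 1];
    gaps.push(glyphs[i].x - (prev.x + prev.width));
  }
  const { isSpace } = classifyGaps(gaps);
  let text = glyphs.length ? glyphs[0].char : "";
  for (let i = 1; i < glyphs.length; i++) {
    text += (isSpace[i - 1] ? " " : "") + glyphs[i].char;
  }
  return text;
}
```

One caveat with this sketch: two clusters will always split the gaps whenever there are at least two distinct gap sizes, so a line that genuinely contains no spaces would still get some inserted. A real implementation would probably also need a sanity check, for example comparing the threshold against the font size.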

dwhly commented 12 years ago

So the information known about the original space when the PDF was created may have been destroyed-- or reduced to subtle spacing data. And yet, recovering the information that there's likely to be a space here is incredibly important for text selection, annotation (anchoring) and copy / paste actions.

Is this a problem that has been effectively solved elsewhere? (Adobe reader, Chrome's internal PDF reader, etc.)

D

gigaherz commented 12 years ago

Yes, most PDF applications do a substantial amount of heuristic processing to recover the original text semantics from the internal data: words, paragraphs, tables, and so on. It requires effort in code and experimentation, though. This is an open-source project, so you could decide to contribute this if you are really interested in it and have the knowledge or the will to learn it. Otherwise, someone else will do it... someday.

dwhly commented 12 years ago

David, we'd love to help with this particular area, as getting good text selection back out of PDFs is essential for many use cases. I'm wondering if you know of any good literature around this area. Also-- we'd be willing to sponsor developers to fix this particular issue.

gigaherz commented 12 years ago

No, I can't say I know much about the internals of PDFs; I only know some basic concepts from having been around the project for a while, reading issues and pull requests.

You may want to ask around on IRC (irc.mozilla.org, channel #pdfjs) or on the newsgroup (mozilla.dev.pdf-js)

tilgovi commented 12 years ago

Instead, we could explore options like headless pdf.js and annotator server-side with OCRopus to generate annotations, anchored to DOM, and publish them into an annotation namespace on the web, allowing updates to be published and, conceivably, even consumed back into ocropus training models.

What is still of interest to me is whether pdf.js works (or can be made to work) with node-canvas, phantomjs or things like these. Other snags and bergschrunds lie, of course, in getting the annotation anchoring robust enough to be portable between the clients (even headless ones), which is back in annotator land.

None of this is this ticket, though. So... I think you could probably re-close it.

jviereck commented 12 years ago

There is some work under way to make text selection better than it is right now. The main goal is good text search, and along the way some of the internals are being redone, which will hopefully make text selection better as well.

The idea of using OCR is very cool. However, we can't spend a few seconds doing OCR on the rendered text, as that would be far too performance-intensive (think of mobile devices...). Therefore, we are forced to do a simple analysis of the structure of the document. If the content is not well structured (following the logical order of the document), it will be hard to make the text extraction work very well.

@dwhly, I don't have the time to figure out exactly which line is missing in the document. Could you be somewhat more precise about which line is missing (in which paragraph, which line, what are the first words of the missing lines)? Would love to take a look at why the text selection is not working out there.

dwhly commented 12 years ago

Julian,

Sure. I just selected the opening abstract of the article, and did a copy and a paste here. You'll note of course that the concatenated words are in both lines where the kerning is particularly tight. I think this is quite representative of the problem. Solve it for this example, and hopefully it will address many others, particularly since this is such a standard, recent and well formed PDF from a major journal.


"Studies of the human microbiome have revealed that even healthy individuals differ remarkably in the microbes that occupy habitats such as the gut, skin and vagina. Much of this diversity remains unexplained, although diet, environment, host genetics and early microbial exposure have all been implicated. Accordingly, to characterize the ecology of human-associated microbial communities, the Human Microbiome Project has analysed the largest cohort and set of distinct, clinically relevant body habitats so far. We found the diversity and abundance of each habitat’s signaturemicrobestovarywidelyevenamonghealthysubjects,withstrongnichespecializationbothwithinandamong individuals. The project encountered an estimated 81–99% of the genera, enzyme families and community configurations occupied by the healthy Western microbiome. Metagenomic carriage of metabolic pathways was stable among individuals despite variation in community structure, and ethnic/racial background proved to be one of the strongest associations of both pathways and microbes with clinical metadata. These results thus delineate the range ofstructuralandfunctionalconfigurationsnormalinthemicrobialcommunitiesofahealthypopulation,enablingfuture characterization of the epidemiology, ecology and translational applications of the human microbiome."

dwhly commented 12 years ago

Just to be clear. There are no missing lines, only lines where when selecting, all words are concatenated.

jviereck commented 12 years ago

> Just to be clear. There are no missing lines, only lines where when selecting, all words are concatenated.

Thanks for that clarification! I think that is also the case somewhere in the Trace-Monkey paper, on the first or second page. The space detection algorithm used for the TextLayer is somewhat buggy. The search extraction algorithm, which is very similar, gets things right. Hopefully this will be resolved by merging the two algorithms!

dwhly commented 12 years ago

@jviereck thanks for this. Any insight into the priority of merging the algorithms? This is the top issue for us in terms of our ability to effectively use PDF.js as a means of enabling the annotation of PDFs. I would imagine that it's a fairly important usability issue for anyone wanting to do cut and paste actions from PDF text.

Also, is there already an issue assigned to this that I could follow? And what is the trace-monkey paper?

jviereck commented 12 years ago

I can tell you that search has a very high priority (it's one of the features blocking pdf.js from being enabled by default in Firefox), but I can't tell you exactly how high the priority of merging the algorithms is.

edsu commented 11 years ago

FWIW, I just tried the PDF that @dwhly mentioned above with the online demo and the copy and paste seems to be improved, but not perfect:

Studies of the human microbiome have revealed that even healthy individuals differ remarkably in the microbes thatoccupy habitats such as the gut, skin and vagina. Much of this diversity remains unexplained, although diet,environment, host genetics and early microbial exposure have all been implicated. Accordingly, to characterize theecology of human-associated microbial communities, the Human Microbiome Project has analysed the largest cohortand set of distinct, clinically relevant body habitats so far. We found the diversity and abundance of each habitat’ssignature microbes to vary widely even among healthy subjects, with strong niche specialization both within and amongindividuals. The project encountered an estimated 81–99% of the genera, enzyme families and communityconfigurations occupied by the healthy Western microbiome. Metagenomic carriage of metabolic pathways wasstable among individuals despite variation in community structure, and ethnic/racial background proved to be one ofthe strongest associations of both pathways and microbes with clinical metadata. These results thus delineate the rangeof structural and functional configurations normal in the microbial communities of a healthy population, enabling futurecharacterization of the epidemiology, ecology and translational applications of the human microbiome.

It seems that most of the errors are now right on end-of-line boundaries? It's interesting that these particular spots came out correctly in the copy/paste text that @dwhly provided. So perhaps the select/search algorithm merge (if it happened) stomped on something that was working better in the previous select algorithm?
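
To illustrate the kind of rule that seems to be missing at line boundaries, here is a minimal sketch that joins already-extracted lines into one paragraph. It assumes each line arrives as a plain string with correct spacing inside the line, which is an assumption for the example and not how the pdf.js text layer is structured. The helper name `joinLines` is made up.

```javascript
// Hypothetical helper, not part of pdf.js: join extracted lines into one
// paragraph, inserting a space at each line boundary.
function joinLines(lines) {
  let out = "";
  for (const raw of lines) {
    const line = raw.trim();
    if (line === "") continue; // skip empty lines
    if (out === "") {
      out = line;
    } else if (out.endsWith("-")) {
      // Heuristic: keep a trailing hyphen and join without a space, which
      // suits compounds like "human-associated" broken across two lines.
      out += line;
    } else {
      out += " " + line;
    }
  }
  return out;
}

console.log(joinLines(["in the microbes that", "occupy habitats such as"]));
```

The hyphen rule is only a guess: a word that was hyphenated purely for line breaking would keep a hyphen it should lose, and telling the two cases apart would need a dictionary or similar.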

mitar commented 11 years ago

@tilgovi yes, pdf.js works quite nicely on the server in headless mode. We are using this in our meteor-pdf.js module. Adding OCR to it would be a great feature when running pdf.js on the server side, to allow better full-text search even for PDFs which lack text content themselves. And then we could even push that OCR text to the client as an overlay for text selection.

Do you know of any OCR library for node.js?

tilgovi commented 11 years ago

Maybe shell out to OCRopus?

speedplane commented 9 years ago

PR #5783 should fix this.

timvandermeij commented 8 years ago

Fixed after recent text selection patches.