comment extraction - highlights are offset to the beginning of the file, parts of words missing

annargrs commented 6 years ago

Here is a sample file. It has 4 highlights with comments on them, made in 3 different editors: Adobe Reader, Xodo (Android), and Master PDF Editor (Linux).

The extracted highlights all start and end earlier than they should.

Also, some characters are missing. Line ends seem particularly problematic ("t" instead of "text,").

This is not just one weird PDF: I have this problem in many files, especially the missing characters :(

Zotfile output:

Extracted Annotations (2018/4/13 17:34:18)

acrobat reader (note on p.11)

"Probabilistic topic models are a popular tool for the unsupervised analysis of t providing both a predictive model of future text and a latent topic representati of the corpus. Practitioners typically assume that the latent space is semantic a meaningful. It is used to check models, summarize the corpus, and guide ex ration of its contents. However, whether the latent space is interpretable is in n of quantitative evaluation. In this paper, we present new quantitative methods measuring semantic meaning in inferred topics. We back these measures w large-scale user studies, showing" (Chang, Boyd-Graber, Gerrish, Wang and Blei :11)

"1 Introduction" (Chang, Boyd-Graber, Gerrish, Wang and Blei :11)

buzz xodo (note on p.11)

"approxiate posterior inference, we can use topic models to discover both the topics and an assignm of topics to documents from a collection of documents. (See Figure 1.) These modeling assumptions are useful in the sense that, empirically, they lead to good models [of - missing] documents. They also anecdotally lead to semantically meaningful decompositions of them: top tend to place high probability on words that" (Chang, Boyd-Graber, Gerrish, Wang and Blei :11)

master pdf editor (note on p.11)

"Pay no attenti o n to the latent space behind the model Although we focus on probabilistic topic models, the field began in earnest with latent sema analysis (LSA) [6]. LSA, the basis of pLSI's probabilistic formulation, uses linear algebra to dec e Shape of Cinema, Multiplex SA originated in the psychology commun early evaluations focuse on replicating human performance or judgments using LSA: match play, film, t the Click of sense distinctions, and" (Chang, Boyd-Graber, Gerrish, Wang and Blei :12)

master pdf editor (note on p.12)

annargrs commented 6 years ago

Update: poppler (on Mac) copes with these highlights much better: no weird line offset, and letters do not disappear at random (although it does not handle hyphenated words as well as pdf.js). Is there a way to use poppler on Ubuntu? The option is currently just grayed out.

jlegewie commented 6 years ago

Thanks for sending the file. Unfortunately, I won't work on ZotFile in the near future, but the file will be useful for anyone who picks up the work on PDF extraction.

The poppler-based solution was never compiled for any system other than Mac.

annargrs commented 6 years ago

That's very sad to hear. I hope you won't abandon ZotFile entirely!

What would be easier: fixing your fork of pdf.js, bringing in the newer pdf.js, or porting poppler to Linux? Is it the poppler from poppler.freedesktop.org, which is already used in some Linux programs such as Evince? It even exists as an Ubuntu package.

jlegewie commented 6 years ago

Here is the poppler-based script if you want to give it a shot. I would start by compiling one of the poppler example scripts and then try to reproduce the same build process with this script. But I can't provide support beyond this. I haven't done it for almost 10 years.

main.txt (this is a cpp file)

jlegewie / zotfile

comment extraction - highlights are offset to the beginning of the file, parts of words missing #350