How to copy text from PDF without line breaks

sumatrapdfreader / sumatrapdf

SumatraPDF reader

http://www.sumatrapdfreader.org

GNU General Public License v3.0


How to copy text from PDF without line breaks #3305

Closed andreasvarga closed 1 year ago

andreasvarga commented 1 year ago

I am working on a book that includes many code examples. For the ebook version, the code sequences copied and pasted into the program (or any editor) should ideally still be executable code. In the preface of my book I recommend using SumatraPDF on Windows for these operations. However, I had to invest a lot of time to make this work. The problem is the following:

Gaps in the displayed code that are wider than one space are automatically converted into line breaks when the text is copied.

Here is a short code example, which is reformatted after copying it, thus leading to execution errors (texts following # are comments):

example_book_test.pdf

What is expected after copying the code (between the two horizontal rules) is:

# Example 5.4 - Solution of an EFDP
println("Example 5.4")

# define s as an improper transfer function
s = rtf('s');
# define Gu(s), Gd(s),  Gf(s)
Gu = [(s+1)/(s-2);  (s+2)/(s-3)];     # enter Gu(s)

What is obtained by pasting into an editor (and of course as code into the processing program) is:

# Example 5.4 - Solution of an EFDP
println("Example 5.4")
# define s as an improper transfer function
s = rtf(’s’);
# define Gu(s), Gd(s),
Gf(s)
Gu = [(s+1)/(s-2);
(s+2)/(s-3)];
# enter Gu(s)

This leads to execution errors because of the line breaks. The only way I found to avoid them was to carefully eliminate every gap in the code that is longer than one space. I would be very pleased if the handling of spaces matched the expected output (that is, without line breaks).

Some other PDF viewers have the same behaviour (I can confirm this only for Adobe Reader and Firefox). One that works (almost) correctly is the Edge PDF viewer, which automatically replaces runs of spaces with a single space. Its output, copied into an editor, is:

# Example 5.4 - Solution of an EFDP
println("Example 5.4")
# define s as an improper transfer function
s = rtf(’s’);
# define Gu(s), Gd(s), Gf(s)
Gu = [(s+1)/(s-2); (s+2)/(s-3)]; # enter Gu(s)

This code runs without errors! Can this behaviour of Edge also be enforced in SumatraPDF?

GitHubRulesOK commented 1 year ago

This has nothing to do with SumatraPDF. The text is stored as separate glyphs with big and small gaps between them. See this example of a searchable image, which may be no different from any other type of text in that publication. A table extractor will try to add line wraps where the boxes are, so in that case you can get different results, but each text string is a line of text in its own context. See the graphical layout at the bottom.

NewDocument.pdf: this file is readable in a text editor, so you can see how the strings are composed as BT...ET fragments of the contents. The whole block (part of one page of a science journal) is placed as an image, /Im1 Do, just as a PNG or JPG might be added into 5 0 obj (the page contents). It was probably written on a Mac in MS Word, then compiled by PTEX (LaTeX?) as an insertion called spectrum.pdf, and the first word or line of text is

BT
.0183 Tc
/TT2 1 Tf
58.33334 0 0 58.33334 0 2450 Tm
[(!")14(#)10.000004($)12(%)16(&)] TJ
ET

= Na t i v e

So the middle central string, "Containers", looks like this. Here is the text fragment from the PDF:

BT
.017 Tc
/TT2 1 Tf
58.33334 0 0 58.33334 0 2450 Tm
[(+,)1(\()9.000004(#)22.000008(")3($)10.000004(\()-1(&)1(-)27(.)] TJ
ET

It starts with BT (Begin Text) and ends with ET (End Text).

Writing text fragments is a one-way street: you can put a fragment in, but you cannot get one back unless you unbake it. Here + = C, , = o, \( = n, # = t, " = a, $ = i, and then \( = n yet again, so we are on to a winner: the text is "Container". There is one more code, ., which is the s at the end of the whole string for that part of the page, but it is separate from any other partial words.
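The "unbaking" described above can be sketched in Python. This is only an illustration, assuming the one-byte glyph codes map to characters exactly as read off the two fragments quoted in this comment; the table `GLYPH_MAP` and the helper names are hypothetical, and the entry `-` = r is inferred from the word "Containers" rather than stated above. A real viewer would take the mapping from the font's encoding or ToUnicode data instead.

```python
# Glyph-code table read off the "Native" and "Containers" fragments above.
# Assumption: '-' = r is inferred from the word, not stated in the thread.
GLYPH_MAP = {
    "!": "N", '"': "a", "#": "t", "$": "i", "%": "v", "&": "e",
    "+": "C", ",": "o", "(": "n", "-": "r", ".": "s",
}


def tj_strings(tj_array: str) -> list[str]:
    """Return the literal strings inside a TJ array.

    Handles backslash escapes by taking the next character literally
    (enough for \\( \\) and \\\\); octal and \\n-style escapes are not handled.
    Numbers between the strings (kerning adjustments) are skipped.
    """
    strings, buf, depth, i = [], [], 0, 0
    while i < len(tj_array):
        ch = tj_array[i]
        if depth == 0:
            if ch == "(":          # start of a literal string
                depth, buf = 1, []
        elif ch == "\\" and i + 1 < len(tj_array):
            buf.append(tj_array[i + 1])
            i += 1
        elif ch == "(":            # balanced, unescaped parenthesis
            depth += 1
            buf.append(ch)
        elif ch == ")":
            depth -= 1
            if depth == 0:
                strings.append("".join(buf))
            else:
                buf.append(ch)
        else:
            buf.append(ch)
        i += 1
    return strings


def decode(tj_array: str) -> str:
    """Map every glyph code to its character; unknown codes become '?'."""
    return "".join(GLYPH_MAP.get(c, "?") for s in tj_strings(tj_array) for c in s)


print(decode(r'[(!")14(#)10.000004($)12(%)16(&)] TJ'))
print(decode(r'[(+,)1(\()9.000004(#)22.000008(")3($)10.000004(\()-1(&)1(-)27(.)] TJ'))
```

Note that the kerning numbers between the strings are simply dropped here, which is exactly the information an extractor needs when it has to decide where the gaps are.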

SumatraPDF can decode the codes into characters with no problem, but it cannot waste time altering the spaces between characters, pairs, or triplets, so if the copied text has bigger gaps, the clipboard will see those as line endings. With OneCharReplace (OCR) it is not uncommon to see every single letter end up as a line in its own "write".

By far the best of all extractors for plain Courier "programs" is pdftotext from Xpdf (or its Poppler clone; both are the result of hundreds of man-years of development since the 90s). They can export a page as laid out, with little or no addition other than indents and a hard carriage return at the end of each text line.

pdftotext -layout -enc UTF-8 -nopgbrk FILENAME.pdf

Native     Linux      gVisor    Firecracker    KVM/QEMU
Linux   Containers  Containers  microVMs         Full VMs

Host Kernel                                     Guest Kernel

                    Location of Functionality
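For readers who want to script the pdftotext call shown above, here is a minimal Python sketch. It assumes pdftotext (Xpdf or Poppler) is installed and on the PATH, and it passes "-" as the output file so the text is written to standard output; the function names are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str, out_path: str = "-") -> list[str]:
    """Build the command line quoted above; '-' sends the text to stdout."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", "-nopgbrk", pdf_path, out_path]


def extract_layout_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path), check=True, capture_output=True
    )
    return result.stdout.decode("utf-8")
```

Calling `extract_layout_text("FILENAME.pdf")` would then return the laid-out text, which can be searched for the code listing instead of copying it from a viewer.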

andreasvarga commented 1 year ago

I had hoped that my wish to eliminate the artificially introduced line breaks in the copied lines would be easy to fulfil. I was not thinking of reformatting lines, but of keeping the existing formatting if possible. I realize it was only wishful thinking.

GitHubRulesOK commented 1 year ago

Every extraction application has its own method of conglomerating the fragments; otherwise there would be no point in having different applications. Hence some profiles, aggregations, or estimated inputs are good for one context and others for a different part of a PDF. Basically, PDF writing destroys all human input by converting keyboard entries into a bytestream with BT niceties ET, but with mahoosive overheads of its own! Portable format (not really).

avorobey commented 1 year ago

@andreasvarga I tried to look into this, and there's no easy fix. But there might be something you can do on your side, depending on how you produce the PDF.

The line breaks that are annoying you are generated by MuPDF, the library SumatraPDF uses for rendering PDFs. They are "a feature, not a bug", in the sense that there's special code that inserts them based on how much empty space there is between subsequent characters (as you noticed by experimenting). This code uses heuristics with hardcoded limits; the code is around here. Based on the amount of whitespace, the code will start a new line or even a new paragraph. I think MuPDF developers consider it a feature. The fact that Edge's browser will collapse all white space to one space is in a sense a misfeature, because lots of PDFs with visually quite a large gap are not represented well, in text, by a single space.

It's possible that MuPDF should be smarter about horizontal-only white space and do things like "insert several space characters instead of starting a new line". If you'd like that, your best bet is probably opening a bug on their bug tracker.

However, consider also how your PDF is produced. Whatever software you use for producing it (I don't know what it is), deliberately throws away spaces (as font characters) and instead positions words individually and precisely. For example, consider this line from your file: "# define Gu(s), Gd(s), Gf(s)". Let's focus on just this part: "Gu(s), Gd(s), Gf(s)" which produces a line break. If you examine the PDF's low-level drawing instructions, you will see something like this:

[-600(Gu(s),)] TJ [-600(Gd(s),)] TJ [-1200(Gf(s))] TJ

TJ is the command to draw text. -600 or -1200 means "move right by this many thousandths of a text-space unit", so the actual distance scales with the font size. So your PDF is instructing the viewer to advance some amount, output "Gu(s),", then advance again, output "Gd(s),", then advance twice as much, and output "Gf(s)". The last advance exceeds the limit in the MuPDF source code, so MuPDF inserts a line break.
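To make the effect concrete, here is a small Python sketch of a gap-based heuristic of this kind. It is not MuPDF's code: the function name, the data layout (a list of adjustment/text pairs mirroring the simplified TJ commands above), and the threshold of 1000 are all assumptions chosen only so that -600 stays on the line and -1200 breaks it.

```python
def join_fragments(fragments, break_threshold=1000):
    """Join (adjustment, text) pairs the way a gap heuristic might.

    A negative adjustment moves right, as in `[-600(Gu(s),)] TJ`.
    Gaps wider than `break_threshold` (a hypothetical limit, not
    MuPDF's real one) become a line break; smaller gaps become a
    single space.
    """
    out = []
    for adjustment, text in fragments:
        gap = -adjustment
        if out:
            if gap > break_threshold:
                out.append("\n")
            elif gap > 0:
                out.append(" ")
        out.append(text)
    return "".join(out)


line = [(0, "# define"), (-600, "Gu(s),"), (-600, "Gd(s),"), (-1200, "Gf(s)")]
print(join_fragments(line))
```

With these assumed numbers the sketch reproduces the pasted result from the original report: "Gf(s)" ends up on a line of its own.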

(you can see these details by running "mutool clean -d example_book_test.pdf" which produces "out.pdf" with these commands in clear text. mutool is a program which comes with MuPDF, not part of SumatraPDF. Also, I lied and simplified a lot above, to make my point clear. The real commands isolate "(", ")" and "," characters into their own TJ commands, but without any special advancement; and there are lots of irrelevant color-setting commands k,K,g,G in-between).

But if instead of all this your PDF just said

[(# define Gu(s), Gd(s), Gf(s))]TJ

using one or two spaces in the text instead of -600 or -1200, then I think it would've worked fine. I think MuPDF's heuristic code wouldn't fire because the space character would render (invisibly) and there'll be no large gap, like -1200 in your file. I didn't test this, but it seems like it should work.

The software producing your PDF replaces spaces with exact positioning, and maybe you need that and it's the thing you want; but maybe just using spaces and relying on their widths in the font you use will work just as fine, if you can convince your software to do that.

Hope that helped, Anatoly.

andreasvarga commented 1 year ago

Thanks for the ample response.

The PDF file is generated from a TeX file via the pdflatex tool, embedded in the TeX distribution MiKTeX. I was not able to figure out whether there is any way to influence the generation of the PDF file. Moreover, the final PDF file for the ebook version will be generated by the publisher (Springer), so I have practically no control over the process. This is why my first idea was to find out whether this issue can be addressed at the level of the PDF viewer I am currently using (and recommending in the preface of my book), which, as you confirmed, inserts line breaks for gaps exceeding one blank.

Therefore, I am trying to "survive" with the existing restrictions. This issue affects only the generated code listings. For the language in question (Julia), comments start with the character "#". When a comment that follows a statement is preceded by blanks, the inserted line break merely moves it to its own line, and the result is still executable. Thus, copy-paste operations from the PDF file into the Julia command interpreter usually work, unless there is more than one blank inside a comment, in which case they fail. So I manually replaced all gaps exceeding one blank with exactly one blank, and things now work as desired. Unfortunately, I lost some formatting within comments, which served to emphasize some structure in the data.
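The manual replacement described above could be automated on the listing sources before the PDF is built. The following Python sketch is one possible way, assuming the listings are available as plain text lines; the function name is hypothetical, and note that it also collapses runs of spaces inside string literals, which may not be wanted.

```python
import re


def collapse_inner_spaces(line: str) -> str:
    """Replace every run of two or more spaces with one space,
    leaving the leading indentation of the line untouched."""
    stripped = line.lstrip(" ")
    indent = line[: len(line) - len(stripped)]
    return indent + re.sub(r" {2,}", " ", stripped)


print(collapse_inner_spaces("Gu = [(s+1)/(s-2);  (s+2)/(s-3)];     # enter Gu(s)"))
```

Applied to the last line of the example listing, this yields the same text that the Edge PDF viewer produced.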

Just a comment on this discussion: the decision to insert a line break in case of spaces exceeding a certain length seems to me a little bit exaggerated (it impedes clean copy-paste operations). Would it not be simpler, instead of inserting a line break, to insert as many blanks as are needed to cover the gap between words? Anyhow, once again, many thanks for considering the issue.
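The alternative proposed here, filling a gap with blanks rather than breaking the line, might look like the following Python sketch. It assumes a monospaced listing font in which one blank is 600 units wide and a simplified input of (adjustment, text) pairs where a negative adjustment moves right; both the function name and these numbers are illustrative assumptions, and the sketch only covers text running along one horizontal line.

```python
def join_with_spaces(fragments, space_width=600):
    """Join (adjustment, text) pairs, covering each gap with blanks.

    `space_width` is the assumed width of one blank in the same
    units as the adjustments (600 is a guess for a monospaced font).
    Every positive gap yields at least one blank.
    """
    out = []
    for adjustment, text in fragments:
        gap = -adjustment
        if out and gap > 0:
            out.append(" " * max(1, round(gap / space_width)))
        out.append(text)
    return "".join(out)


line = [(0, "# define"), (-600, "Gu(s),"), (-600, "Gd(s),"), (-1200, "Gf(s)")]
print(join_with_spaces(line))
```

Under these assumptions the double gap before "Gf(s)" comes back as two blanks, which is the formatting of the expected output in the original report.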

GitHubRulesOK commented 1 year ago

@andreasvarga the output of pdfLaTeX is much like OCR: it intentionally consists of character placements, not words as you would get from a word processor. Both are therefore natively poor candidates for retroactively recombining into words (which requires dictionaries with syntactic allowances), and this is especially difficult with mathematical or scientific jargon and abbreviations.

The simplest way to get the best result for a body of plain text (often including cursive Arabic/Persian etc.) is to use pdftotext, as above, with the -layout setting, which visually ignores small gaps, injects big ones, and writes one output line per line of text.

However Plain texT is not Plain TeX and formulaic content got thrown out of the pram / Window when inserted into the PDF

avorobey commented 1 year ago

@andreasvarga I think I agree with you on "the decision to insert a line break in case of spaces exceeding a certain length, seems to me a little bit exaggerated". But I'm also mindful of not understanding the intricacies of this decision on the MuPDF side. The algorithm they have works for strings written in any direction (not necessarily horizontally), so they can't use something as simple as "continues on the same horizontal line". Maybe they had a good reason, after testing against many PDFs, to prefer making line breaks and paragraph breaks (they do insert a space under some conditions, but only if the previous character wasn't a space, so they never produce a sequence of spaces). Maybe they didn't have a good reason and it was just easier and seemed less brittle than trying to estimate the number of spaces. I don't know, but I think that SumatraPDF will continue to defer to MuPDF on this unless this issue somehow grows wildly in importance.

I suggest filing a bug on MuPDF (https://bugs.ghostscript.com/, product: MuPDF, component: fitz). I don't know if it'll be attended to, but the cost is small (you've already invested in writing a careful description of the issue with a test file), and it's good to let the developers know of the problem.

And best of luck with your book!

GitHubRulesOK commented 1 year ago

@avorobey SumatraPDF is not the only MuPDF user; MuPDF's primary commercial (and free) concerns are programming users such as PyMuPDF, where the XY box of a word block or character block is of import. Personally, I agree there should be better conglomeration of simple lines, but there are many simple PDF extractors that can work by region of interest or line by line. The biggest gripes are not being able to detect the missing line wrap, and having no means of determining which lines matter more, such as indentation marking a paragraph start or end. However, the modern trend is no indents and justified lines without end-of-paragraph pilcrows.

avorobey commented 1 year ago

@GitHubRulesOK you're right, good point re: other users of MuPDF.