izuzak / atom-pdf-view

Support for viewing PDF files in Atom.
https://atom.io/packages/pdf-view
MIT License
106 stars, 30 forks

Adds text layer and copy&paste #94

Open mar29th opened 8 years ago

mar29th commented 8 years ago

This adds a text layer on top of the original PDF canvas, along with a preliminary solution for copy and paste.

Changes made:

About copy & paste: The text content returned by Page.getTextContent() is fragmented, which makes it hard to find a universal way to copy flawlessly. So far, when copying multiple lines, the native document.execCommand('copy') is used. This won't preserve formatting if the copy destination is in Atom, but it works when copying to Word or TextEdit. There is also a select() method that adds a line break after contents that are approximately on the same line. It is currently commented out, since it only makes copied contents look nicer within Atom. While the method would work for strictly formatted PDFs (e.g. TeX converted), it does not function very well for some irregularly formatted documents, or when word separation is done purely by pixel arrangement :joy:
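To illustrate the "approximately on the same line" idea, here is a minimal sketch, not the code from this pull request. It assumes text items shaped like the ones PDF.js returns from getTextContent() (`str`, `transform`, `width`, `height`); the helper name `joinItemsByLine`, the `tolerance` parameter, and the gap heuristic for inserting spaces are all hypothetical.

```javascript
// Hypothetical helper: group PDF.js-style text items into lines of text.
// Each item is assumed to look like
//   { str, transform: [a, b, c, d, x, y], width, height }
// and the items are assumed to arrive in reading order.
function joinItemsByLine(items, tolerance = 2) {
  const lines = [];
  let current = null;
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    if (current === null || Math.abs(y - current.y) > tolerance) {
      // Baseline moved by more than the tolerance: start a new line.
      current = { y, text: '', endX: x };
      lines.push(current);
    } else if (x - current.endX > item.height * 0.25 && !current.text.endsWith(' ')) {
      // Same line, but a visible horizontal gap: assume a word break.
      current.text += ' ';
    }
    current.text += item.str;
    current.endX = x + item.width;
  }
  return lines.map((line) => line.text).join('\n');
}
```

A heuristic like this depends on consistent baselines and item widths, which is likely why it holds up for TeX-generated PDFs and struggles with irregularly formatted ones.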

izuzak commented 8 years ago

Wow, @lafickens, this is amazing -- thanks so much for making a pull request, I'd love to get this into pdf-view. :zap:

I was just testing this and noticed a few things:

  1. This seems to break synctex support which was added in https://github.com/izuzak/atom-pdf-view/pull/87. Basically, with synctex support -- when you click on a line in a PDF generated from tex source, that tex source file is opened and the cursor is put on the exact line in the source code which was clicked in the PDF. On this branch, clicking on a line of text doesn't do anything, I'm guessing because of the things you added for handling selections. Before we merge this, I think we should make sure it doesn't break any of the existing functionality. Could you look into that when you find some time?
  2. As you mentioned, it's hard to make this work well for all PDFs. I noticed that in a lot of cases (but not all) when I use the mouse to start selecting from the start of the line to the end of the line -- selection stops about 4-5 characters before the end of the line. See this GIF:

    [GIF: selecting]

    I don't think this in itself is a blocker for getting this merged, but I would like to put this feature behind a config setting which would be disabled by default (until we're confident that it's working well and not causing any problems). That way, anyone can use this great feature, but they can easily enable and disable it if they want. What do you think?
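As a rough sketch of what an opt-in setting could look like: Atom packages declare settings through a `config` schema and read them with `atom.config.get`. The key name `enableTextLayer` and the helper `isTextLayerEnabled` are assumptions for illustration, not names from this package.

```javascript
// Sketch of an Atom package config schema.
// The key name `enableTextLayer` is hypothetical.
const config = {
  enableTextLayer: {
    type: 'boolean',
    default: false, // disabled until the feature is known to be stable
    description: 'Render a selectable text layer on top of the PDF canvas (experimental).'
  }
};

// `getSetting` is injected so the check can also run outside Atom;
// inside Atom it could be (key) => atom.config.get(key).
function isTextLayerEnabled(getSetting) {
  return getSetting('pdf-view.enableTextLayer') === true;
}
```

With a check like this, the view would only build the text layer when the setting is on, so existing behavior stays unchanged for everyone else.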

Thanks again!

mar29th commented 8 years ago

@izuzak Oops, I missed the syncTeX functionality when adding the text layer. I will look into that.
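One possible way to let both features coexist, sketched under assumptions rather than taken from either pull request: treat a mouseup as a plain click (and hand it to the SyncTeX handler) only when nothing was selected and the pointer barely moved. The helper name `isPlainClick` and the `maxMove` threshold are hypothetical.

```javascript
// Hypothetical helper: decide whether a mouseup should count as a plain click
// rather than the end of a text selection.
// `down` and `up` are { x, y } pointer positions in pixels.
function isPlainClick(down, up, selectionIsCollapsed, maxMove = 3) {
  const moved = Math.hypot(up.x - down.x, up.y - down.y);
  return selectionIsCollapsed && moved <= maxMove;
}
```

In a browser context, the `selectionIsCollapsed` flag could come from `window.getSelection().isCollapsed`.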

I think adding a config item for enabling the text layer is a good idea. There is indeed a problem with positioning the text layer on top of the PDF canvas, which is probably the reason for the peculiar behavior when selecting text.

There is, however, another possible reason for this problem: the font ascent and descent attributes are not properly set when the PDF is generated. Although the PDF.js developers claim to have solved the problem (see mozilla/pdf.js#4665), it still exists for some documents. I have encountered one such document. Download it and try opening it in Firefox (or the PDF.js viewer) and select some text; you will see that the text layer is about 20 pixels above the actual text in the canvas... What I've discovered so far is that PDFs converted from LaTeX work pretty well in PDF.js.
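To make the ascent issue concrete, here is a simplified model (an assumption for illustration, not PDF.js's actual code) in which a text span's top edge is placed one ascent above the baseline. The function name `textSpanTop` and the numbers are hypothetical.

```javascript
// Simplified model: the top of a text-layer span is the baseline position
// minus the font's ascent, where `ascent` is a fraction of the font height.
function textSpanTop(baselineY, fontHeight, ascent) {
  return baselineY - fontHeight * ascent;
}

// With a 20px font, a correct ascent of 0.75 puts the span 15px above the baseline.
const expectedTop = textSpanTop(500, 20, 0.75);
// If the font reports a bogus ascent of 1.75, the span ends up 35px above it.
const actualTop = textSpanTop(500, 20, 1.75);
// Vertical error between the canvas text and the text layer, in pixels.
const shift = expectedTop - actualTop;
```

In this model a bad ascent value in the font data shifts the whole span vertically, which would match a text layer sitting a fixed distance above the rendered glyphs.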

izuzak commented 8 years ago

Just a heads-up -- I switched the package to use JavaScript in https://github.com/izuzak/atom-pdf-view/pull/98, so this branch is no longer mergeable. If you do continue working on this -- we can tackle the conversion to JS last, after we get things working. Thanks again! :bow: