sumatrapdfreader / sumatrapdf

SumatraPDF reader
http://www.sumatrapdfreader.org
GNU General Public License v3.0

Ability to export annotations #2084

Open ribtoks opened 3 years ago

ribtoks commented 3 years ago

Hi

Thanks for developing Sumatra PDF reader. I was very excited to finally get PDF annotations released in version 3.3. Thank you for the hard work!

One feature that I'm missing though is the ability to export the annotations - in whatever format possible (e.g. .txt). I'm using this to make notes from the book and save them separately. Later I might make Anki flashcards from the notes or just save them in my notebook. There are examples of other software that can do that under Linux (for example, Foliate that can export annotations as HTML, markdown or plaintext) and I'm missing this from SumatraPDF.

Would be incredible to see this feature!

GitHubRulesOK commented 3 years ago

@ribtoks Without a physical example file, this issue would need to be closed since there have been so many changes in annotation handling since 3.3

[LATER EDIT] My Mistake, wrong issue quoted, this topic is an enhancement request

ribtoks commented 3 years ago

@GitHubRulesOK Thank you for the fast reply, but unfortunately your reply does not correlate with my question.

I have a PDF file. Any of them with text. I create a highlight: I select text, press "a" or right-click, and create a highlight of the selected text. Then I would like to export all pieces of text that I have highlighted to a file, say "highlights.txt" (the exact way/format does not matter, the export function is what matters).

As for your answer, I never mentioned PDF files with an attachment. Would you mind double-checking my question?

Thanks again for answering so fast.

GitHubRulesOK commented 3 years ago

OK, my bad. Annotations are used for holding exportable text files as well as other file types; exporting all annotation comments as text content is an alternate usage, which I forgot because it was not a prior ability. Saving all comments was not a feature in basic Acrobat Reader 9, but may be found in more recent PDF editors.

ribtoks commented 3 years ago

@GitHubRulesOK Ok, thanks for the clarification. I hope highlights exporting will be implemented somehow. Let me know if I can help.

GitHubRulesOK commented 3 years ago

@kjk as you know all too well, there are multiple ways a user can add textual content:

  • Annotation can carry embedded text files (a separate open issue, which I mistook for this request)
  • Annotation can carry extensive screen text as "tooltips" without a visible object (a recent open issue)
  • Annotation can carry visible "free text" (related to, but not this issue)
  • Annotation, either as icon or highlight, can "pop up" comments either via tooltip or editor box
  • There are others

In this case the requirement is to export at least the latter group to a text file: a feature that collects and tags page comments for fresh export, which would as a minimum require re-collating such objects into page order. If such an ability is built, the most likely request will be to sort into page order by means of using/showing negative Y offsets.
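
The re-collation described above can be sketched in a few lines of Python. This is only an illustration, not SumatraPDF code: the `Annot` record and its fields are hypothetical, and it assumes the usual PDF user space where Y grows upward, so sorting on negative Y gives top-to-bottom order within a page.

```python
from dataclasses import dataclass


@dataclass
class Annot:
    """Hypothetical record; not a SumatraPDF or MuPDF type."""
    page: int     # 1-based page number
    top_y: float  # top edge of the annotation rect, in PDF points
    text: str     # comment text carried by the annotation


def reading_order(annots):
    """Sort by page, then top-to-bottom (PDF Y grows upward, hence -top_y)."""
    return sorted(annots, key=lambda a: (a.page, -a.top_y))


notes = [
    Annot(2, 700.0, "top of page 2"),
    Annot(1, 120.0, "bottom of page 1"),
    Annot(1, 650.0, "top of page 1"),
]
for a in reading_order(notes):
    print(f"Page {a.page}: {a.text}")
```

Annotations stored in creation order come out in page order, which is what a reader of an exported summary would expect.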

kjk commented 3 years ago

I plan on enabling JavaScript bindings for operating on PDF files, like in mutool. This could be implemented as a JavaScript program.

GitHubRulesOK commented 3 years ago

@kjk Just a word of caution: if scripting actions are enabled, then unlike MuPDF and older Acrobat (where you need to remember to deactivate auto-running), I feel the default should be OFF, with a manual step provided to activate it on a per-use basis.

xh542428798 commented 3 years ago

Is there any plan to develop a feature that saves annotations separately?

GitHubRulesOK commented 3 years ago

@xh542428798 Not clear from your comment if you are referring to a) exporting annotations that are files e.g. open issue #1602 b) a new feature such as report a list of annotations with their contents as described above c) save annotations external to a PDF as they were in the past and as still done in some other PDF readers (unlikely as problematic)

kjk commented 3 years ago

@ribtoks Could you propose what this format would look like? I assume the output would then be used in some way. I could copy Foliate (but it's an epub reader, not a PDF reader, so it might not translate 100%). What information should be included? Just the text of the annotation? The type (highlight, underline etc.)? Should it include page number / position on the page?

I could export in some simple text format, e.g.:

annotation 1
---
second annotation
---
third annotation

Or in json:

[
  { "text": "annotation 1", "page": 17, .... },
  { .... }
]

GitHubRulesOK commented 3 years ago

Just to kick off, here is the most basic output from Xchange (I deliberately kept it simple, but it should convey more):

////////////////////////////////////////////////////////////////////////////////////////////////////
// Summary of comments on MyOutput _[note this is a .pdf]_
////////////////////////////////////////////////////////////////////////////////////////////////////

Page: 1
----------------------------------------------------------------------------------------------------
Page: 1
Type: Ink  Author: <None>  Subject: <None>  Date: <None>

Page: 1
Type: FreeText  Author: K  Subject: <None>  Date: 2021-06-20, 04:05:05
Hello World!

Page: 1
Type: Highlight  Author: K  Subject: <None>  Date: 2021-06-20, 22:05:04

    Type: Text  Author: K  Subject: Sticky Note  Date: 2021-09-10, 21:38:00
    what is the context here there is no copied content use SHIFT A next time

So note that, as requested by the OP, it has the content for FreeText but not any content from Highlight, which was desired. In no case is there a hint of page position (X,Y,dx,dy) nor colour coding, as may be visually added for author or subject grouping. Those may be exported by means of an [X]FDF file, but that's way more complex as it's similar to the PDF page input.

xh542428798 commented 3 years ago

I am sorry I didn't describe it clearly. Most of the time I don't want to save annotations in the original PDF file, so I wonder whether there is any way to save annotations outside of PDF files, so that when I open a PDF, it can load the corresponding annotation file at the same time. The annotation file format could be JSON, I think? I know it is a big change for the software; maybe Sumatra can offer it as a choice. Thanks a lot. Such as: image

@xh542428798 Not clear from your comment if you are referring to a) exporting annotations that are files e.g. open issue #1602 b) a new feature such as report a list of annotations with their contents as described above c) save annotations external to a PDF as they were in the past and as still done in some other PDF readers (unlikely as problematic)

ribtoks commented 3 years ago

@kjk Thank you for working on this issue.

Goals

First of all, I'd like to remind the whole idea why I need it:

  1. exporting highlights to external text editor / notebook (think Joplin, Evernote) to have the gist of the book
  2. Creating Anki cards from some of the highlights from (1)

Format ideas in plain text

  1. As for "simple text format", it can be plain text or Markdown (preferred for me). The information I need there are only highlights and notes, chapter title, subheading title (if there's one), maybe page number. There's no need to know the type of annotation or other technical details in this "simple" text format.

  2. Additionally, JSON would also be quite convenient: it can contain more technical information, so that the exporter can run some sort of jq template over it and make the "simple text format" from (1) themselves. In my eyes this could be a step 2, extending the "simple text format" from 1.

Examples

I'd like to provide an example of the "simple text format" only, because for JSON you will know better what "properties" you have for an annotation, and for JSON you can just dump all of them.

---

#### Chapter 3: How to do XYZ - yellow - p. 123

> Here goes the actual quote: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam

---

GitHubRulesOK commented 3 years ago

@ribtoks Your description is good but PDF does not have such construction as paragraph or chapter unless defined by human eye. The highlight has a highly complex structure that can be considered as an overlay above a single page of content thus does not know about the underlying text or "chapter" only its co-ordinates on the page. It is the user that inserts the comments (unless auto copied at time of highlight). Thus what can be garnered for listing is limited to:-

There are two prescribed formats for programmable extraction, but they are not of value to a human reader as they are designed to export the overlay: the norm is FDF, and the XML version is XFDF. They then need conversion into JSON / XML using complex decoding. As an example, the XML one for a few comments would run to pages and look like:

<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<f href="../../../MyData/out4.pdf"/>
<ids original="C15E80D5BDF0828DC94C16477298D2DF" modified="7FFD0396F1FFF203EB9161E04C09935B"/>
<annots>
<ink page="0" flags="print" name="2d66392a-0430-4d28-815f6c00f6404f8a" rect="327.053986,595.205017,595.440002,771.515991" color="#004DE6">
<inklist><gesture>461.785004,761.403992;461.785004,761.531006;461.785004,761.828003;461.785004,762.286987;461.657013,762.877014;461.359985,763.559021;460.901001,764.171021;460.183014,764.771973;459.075012,765.392029;457.705994,765.916992;456.183014,766.432983;454.585999,766.982971;452.970001,767.585022;451.234985,768.111023;449.480011,768.63></inklist>
</ink>
<freetext intent="FreeText" IT="FreeText" title="K" page="0" date="D:20210620030505Z" flags="print" name="a104f142-9195-4156-85844ebed3dc4daa" rect="359.740875,605.969116,559.740845,705.969116" width="0">
<contents>Hello World!</contents><defaultappearance>/Helv 30 Tf 1 0 0 rg</defaultappearance></freetext>
<highlight coords="39,674.669983,198,674.669983,39,661.669983,198,661.669983,18,662.669983,306,662.669983,18,649.669983,306,649.669983,18,650.669983,104,650.669983,18,637.669983,104,637.669983" title="K" page="0" date="D:20210620210504Z" flags="print" name="0631dd8c-bae4-46f5-b1841f9c9e5d16e3" rect="14.935769,636.857483,309.06424,675.482483" color="#FF00FF"/>
<text icon="Comment" inreplyto="2d66392a-0430-4d28-815f6c00f6404f8a" title="K" creationdate="D:20210910212148+01'00'" page="0" date="D:20210910212148+01'00'" flags="hidden,print,nozoom,norotate" name="3e3d1483-529e-4b7c-b6c64eeda95dd864" rect="100,102,120,120" color="#FFFF00"/>
</annots>
</xfdf>

So not human readable XML as one might expect
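
For anyone who does end up with an XFDF file, reducing it to something readable is mostly a matter of walking the `annots` element. A minimal sketch using only the Python standard library follows; the function name and the trimmed sample are made up for illustration, and real XFDF from different tools may carry comment text in other elements (for example rich text) that this ignores.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "{http://ns.adobe.com/xfdf/}"


def summarize_xfdf(xfdf):
    """Return (page, type, contents) for each annotation in an XFDF document.

    Accepts str or bytes; read real files as bytes so the encoding given
    in the XML declaration is honoured by the parser.
    """
    root = ET.fromstring(xfdf)
    annots = root.find(f"{XFDF_NS}annots")
    rows = []
    if annots is None:
        return rows
    for el in annots:
        kind = el.tag.replace(XFDF_NS, "")
        page = int(el.get("page", "0")) + 1  # XFDF page numbers are 0-based
        contents = el.findtext(f"{XFDF_NS}contents", default="")
        rows.append((page, kind, contents))
    return rows


# Trimmed-down sample in the shape of the XFDF quoted above.
SAMPLE = """<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<annots>
<freetext page="0" title="K"><contents>Hello World!</contents></freetext>
<highlight page="0" title="K"/>
</annots>
</xfdf>"""

for page, kind, contents in summarize_xfdf(SAMPLE):
    print(f"Page {page}: {kind}: {contents}")
```

Note that the highlight comes out with empty contents, which matches the limitation discussed in this thread: the highlighted text itself is not stored in the annotation.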

ribtoks commented 3 years ago

@GitHubRulesOK In such case it's best to keep the "simple text format" simple (I mean no need for coordinates, rgb value or annotation type)

Something like

---
> quote here
(p. 123)
---

Will work both for .txt and .md.
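
A minimal Python sketch of a formatter for this layout, assuming the highlights are already available as (quote, page) pairs; how they are obtained from the PDF is outside this sketch, and the function name is hypothetical.

```python
def format_highlights(highlights):
    """Render (quote, page) pairs as '---' delimited blocks.

    Every line of a multi-line quote gets its own '> ' prefix, so the
    result is valid Markdown and still readable as plain text.
    """
    blocks = []
    for quote, page in highlights:
        quoted = "\n".join("> " + line for line in quote.splitlines())
        blocks.append(f"---\n{quoted}\n(p. {page})\n---")
    return "\n\n".join(blocks)


print(format_highlights([("quote here", 123)]))
```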

GitHubRulesOK commented 3 years ago

@ribtoks again I agree with the sentiment: keep it stupidly simple. However, experience of others' desires suggests the XY position within a page of multiple entries may aid in back-searching, such as used by LaTeX SyncTeX or other programmable recall. So "go to the highlight on page 10, half way down" is:

SumatraPDF -page 10 -zoom "fit width" -scroll 50,500 -reuse-instance MyFavorite.pdf

So colour export is of less value than the upper-left corner of the rect, which is desirable.
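
If an export did include the page and the rect position, a script could turn each entry back into such a command. A sketch in Python that only builds the argument list, using the same flags as the command above; the function name and the `exe` default (SumatraPDF being on the PATH) are assumptions.

```python
def sumatra_goto_args(pdf_path, page, x, y, exe="SumatraPDF"):
    """Build the argument list for jumping to a position in a PDF.

    Mirrors the command line quoted above; `exe` is assumed to be
    resolvable on the PATH, otherwise pass a full path.
    """
    return [
        exe,
        "-page", str(page),
        "-zoom", "fit width",
        "-scroll", f"{x},{y}",
        "-reuse-instance",
        pdf_path,
    ]


args = sumatra_goto_args("MyFavorite.pdf", page=10, x=50, y=500)
print(args)
```

The list can be handed to `subprocess.run(args)` as is; passing a list rather than a single string avoids quoting problems with the "fit width" value.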

ribtoks commented 3 years ago

@GitHubRulesOK My opinion is that XY coordinates might work well in a structured format like JSON. For simple text (human use) there's no point in providing XY: nobody will calculate on their own where the highlight is.

kjk commented 3 years ago

I started working on this. Currently it's at https://sumatra-online.onrender.com/exportpdfannotations

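To extract annotations:

  • drop a PDF file on the gray area
  • click 'extract annotations'
  • when it's done you can see JSON and text version in a text area below
  • when you switch between the version, they are also copied to clipboard so that it's easy to Ctrl-V into a text editor

Current limitations:

  • for highlight / underline etc. annotations there is no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself. it's possible to recover it, so I'll try to add it, but not today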

Give it a try and let me know how it can be improved.

It's very easy to build different text formats (see https://github.com/sumatrapdfreader/sumatraonline/blob/master/www/exportpdfannotations.html#L121 for the current) so I'm open to implementing several different versions of text output.

JSON output has all the information that PDF exposes, so is good for processing by code or writing custom transformations to text.

xh542428798 commented 3 years ago

I started working on this. Currently it's at https://sumatra-online.onrender.com/exportpdfannotations

To extract annotations:

  • drop a PDF file on the gray area
  • click 'extract annotations'
  • when it's done you can see JSON and text version in a text area below
  • when you switch between the version, they are also copied to clipboard so that it's easy to Ctrl-V into a text editor

Current limitations:

  • for highlight / underline etc. annotations there is no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself. it's possible to recover it, so I'll try to add it, but not today

Give it a try and let me know how it can be improved.

It's very easy to build different text formats (see https://github.com/sumatrapdfreader/sumatraonline/blob/master/www/exportpdfannotations.html#L121 for the current) so I'm open to implementing several different versions of text output.

JSON output has all the information that PDF exposes, so is good for processing by code or writing custom transformations to text.

Amazing! Love you! Could it be built into the program, so that the annotations are displayed once a PDF and its annotation JSON file are loaded?

GitHubRulesOK commented 3 years ago

no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself

That used to work? It was changed at a user's request so that the text is not included when using A, but is added manually by the user with SHIFT+A then CTRL+V.

I suggest SHIFT+A or another key pair could auto include the text

GitHubRulesOK commented 3 years ago

I like the extra info in the JSON, but it is too complex to use simply to parse a rect; conversely, the txt output does not give any clue as to an annotation's place on a page, which may often not be in -Y order.

ribtoks commented 3 years ago

no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself

That used to work? It was changed at a user's request so that the text is not included when using A, but is added manually by the user with SHIFT+A then CTRL+V.

This is the whole point of this feature: to have the text that was highlighted included by default. Otherwise there is absolutely no value in having only the coordinates of annotations.

kings2u commented 1 year ago

I started working on this. Currently it's at https://sumatra-online.onrender.com/exportpdfannotations

@kjk would you consider open sourcing this? I would love to be able to use it locally. Thanks!

GitHubRulesOK commented 1 year ago

@kings2u As an HTML function it's complex. Google reputedly liberated Sun JS from Oracle, and the core PDF handling is "Copyright 2021 Mozilla Foundation. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0". But then again parts are MIT and parts are not; FOSS is a minefield due to CopyWrongs.

Geobert commented 1 year ago

I’ve skimmed this thread but I’m unsure my use case is covered, so here it is: I have a PDF of some RPG rules. Some errata are issued by the creator. I annotate my PDF with the errata and want to share my annotations with the community, but I’m not allowed to share the PDF, obviously.

So I’d like to export my annotations, with some way for other owners of the same PDF to import my annotations and have the errata in their own copy.

GitHubRulesOK commented 1 year ago

@Geobert The export of annotations e.g. comments between users is called collaborative review or similar, Adobe are masters at providing corporate solutions that cover their products, but Reader is able to do that when backed up by more powerful Adobe suites.

A good editor for exporting comments is Tracker PDFXedit and even Foxit reader may have similar abilities, but neither may offer all Adobes review features.

For simple edits such as text, you can export an FDF file with just the comments; a user can then put the name of their master copy into the FDF as text, and opening the FDF will overstamp the PDF.

let me see if i can mock up an example.

Geobert commented 1 year ago

@GitHubRulesOK Thanks for your answer! I tried PDFXedit, it exports the comments fine into an FDF but when opening this FDF, it wants to open the annotated version. The import button is grayed out.

GitHubRulesOK commented 1 year ago

@Geobert

So I PDF this book :-) and add annotation to my copy then export as FDF

image

here is the file with the name of my copy

%FDF-1.4
%âãÏÓ
1 0 obj
<<
/FDF <<
/Annots [2 0 R]
/F (2beExportImported.pdf)
/UF (2beExportImported.pdf)
>>
/Type /Catalog
>>
endobj
2 0 obj
<<
/BS <<
/Type /Border
/W 1
>>
/Contents (This is a text... written in SumatraPDF as a demonstration)
/DA (/Helv 12 Tf 0 0 0 rg)
/F 4
/M (D:20230925140856Z)
/NM (ed50503f-4305-4e5a-acbc1c4266d7fb78)
/Page 0
/Rect [393.37059 559.2384 593.3706 659.2384]
/Subtype /FreeText
/T (lez)
/Type /Annot
>>
endobj
trailer
<<
/Root 1 0 R
>>
%%EOF

So we edit it to what we expect a user's copy to be, let's say "import.pdf" image, and send it to you (in reality Acrobat Reader does not change the name, as we would be using the same filename).

So I open the FDF by double click or Acrobat file open (SumatraPDF based on MuPDF does not have that ability)

Acrobat reader says that the FDF wants to write over the name i supplied "import.pdf"
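
The manual edit of the FDF can also be scripted. Below is a rough Python sketch that rewrites the `/F` and `/UF` file name strings shown in the FDF above. It is deliberately naive and rests on several assumptions: the new name is plain ASCII/Latin-1, the existing names are simple literal strings without nested parentheses, and no comment text happens to contain something that looks like `/F (...)`. The function name is made up.

```python
import re

# Matches "/F (...)" and "/UF (...)" literal strings. It does not match
# "/FDF <<" or the numeric flags entry "/F 4", which have no parenthesis.
_TARGET = re.compile(rb"/(UF|F)\s*\((?:\\.|[^\\()])*\)")


def retarget_fdf(fdf_bytes, new_pdf_name):
    """Point an FDF at a different PDF by rewriting its /F and /UF strings."""
    # Backslashes and parentheses must be escaped inside PDF literal strings.
    escaped = re.sub(r"([\\()])", r"\\\1", new_pdf_name).encode("latin-1")
    return _TARGET.sub(
        lambda m: b"/" + m.group(1) + b" (" + escaped + b")", fdf_bytes
    )


sample = (
    b"%FDF-1.4\n1 0 obj\n<<\n/FDF <<\n/Annots [2 0 R]\n"
    b"/F (2beExportImported.pdf)\n/UF (2beExportImported.pdf)\n>>\n"
    b"/Type /Catalog\n>>\nendobj\n"
    b"2 0 obj\n<<\n/F 4\n/Subtype /FreeText\n>>\nendobj\n"
)
print(retarget_fdf(sample, "import.pdf").decode("latin-1"))
```

Working on bytes rather than text keeps the binary marker line at the top of the file untouched.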

image image

Geobert commented 1 year ago

Oh I see, one needs to edit the FDF! Thanks!

EDIT: the PDF is locked for editing, so it seems it can’t be done. I thought it could, because we can annotate such a locked PDF, but the import of annotations does not seem to work :-/

GitHubRulesOK commented 1 year ago

@Geobert hmm, locked is a problem in many ways (also ensure the file is not in use). Adobe honours the protection (it is worthless in other readers), and most mainstream FDF apps will often be compatible with Adobe's restrictive DRM practice! So users would need to use an unlocked copy (plenty of web sites offer the service, paid or free). MuPDF and other tools such as qpdf can remove the restrictions easily, but that rewrites the source file. Not a problem for FDF, as it is an overlay on page numbers.

DD318 commented 2 months ago

Hi, Any progress on this? Is there a way to export annotations made on a pdf as a separate txt file?

GitHubRulesOK commented 2 months ago

@DD318

Hmm the experiment mentioned above is no longer available ?

Here is a sample from this page image

JSON is a very bloated format but if that's what you want there is Coherent cpdf and pdfcpu that use that format. Output from cpdf -list-annotations-json

  [
    1,
    53,
    {
      "/DA": { "U": "/Helv 12 Tf 1 0 0 rg" },
      "/BS": { "/Type": { "N": "/Border" }, "/W": { "I": 0 } },
      "/Rect": [
        { "F": 124.10363 },
        { "F": 457.0187 },
        { "F": 324.10365 },
        { "F": 557.0187 }
      ],
      "/Subtype": { "N": "/FreeText" },
      "/Type": { "N": "/Annot" },
      "/P": { "I": 1 },
      "/F": { "I": 4 },
      "/M": { "U": "D:20240909154734Z" },
      "/T": { "U": "K" },
      "/Contents": { "U": "DD318 here is an FDF example?" },
      "/AP": { "/N": 57 }
    }
  ],

If you just want the comments then use the lesser form (no details about position or colour etc.)

cpdf -list-annotations "page15*.pdf"
Page 1: DD318 here is an FDF example?
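
If the JSON form is what you have, the comments can be pulled out with a few lines of Python. This sketch infers the layout purely from the sample above (an outer array of `[page, object number, dictionary]` entries, with strings wrapped as `{"U": ...}`); other files or other cpdf versions may look different, so treat it as an assumption rather than a description of the format.

```python
import json


def comments_from_cpdf_json(text):
    """Return (page, object_number, contents) for entries with /Contents."""
    results = []
    for page, objnum, fields in json.loads(text):
        contents = fields.get("/Contents", {})
        # In the sample, strings are wrapped as {"U": "..."}.
        value = contents.get("U") if isinstance(contents, dict) else contents
        if value:
            results.append((page, objnum, value))
    return results


# Cut-down sample in the shape shown above, plus one entry without a comment.
SAMPLE = """[
  [1, 53, {"/Subtype": {"N": "/FreeText"},
           "/Contents": {"U": "DD318 here is an FDF example?"}}],
  [1, 32, {"/Subtype": {"N": "/Link"}}]
]"""

for page, objnum, contents in comments_from_cpdf_json(SAMPLE):
    print(f"Page {page} (obj {objnum}): {contents}")
```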

Here is an output from pdfcpu that is simpler but basic unless you query the object ( Here Obj 53 does not show colour etc. just the text, but -verbose logging is overwhelming)

optimizing...
12 annotations available

Page 1:

  FreeText:
     Obj# │ Id │ Rect                 │ Content
━━━━━━━━━━┿━━━━┿━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━
       53 │    │ (124, 457, 324, 557) │ DD318 here is an FDF example?

  Link:
     Obj# │ Id │ Rect                 │ Content
━━━━━━━━━━┿━━━━┿━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━

**7 entries removed to reduce space in this my comment**

       32 │    │ (176, 134, 195, 153) │ https://github.com/DD318
       33 │    │ ( 53, 798, 102, 813) │ https://github.com/Geobert
       34 │    │ ( 53, 619,  84, 633) │ https://github.com/DD318
       35 │    │ (146, 619, 212, 633) │ internal link

C:\Users\K\Downloads\SO\pdfcpu\pdfcpu_0.8.1_Windows_i386\pdfcpu_0.8.1_Windows_i386>

For import and export annotations Adobe developed FDF which looks like PDF but without the page contents just the overlay.

Actually the only one I could find in a hurry online is Apryse XFDF output but there are many other methods.

<?xml version="1.0" encoding=""?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <pdf-info xmlns="http://www.pdftron.com/pdfinfo" version="2" import-version="4"></pdf-info>
  <fields></fields>
  <annots>
    <freetext TextColor="#FF0000" width="0" flags="print" date="D:20240909154734Z" name="55deabbf03673f70-13b5a23648e7aaec" page="0" rect="124.1036,457.0187,324.1037,557.0187" title="K">
      <defaultstyle>font: Helvetica 12pt;color: #FF0000</defaultstyle>
      <defaultappearance>/Helv 12 Tf 1 0 0 rg</defaultappearance>
      <contents>DD318 here is an FDF example?</contents>
      <apref y="557.0187" x="124.1036" gennum="0" objnum="53"></apref>
    </freetext>
  </annots>
  <pages>
    <defmtx matrix="1,0,0,-1,0,841.92"></defmtx>
  </pages>
</xfdf>

MuPDF mutool has several options, but it would need scripting to reduce the native output. Here I have the one page and ask for that page's data; there are many /Annots, but I only want the ones with comments, so it is the previously mentioned number 53.

image

53 0 obj <</Type/Annot/Subtype/FreeText/Rect[124.10363 457.0187 324.10365 557.0187]/BS<</Type/Border/W 0>>/DA(/Helv 12 Tf 1 0 0 rg)/P 2 0 R/F 4/M(D:20240909154734Z)/T(K)/Contents(DD318 here is an FDF example?)/AP<</N 57 0 R>>>>

DD318 commented 2 months ago

@GitHubRulesOK thank you for explanation!

lukaszjablonski commented 6 hours ago

Since XFDF was mentioned above, is there any plan to add export of annotations to XFDF?

GitHubRulesOK commented 4 hours ago

@lukaszjablonski The import/export of FDF and XFDF are Adobe-specific methodologies from when they bought out that ability decades ago. Most specifically, Adobe Readers use PDF according to their own specifications, not always per ISO standards, so the "spec" was in theory per "Adobe technical note XML Forms Data Format Specification, Version 2.0." Thus not every third party may do it exactly the same. Now that we have ISO standards for PDF 2.0, it should be possible to write one fairly common format. But then again XFA, which was the prime user of XFDF, is deprecated as it is not an ISO standard.

Personally I use Tracker PDF Editor for that functionality but it is NOT always possible even with their extensive software to emulate Acrobats Proprietary formats. It would be akin to exporting XMLX format such as DocX using one of the many clones that cannot achieve 100% compatibility with MS Word compressed XML as used in 365.docX.

Converting FDF to XFDF should in theory be easy but why add so much more textual baggage (doubled tag wrapping) when FDF is more compact and compatible as PDF language and reputedly has greater abilities?

From the withdrawn and now replaced standard:-

FDF is a simplified version of PDF. PDF and FDF represent information with a key/value pair, also referred to as an entry. This example shows the T and V keys with values enclosed in parentheses:

   /T (Street) /V (345 Park Ave.)

XFDF, on the other hand, represents an entry with an XML element/content or attribute/value pair, as shown in the corresponding XFDF:

<field name="Street">
<value>345 Park Ave.</value>
</field>

XFDF implements a subset of FDF containing forms and annotations. There are XFDF equivalents for the Annots, Fields, F, and ID keys of the FDF dictionary. There are no XFDF equivalents for the other entries in the FDF dictionary such as the Status, Encoding, JavaScript, EmbeddedFDFs, Differences, Target, and Pages keys.
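
To make the correspondence concrete, here is a small Python sketch that turns the `/T` and `/V` pairs from the FDF example into the XFDF `field` elements. It is a toy under stated assumptions: it only handles simple literal strings without escapes or nested parentheses, the wrapping `fields` element follows the XFDF samples earlier in this thread, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# "/T (name) /V (value)" with simple literal strings only (an assumption).
_PAIR = re.compile(r"/T\s*\((.*?)\)\s*/V\s*\((.*?)\)")


def fdf_fields_to_xfdf(fdf_fragment):
    """Convert FDF /T and /V pairs into an XFDF-style <fields> element."""
    fields = ET.Element("fields")
    for name, value in _PAIR.findall(fdf_fragment):
        field = ET.SubElement(fields, "field", name=name)
        ET.SubElement(field, "value").text = value
    return ET.tostring(fields, encoding="unicode")


print(fdf_fields_to_xfdf("/T (Street) /V (345 Park Ave.)"))
```

The same data takes roughly twice the characters in XFDF, which is the "textual baggage" point made above.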

Newer version includes some less useful parts

ISO 19444-1:2019(en) Document management — XML Forms Data Format — Part 1: Use of ISO 32000-2 (XFDF 3.0) second edition cancels and replaces the first edition (ISO 19444-1:2016), which has been technically revised. The main changes compared to the previous edition are as follows:
— Addition of 3D comment related elements and attributes; — Stream encoding information for XFDF. A list of all parts in the ISO 19444 series can be found on the ISO website.

lukaszjablonski commented 3 hours ago

@GitHubRulesOK, fair! Thank you for further explanation.

Nevertheless, my question is whether exporting annotations to file (let's say FDF) from SumatraPDF is in scope of its development?

GitHubRulesOK commented 3 hours ago

@lukaszjablonski AFAIK MuPDF, the core of SumatraPDF, does not make forms import & export easy other than by using the more comprehensive "Python" variant PyMuPDF, with significant system dependencies. Even there I don't see any mention of such in https://pymupdf.readthedocs.io/en/latest/search.html?q=FDF&check_keywords=yes&area=default

but several suggestions in https://stackoverflow.com/questions/1106098/parse-annotations-from-a-pdf

That is not to say the situation could not change at any time, if somebody finds an easy method to match Adobe FDF methods. But it is not simple, hence few simple PDF reader applications use the FDF format apart from Acrobat DC.

The largest competitor to Adobe is probably Apryse and they have several products that talk in FDF language such as iText or PDFTron

You can use the historic (problematic) PDFtk but may need to enhance those text files.

You can use SumatraPDF to call PDFtk but would need to edit the output in a text editor.