Open ribtoks opened 3 years ago
@ribtoks
Without a physical example file, this issue would need to be closed since there have been so many changes in annotation handling since 3.3
[LATER EDIT] My mistake, wrong issue quoted; this topic is an enhancement request.
@GitHubRulesOK Thank you for fast reply, but unfortunately your reply does not correlate with my question.
I have a PDF file. Any of them with text. I create a highlight: I select text, then press "a" or right-click and create a highlight of the selected text. Then I would like to export all pieces of text that I have highlighted to a file, say "highlights.txt" (the exact way/format does not matter, the export function is what matters).
As for your answer, I never mentioned PDF files with an attachment. Would you mind double-checking my question?
Thanks again for answering so fast.
OK, my bad. Annotations are used for holding exportable text files as well as other file types; exporting all annotation comments as text content is an alternate usage, which I forgot because it was not a prior ability. Saving all comments was not a feature in basic Acrobat Reader 9, but may be found in more recent PDF editors.
@GitHubRulesOK Ok, thanks for the clarification. I hope highlights exporting will be implemented somehow. Let me know if I can help.
@kjk As you know all too well, there are multiple ways a user can add textual content:
- Annotation can carry embedded text files (a separate open issue, which I mistook for this request)
- Annotation can carry extensive screen text as "tooltips" without a visible object (a recent open issue)
- Annotation can carry visible "free text" (related to, but not this issue)
- Annotation, either as icon or highlight, can "pop up" comments either via tooltip or editor box
- There are others

In this case the requirement is to export at least the latter group to a text file: a feature of collecting and tagging page comments for fresh export, which would as a minimum require re-collating such objects into page order. The most likely request, if such an ability is built, will be to sort into page order by means of using/showing negative Y offsets.
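As a rough illustration of the re-collation described above, here is a minimal Python sketch that sorts annotations into page order and then top-to-bottom by negating the Y offset. The `Annot` record and the `reading_order` helper are hypothetical names made up for this example; they are not part of SumatraPDF or MuPDF.

```python
from typing import List, NamedTuple, Tuple


class Annot(NamedTuple):
    """Hypothetical minimal annotation record (not a SumatraPDF type)."""
    page: int                                  # 1-based page number
    rect: Tuple[float, float, float, float]    # (x0, y0, x1, y1) in PDF points
    text: str


def reading_order(annots: List[Annot]) -> List[Annot]:
    """Sort by page, then top-to-bottom, then left-to-right.

    PDF user space has its origin at the bottom-left of the page, so a
    larger Y means "higher up". Negating the top edge puts the upper
    annotations first within each page.
    """
    def key(a: Annot):
        top = max(a.rect[1], a.rect[3])
        left = min(a.rect[0], a.rect[2])
        return (a.page, -top, left)

    return sorted(annots, key=key)
```

Sorting on the top edge of the rect is only one possible choice; a multi-column page would need something smarter.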
I plan on enabling JavaScript bindings for operating on PDF files, like in mutool. This could be implemented as a JavaScript program.
@kjk Just a word of caution: if scripting actions are enabled, then unlike MuPDF and older Acrobat (where you need to remember to deactivate auto-running), I feel the default should be OFF, with a manual step provided to activate it on a per-use basis.
Is there any plan to develop a feature that saves annotations separately?
@xh542428798 Not clear from your comment if you are referring to a) exporting annotations that are files e.g. open issue #1602 b) a new feature such as report a list of annotations with their contents as described above c) save annotations external to a PDF as they were in the past and as still done in some other PDF readers (unlikely as problematic)
@ribtoks Could you propose what this format would look like? I assume the output would then be used in some way. I could copy Foliate (but it's an EPUB reader, not a PDF reader, so it might not translate 100%). What information should be included? Just the text of the annotation? The type (highlight, underline etc.)? Should it include page number / position on the page?
I could export in some simple text format, e.g.:
annotation 1
---
second annotation
---
third annotation
Or in json:
[
{ "text": "annotation 1", "page": 17, .... },
{ .... }]
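To make the two proposed outputs concrete, here is a small Python sketch that renders the same list of annotations either as `---`-separated plain text or as JSON. The dictionary keys beyond `text` and `page` are left out on purpose, since the thread has not settled on them; `to_plain_text` and `to_json` are names invented for this example.

```python
import json


def to_plain_text(annots):
    """Join annotation texts with a '---' separator line between entries."""
    return "\n---\n".join(a["text"] for a in annots)


def to_json(annots):
    """Dump the full records; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(annots, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    sample = [
        {"text": "annotation 1", "page": 17},
        {"text": "second annotation", "page": 18},
        {"text": "third annotation", "page": 21},
    ]
    print(to_plain_text(sample))
    print(to_json(sample))
```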
Just to kick off, here is the most basic output from Xchange (I deliberately kept it simple, but it should convey more):
////////////////////////////////////////////////////////////////////////////////////////////////////
// Summary of comments on MyOutput _[note this is a .pdf]_
////////////////////////////////////////////////////////////////////////////////////////////////////
Page: 1
----------------------------------------------------------------------------------------------------
Page: 1
Type: Ink Author: <None> Subject: <None> Date: <None>
Page: 1
Type: FreeText Author: K Subject: <None> Date: 2021-06-20, 04:05:05
Hello World!
Page: 1
Type: Highlight Author: K Subject: <None> Date: 2021-06-20, 22:05:04
Type: Text Author: K Subject: Sticky Note Date: 2021-09-10, 21:38:00
what is the context here there is no copied content use SHIFT A next time
So note that, as requested by the OP, it has the content for FreeText but not any content from Highlight, which is what was desired. In no case is there a hint of page position (X,Y,dx,dy), nor of colour coding as may be visually added for author or subject grouping. Those may be exported by means of an [X]FDF file, but that's way more complex as it's similar to the PDF page input.
I am sorry I didn't describe it clearly. Most of the time I don't want to save annotations in the original PDF file, so I wonder whether there is any way to save annotations outside the PDF, so that when I open a PDF, the corresponding annotation file is loaded at the same time. The annotation file format could perhaps be JSON? I know it is a big change for the software; maybe Sumatra could offer it as an option. Thanks a lot. Such as:
@kjk Thank you for working on this issue.
First of all, I'd like to remind the whole idea why I need it:
1. As for "simple text format", it can be plain text or Markdown (my preference). The only information I need there is highlights and notes, chapter title, subheading title (if there is one), and maybe page number. There's no need to know the type of annotation or other technical details in this "simple" text format.
2. Additionally, JSON would also be quite convenient: it can contain more technical information, so that whoever exports can run some sort of jq template over it and make the "simple text format" from (1) themselves. In my eyes this could be a step 2, extending the "simple text format" from 1.

I'd like to provide an example of the "simple text format", because for JSON you will know better what "properties" you have for an annotation, and for JSON you can just dump all of them.
---
#### Chapter 3: How to do XYZ - yellow - p. 123
> Here goes the actual quote: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
---
@ribtoks Your description is good, but PDF has no such construct as a paragraph or chapter unless defined by the human eye. A highlight has a highly complex structure that can be considered an overlay above a single page of content; it does not know about the underlying text or "chapter", only its co-ordinates on the page. It is the user that inserts the comments (unless auto-copied at the time of highlighting). Thus what can be garnered for listing is limited to:
<coded ID of other comment>
There are two prescribed methods, usable for programmable extraction in several ways, but they are not of value to a human reader as they are designed to EX-port the overlay. The norm is FDF and the XML version is XFDF, but they then need conversion into JSON / XML using complex decoding. As an example, the XML one for a few comments would run to pages and look like:
<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<f href="../../../MyData/out4.pdf"/>
<ids original="C15E80D5BDF0828DC94C16477298D2DF" modified="7FFD0396F1FFF203EB9161E04C09935B"/>
<annots>
<ink page="0" flags="print" name="2d66392a-0430-4d28-815f6c00f6404f8a" rect="327.053986,595.205017,595.440002,771.515991" color="#004DE6">
<inklist><gesture>461.785004,761.403992;461.785004,761.531006;461.785004,761.828003;461.785004,762.286987;461.657013,762.877014;461.359985,763.559021;460.901001,764.171021;460.183014,764.771973;459.075012,765.392029;457.705994,765.916992;456.183014,766.432983;454.585999,766.982971;452.970001,767.585022;451.234985,768.111023;449.480011,768.63></inklist>
</ink>
<freetext intent="FreeText" IT="FreeText" title="K" page="0" date="D:20210620030505Z" flags="print" name="a104f142-9195-4156-85844ebed3dc4daa" rect="359.740875,605.969116,559.740845,705.969116" width="0">
<contents>Hello World!</contents><defaultappearance>/Helv 30 Tf 1 0 0 rg</defaultappearance></freetext>
<highlight coords="39,674.669983,198,674.669983,39,661.669983,198,661.669983,18,662.669983,306,662.669983,18,649.669983,306,649.669983,18,650.669983,104,650.669983,18,637.669983,104,637.669983" title="K" page="0" date="D:20210620210504Z" flags="print" name="0631dd8c-bae4-46f5-b1841f9c9e5d16e3" rect="14.935769,636.857483,309.06424,675.482483" color="#FF00FF"/>
<text icon="Comment" inreplyto="2d66392a-0430-4d28-815f6c00f6404f8a" title="K" creationdate="D:20210910212148+01'00'" page="0" date="D:20210910212148+01'00'" flags="hidden,print,nozoom,norotate" name="3e3d1483-529e-4b7c-b6c64eeda95dd864" rect="100,102,120,120" color="#FFFF00"/>
</annots>
</xfdf>
So not human readable XML as one might expect
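Although XFDF like the sample above is not pleasant to read, it is regular XML, so a short script can reduce it to the few fields a person cares about. The following is a minimal sketch using only the Python standard library; `summarize_xfdf` is a hypothetical helper name, and the trimmed `SAMPLE` is cut down from the export shown above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the <xfdf> root element.
XFDF_NS = {"x": "http://ns.adobe.com/xfdf/"}


def summarize_xfdf(xml_text):
    """Return one dict per annotation: type, 1-based page, rect and text."""
    root = ET.fromstring(xml_text)
    rows = []
    for el in root.findall("x:annots/*", XFDF_NS):
        kind = el.tag.split("}", 1)[-1]          # drop the "{namespace}" prefix
        contents = el.find("x:contents", XFDF_NS)
        rows.append({
            "type": kind,
            "page": int(el.get("page", "0")) + 1,  # XFDF pages are 0-based
            "rect": [float(v) for v in el.get("rect", "").split(",") if v],
            "text": (contents.text or "") if contents is not None else "",
        })
    return rows


SAMPLE = """<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<annots>
<freetext title="K" page="0" rect="359.740875,605.969116,559.740845,705.969116"><contents>Hello World!</contents></freetext>
<highlight title="K" page="0" rect="14.935769,636.857483,309.06424,675.482483" color="#FF00FF"/>
</annots>
</xfdf>"""

if __name__ == "__main__":
    for row in summarize_xfdf(SAMPLE):
        print(row)
```

Note that the highlight comes back with empty text: as in the export above, the highlighted words themselves are not stored in the annotation.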
@GitHubRulesOK In such case it's best to keep the "simple text format" simple (I mean no need for coordinates, rgb value or annotation type)
Something like
---
> quote here
(p. 123)
---
Will work both for .txt and .md.
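A minimal Python sketch of that simple format, assuming each highlight is available as a dictionary with `text` and `page` keys (an assumption for illustration; `to_markdown` is a made-up name, not an existing exporter):

```python
def to_markdown(highlights):
    """Render each highlight as a '>' quote followed by '(p. N)', split by '---'."""
    if not highlights:
        return ""
    blocks = []
    for h in highlights:
        # Prefix every line so multi-line quotes stay inside the blockquote.
        quote = "\n".join("> " + line for line in h["text"].splitlines())
        blocks.append(f"{quote}\n(p. {h['page']})")
    return "---\n" + "\n---\n".join(blocks) + "\n---\n"


if __name__ == "__main__":
    print(to_markdown([{"text": "quote here", "page": 123}]), end="")
```

The same output is valid plain text and valid Markdown, which is the point of keeping it this small.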
@ribtoks Again, I agree with the sentiment: keep it stupidly simple. However, experience of what others want suggests that the XY position within a page of multiple entries may aid back-searching, such as used by LaTeX SyncTeX or other programmable recall. So going to a highlight on page 10, half way down, is:
SumatraPDF -page 10 -zoom "fit width" -scroll 50,500 -reuse-instance MyFavorite.pdf
So colour export is of less value than the rect's upper-left corner, which is desirable.
@GitHubRulesOK My opinion is that XY coordinates might work well in a structured format like JSON. For simple text (human use) there's no point in providing XY: nobody will calculate on their own where the highlight is.
I started working on this. Currently it's at https://sumatra-online.onrender.com/exportpdfannotations
To extract annotations:
- drop a PDF file on the gray area
- click 'extract annotations'
- when it's done you can see JSON and text version in a text area below
- when you switch between the versions, they are also copied to the clipboard so that it's easy to Ctrl-V into a text editor
Current limitations:
- for highlight / underline etc. annotations there is no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself. It's possible to recover it, so I'll try to add it, but not today
Give it a try and let me know how it can be improved.
It's very easy to build different text formats (see https://github.com/sumatrapdfreader/sumatraonline/blob/master/www/exportpdfannotations.html#L121 for the current) so I'm open to implementing several different versions of text output.
JSON output has all the information that PDF exposes, so is good for processing by code or writing custom transformations to text.
Amazing! Love you! Could it be built into the program, so that the annotations are displayed once the PDF and the annotation JSON file are loaded?

> no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself

Wasn't that changed from working? By user request the text is not included when using A, but is added manually by the user with SHIFT+A then CTRL+V.
I suggest SHIFT+A or another key pair could auto-include the text.
I like the extra info in the JSON, but it is too complex to parse simply for a rect; conversely, the txt output does not give any clue as to an annotation's place on a page, which may often not be in -Y order.
> no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself
>
> That was changed from working? by a user request it not be included using A but manual by user using SHIFT+A CTRL +V

This is the whole point of this feature: to have the text that was highlighted by default. Otherwise there is absolutely no value in having only the coordinates of annotations.
> I started working on this. Currently it's at https://sumatra-online.onrender.com/exportpdfannotations

@kjk would you consider open sourcing this? I would love to be able to use it locally. Thanks!
@kings2u As an HTML function it's complex. Google reputedly liberated Sun JS from Oracle, and the core PDF handling is "Copyright 2021 Mozilla Foundation. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0", but then again parts are MIT and parts are not. FOSS is a minefield due to CopyWrongs.
I’ve skimmed this thread but I’m unsure my use case is covered, so here it is: I have a PDF of RPG rules. Some errata are issued by the creator. I annotate my PDF with the errata and want to share my annotations with the community, but I’m not allowed to share the PDF, obviously.
So I’d like to export my annotations, with some way for other owners of the same PDF to import them and have the errata in their own copy.
@Geobert The export of annotations, e.g. comments between users, is called collaborative review or similar. Adobe are masters at providing corporate solutions that cover their products, but Reader is only able to do that when backed up by more powerful Adobe suites.
A good editor for exporting comments is Tracker PDFXedit, and even Foxit Reader may have similar abilities, but neither may offer all of Adobe's review features.
For simple edits such as text, you can export an FDF file with just the comments; a user can add their master copy's name as text, and then opening the FDF will over-stamp the PDF.
Let me see if I can mock up an example.
@GitHubRulesOK Thanks for your answer! I tried PDFXedit, it exports the comments fine into an FDF but when opening this FDF, it wants to open the annotated version. The import button is grayed out.
@Geobert
So I PDF this book :-) and add an annotation to my copy, then export as FDF.
Here is the file, with the name of my copy:
%FDF-1.4
%âãÏÓ
1 0 obj
<<
/FDF <<
/Annots [2 0 R]
/F (2beExportImported.pdf)
/UF (2beExportImported.pdf)
>>
/Type /Catalog
>>
endobj
2 0 obj
<<
/BS <<
/Type /Border
/W 1
>>
/Contents (This is a text... written in SumatraPDF as a demonstration)
/DA (/Helv 12 Tf 0 0 0 rg)
/F 4
/M (D:20230925140856Z)
/NM (ed50503f-4305-4e5a-acbc1c4266d7fb78)
/Page 0
/Rect [393.37059 559.2384 593.3706 659.2384]
/Subtype /FreeText
/T (lez)
/Type /Annot
>>
endobj
trailer
<<
/Root 1 0 R
>>
%%EOF
So we edit it to what we expect a user's copy to be called, let's say "import.pdf", and send it to you (in reality Acrobat Reader does not change the name, as we would be using the same filename).
So I open the FDF by double-click or via Acrobat's file open (SumatraPDF, based on MuPDF, does not have that ability).
Acrobat Reader says that the FDF wants to write over the name I supplied, "import.pdf".
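The hand edit described above (changing the `/F` and `/UF` file names so the FDF targets the recipient's copy) could be scripted. Here is a hedged Python sketch: `retarget_fdf` is a hypothetical helper, it only handles ASCII file names written as literal `( ... )` strings, and it relies on the FDF having no cross-reference table with byte offsets, as in the file shown above.

```python
import re

# Matches "/F (name)" and "/UF (name)" literal strings, but not "/F 4"
# (the integer annotation flags) or other keys such as /T or /DA.
_FILE_KEY = re.compile(rb"/(UF|F)(\s*)\((?:[^()\\]|\\.)*\)")


def retarget_fdf(fdf_bytes: bytes, new_pdf_name: str) -> bytes:
    """Point the FDF at a different PDF file name (ASCII names only)."""
    name = new_pdf_name.encode("ascii")  # raises UnicodeEncodeError otherwise
    # Escape the characters that are special inside a PDF literal string.
    name = name.replace(b"\\", b"\\\\").replace(b"(", b"\\(").replace(b")", b"\\)")
    return _FILE_KEY.sub(
        lambda m: b"/" + m.group(1) + m.group(2) + b"(" + name + b")",
        fdf_bytes,
    )


if __name__ == "__main__":
    original = b"/F (2beExportImported.pdf)\n/UF (2beExportImported.pdf)\n/F 4\n"
    print(retarget_fdf(original, "import.pdf").decode("ascii"))
```

Working on bytes rather than text avoids disturbing the binary marker line near the top of the file.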
Oh I see, one needs to edit the FDF! Thanks!
EDIT: the PDF is locked for editing, so it seems it can’t be done. I thought it could, because we can annotate such a locked PDF, but the import of annotations seems not to work :-/
@Geobert Hmm, locked is a problem (ensure the file is not in use). Adobe locks files in many ways (the protection is worthless in other readers), and most mainstream FDF apps will often comply with Adobe's restrictive DRM practice! So users would need to use an unlocked copy (plenty of web sites offer the service, paid or free). MuPDF and other tools such as qpdf can remove the restrictions easily, but that rewrites the source file. That is not a problem for FDF, as it is an overlay keyed on page numbers.
Hi, Any progress on this? Is there a way to export annotations made on a pdf as a separate txt file?
@DD318
Hmm the experiment mentioned above is no longer available ?
Here is a sample from this page
JSON is a very bloated format but if that's what you want there is Coherent cpdf and pdfcpu that use that format. Output from cpdf -list-annotations-json
[
1,
53,
{
"/DA": { "U": "/Helv 12 Tf 1 0 0 rg" },
"/BS": { "/Type": { "N": "/Border" }, "/W": { "I": 0 } },
"/Rect": [
{ "F": 124.10363 },
{ "F": 457.0187 },
{ "F": 324.10365 },
{ "F": 557.0187 }
],
"/Subtype": { "N": "/FreeText" },
"/Type": { "N": "/Annot" },
"/P": { "I": 1 },
"/F": { "I": 4 },
"/M": { "U": "D:20240909154734Z" },
"/T": { "U": "K" },
"/Contents": { "U": "DD318 here is an FDF example?" },
"/AP": { "/N": 57 }
}
],
If you just want the comments then use the lesser form (no details about position or colour etc.)
cpdf -list-annotations "page15*.pdf"
Page 1: DD318 here is an FDF example?
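For something between those two extremes, the JSON can be flattened with a few lines of Python. This sketch assumes the full `-list-annotations-json` output is a JSON array of `[page, object number, annotation]` triples, and that values are wrapped in one-key type markers (`U`, `N`, `I`, `F`) as in the excerpt above; `unwrap` and `comments` are names invented for this example.

```python
import json


def unwrap(value):
    """Strip one-key type wrappers such as {"U": "text"} or {"F": 1.5}."""
    if isinstance(value, dict):
        if len(value) == 1:
            (key, inner), = value.items()
            if key in ("U", "N", "I", "F"):
                return inner
        return {k: unwrap(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap(v) for v in value]
    return value


def comments(cpdf_json_text):
    """Return (page, object number, subtype, rect, text) for annotations with /Contents."""
    rows = []
    for page, objnum, annot in json.loads(cpdf_json_text):
        flat = unwrap(annot)
        if "/Contents" in flat:
            rows.append((page, objnum, flat.get("/Subtype"),
                         flat.get("/Rect"), flat["/Contents"]))
    return rows


SAMPLE = """[[1, 53, {
  "/Rect": [{"F": 124.10363}, {"F": 457.0187}, {"F": 324.10365}, {"F": 557.0187}],
  "/Subtype": {"N": "/FreeText"},
  "/Contents": {"U": "DD318 here is an FDF example?"},
  "/AP": {"/N": 57}
}]]"""

if __name__ == "__main__":
    for row in comments(SAMPLE):
        print(row)
```

This keeps the rect alongside the comment text, which the plain `-list-annotations` form leaves out.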
Here is an output from pdfcpu that is simpler but basic unless you query the object ( Here Obj 53 does not show colour etc. just the text, but -verbose logging is overwhelming)
optimizing...
12 annotations available
Page 1:
FreeText:
Obj# │ Id │ Rect │ Content
━━━━━━━━━━┿━━━━┿━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━
53 │ │ (124, 457, 324, 557) │ DD318 here is an FDF example?
Link:
Obj# │ Id │ Rect │ Content
━━━━━━━━━━┿━━━━┿━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━
**7 entries removed to reduce space in this my comment**
32 │ │ (176, 134, 195, 153) │ https://github.com/DD318
33 │ │ ( 53, 798, 102, 813) │ https://github.com/Geobert
34 │ │ ( 53, 619, 84, 633) │ https://github.com/DD318
35 │ │ (146, 619, 212, 633) │ internal link
C:\Users\K\Downloads\SO\pdfcpu\pdfcpu_0.8.1_Windows_i386\pdfcpu_0.8.1_Windows_i386>
For import and export annotations Adobe developed FDF which looks like PDF but without the page contents just the overlay.
Actually the only one I could find in a hurry online is Apryse XFDF output but there are many other methods.
<?xml version="1.0" encoding=""?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<pdf-info xmlns="http://www.pdftron.com/pdfinfo" version="2" import-version="4"></pdf-info>
<fields></fields>
<annots>
<freetext TextColor="#FF0000" width="0" flags="print" date="D:20240909154734Z" name="55deabbf03673f70-13b5a23648e7aaec" page="0" rect="124.1036,457.0187,324.1037,557.0187" title="K">
<defaultstyle>font: Helvetica 12pt;color: #FF0000</defaultstyle>
<defaultappearance>/Helv 12 Tf 1 0 0 rg</defaultappearance>
<contents>DD318 here is an FDF example?</contents>
<apref y="557.0187" x="124.1036" gennum="0" objnum="53"></apref>
</freetext>
</annots>
<pages>
<defmtx matrix="1,0,0,-1,0,841.92"></defmtx>
</pages>
</xfdf>
MuPDF's mutool has several options, but it would need scripting to reduce the native output. So here I have the one page and ask for that page's data; there are many /Annots, but I only want the ones with comments, which is the previously mentioned object number 53:
53 0 obj <</Type/Annot/Subtype/FreeText/Rect[124.10363 457.0187 324.10365 557.0187]/BS<</Type/Border/W 0>>/DA(/Helv 12 Tf 1 0 0 rg)/P 2 0 R/F 4/M(D:20240909154734Z)/T(K)/Contents(DD318 here is an FDF example?)/AP<</N 57 0 R>>>>
@GitHubRulesOK thank you for explanation!
Since XFDF was mentioned above, is there any plan to add export of annotations to XFDF?
@lukaszjablonski The import/export of FDF and XFDF are Adobe-specific methodologies from when they bought out that ability decades ago. Most specifically, Adobe Readers use PDF according to their own specifications, not always per ISO standards, so the "spec" was in theory per "Adobe technical note XML Forms Data Format Specification, Version 2.0". Thus not every third party may do it exactly the same. Although now that we have ISO standards for PDF 2.0, it should be possible to write one fairly common format. But then again XFA, which was the prime user of XFDF, is "deprecated" as it is not an ISO standard.
Personally I use Tracker PDF Editor for that functionality, but it is NOT always possible, even with their extensive software, to emulate Acrobat's proprietary formats. It would be akin to exporting XMLX format such as DocX using one of the many clones that cannot achieve 100% compatibility with MS Word compressed XML as used in 365.docX.
Converting FDF to XFDF should in theory be easy but why add so much more textual baggage (doubled tag wrapping) when FDF is more compact and compatible as PDF language and reputedly has greater abilities?
From the withdrawn and now replaced standard:-
FDF is a simplified version of PDF. PDF and FDF represent information with a key/value pair, also referred to as an entry. This example shows the T and V keys with values enclosed in parentheses:
/T (Street) /V (345 Park Ave.)
XFDF, on the other hand, represents an entry with an XML element/content or attribute/value pair, as shown in the corresponding XFDF:
<field name="Street"> <value>345 Park Ave.</value> </field>
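To illustrate how mechanical that correspondence is, here is a small Python sketch that turns `/T (...) /V (...)` pairs into XFDF `<field>` elements. It is a toy under stated assumptions: `fdf_fields_to_xfdf` is a made-up name, the regular expression only handles literal strings without nested parentheses, and PDF escape sequences are not decoded.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# A /T (name) entry directly followed by a /V (value) entry.
_PAIR = re.compile(
    r"/T\s*\(((?:[^()\\]|\\.)*)\)\s*/V\s*\(((?:[^()\\]|\\.)*)\)"
)


def fdf_fields_to_xfdf(fdf_text):
    """Return one XFDF <field> element string per /T ... /V pair found."""
    return [
        f"<field name={quoteattr(name)}> <value>{escape(value)}</value> </field>"
        for name, value in _PAIR.findall(fdf_text)
    ]


if __name__ == "__main__":
    for element in fdf_fields_to_xfdf("/T (Street) /V (345 Park Ave.)"):
        print(element)
```

`quoteattr` and `escape` take care of characters such as `&` and `<` that are harmless in FDF but must be escaped in XML.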
Hi
Thanks for developing Sumatra PDF reader. I was very excited to finally get PDF annotations released in version 3.3. Thank you for the hard work!
One feature that I'm missing though is the ability to export the annotations, in whatever format possible (e.g. .txt). I'm using this to make notes from the book and save them separately. Later I might make Anki flashcards from the notes or just save them in my notebook. There are examples of other software that can do that under Linux (for example, Foliate, which can export annotations as HTML, Markdown or plain text) and I'm missing this from SumatraPDF. Would be incredible to see this feature!