Support marked content - Githubissues

samuraitruong commented 3 years ago

Hi team,

Do we support document tagging? I read through the code and the documentation, but it doesn't seem to be supported.

Thanks

HackbrettXXX commented 3 years ago

What do you mean by "document taggings"?

samuraitruong commented 3 years ago

@HackbrettXXX sorry for the confusion. I should have called it content tagging, which is one of the requirements for PDF accessibility. Details here: https://accessible-docs.com/tagging-accessible-pdf/

Thanks

HackbrettXXX commented 3 years ago

@samuraitruong thanks for the clarification. jsPDF does not support this currently. Pull requests are very welcome, though :)

samuraitruong commented 3 years ago

@HackbrettXXX thanks, I will try my best to see if I can introduce this feature. Do you have any hints that would help me get up to speed?

HackbrettXXX commented 3 years ago

Not really. But have a look at our contribution guide.

stefan123t commented 3 years ago

You may want to take a look at the PDF 2.0 ISO 32000-2:2020 Standard. Please check the relevant chapters on 14.6 Marked Content (page 698-700), 14.7 Logical Structure (page 700-721) and 14.8 Tagged PDF (page 722-766).

I found a draft (DIS) copy here: intranet.pdfa.org/wp-content/uploads/2016/08/ISO_DIS_32000-2-DIS4.pdf

stefan123t commented 3 years ago

According to Chapter 14.6, marked-content elements and the operators that mark them shall fall into two categories:

• The MP and DP operators shall designate a single marked-content point in the content stream.
• The BMC, BDC, and EMC operators shall bracket a marked-content sequence of objects within the content stream.
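
To make the two categories concrete, here is a minimal sketch of how the operators could be serialized. The helper names are hypothetical and not part of jsPDF, and the property values are assumed to be plain numbers that need no escaping.

```javascript
// Hypothetical helpers; none of these names exist in jsPDF.

// Serialize a property dictionary such as { MCID: 0 } to "<</MCID 0>>".
// Assumes numeric values only, so no PDF string escaping is done.
function propertyDict(props) {
  const entries = Object.entries(props).map(([key, value]) => `/${key} ${value}`);
  return `<<${entries.join(" ")}>>`;
}

// Category 1: a single marked-content point (MP, or DP with properties).
function markedContentPoint(tag, props) {
  return props ? `/${tag} ${propertyDict(props)} DP` : `/${tag} MP`;
}

// Category 2: a marked-content sequence bracketed by BMC or BDC and EMC.
function markedContentSequence(tag, props, operators) {
  const begin = props ? `/${tag} ${propertyDict(props)} BDC` : `/${tag} BMC`;
  return [begin, ...operators, "EMC"].join("\n");
}

const heading = markedContentSequence("Span", { MCID: 0 }, [
  "/F1 1 Tf",
  "(This is a first level heading.) Tj",
]);
console.log(heading);
```

BDC and DP take a property list, which is where the /MCID used by the logical structure goes; BMC and MP carry only the tag.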

Chapter 14.7 holds an extensive example:

14.7.7 Example of logical structure

The Example shows portions of a PDF file with a simple document structure. The structure tree root (object 300) contains elements with structure types Chap (object 301) and Para (object 304). The Chap element, titled Chapter 1, contains elements with types Head1 (object 302) and Para (object 303).

These elements are mapped to the standard structure types specified in tagged PDF (see 14.8.4, "Standard structure types") by means of the role map specified in the structure tree root. Objects 302 through 304 have attached attributes (see 14.7.6, "Structure attributes" and 14.8.5, "Standard structure attributes").

The example in this subclause also illustrates the structure of a parent tree (object 400) that maps content items back to their parent structure elements and an ID tree (object 403) that maps element identifiers to the structure elements they denote.

EXAMPLE

1 0 obj                                             %Document catalog
    <</Type /Catalog
        /Pages 100 0 R                              %Page tree
        /StructTreeRoot 300 0 R                     %Structure tree root
    >>
endobj

100 0 obj                                           %Page tree
    <</Type /Pages
        /Kids [101 1 R                              %First page object
                102 0 R                             %Second page object
            ]
        /Count 2                                    %Page count
    >>
endobj

101 1 obj                                           %First page object
    <</Type /Page
        /Parent 100 0 R                             %Parent is the page tree
        /Resources <</Font <</F1 6 0 R              %Font resources
                    /F12 7 0 R
                >>
            >>
        /MediaBox [0 0 612 792]                     %Media box
        /Contents 201 0 R                           %Content stream

        /StructParents 0                            %Parent tree key
    >>
endobj

201 0 obj                                           %Content stream for first page
    <</Length ...>>
stream
    1 1 1 rg
    0 0 612 792 re f
    BT                                              %Start of text object
        /Span <</MCID 0>>                           %Start of marked-content sequence 0
            BDC
                0 0 0 rg
                /F1 1 Tf
                30 0 0 30 18 732 Tm
                (This is a first level heading. Hello world: ) Tj
                1.1333 TL T*
                (goodbye universe.) Tj
            EMC                                     %End of marked-content sequence 0

        /Span <</MCID 1>>                           %Start of marked-content sequence 1
            BDC
                /F12 1 Tf
                14 0 0 14 18 660.8 Tm
                (This is the first paragraph, which spans pages. It has four fairly short and \
                concise sentences. This is the next to last ) Tj
            EMC                                     %End of marked-content sequence 1
    ET                                              %End of text object
endstream
endobj

102 0 obj                                           %Second page object
    <</Type /Page
        /Parent 100 0 R                             %Parent is the page tree
        /Resources <</Font <</F1 6 0 R              %Font resources
                    /F12 7 0 R
                >>
            >>
        /MediaBox [0 0 612 792]                     %Media box
        /Contents 202 0 R                           %Content stream
        /StructParents 1                            %Parent tree key
    >>
endobj

202 0 obj                                           %Content stream for second page
    <</Length ...>>
stream
    1 1 1 rg
    0 0 612 792 re f
    BT                                              %Start of text object
        /Para <</MCID 0>>                           %Start of marked-content sequence 0
            BDC
                0 0 0 rg
                /F12 1 Tf
                14 0 0 14 18 732 Tm
                (sentence. This is the very last sentence of the first paragraph.) Tj
            EMC                                     %End of marked-content sequence 0

        /Span <</MCID 1>>                           %Start of marked-content sequence 1
            BDC
                /F12 1 Tf
                14 0 0 14 18 570.8 Tm
                ( This is the second paragraph. It has four fairly short and concise sentences . \
                This is the next to last ) Tj
            EMC                                     %End of marked-content sequence 1

        /Span <</MCID 2>>                           %Start of marked-content sequence 2
            BDC
                1.1429 TL
                T*
                (sentence. This is the very last sentence of the second paragraph.) Tj
            EMC                                     %End of marked-content sequence 2
    ET                                              %End of text object
endstream
endobj

300 0 obj                                           %Structure tree root
    <</Type /StructTreeRoot
        /K [301 0 R                                 %Two children: a chapter
            304 0 R                                 %and a paragraph
        ]
        /RoleMap <</Chap /Sect                      %Mapping to standard structure types
            /Head1 /H
            /Para /P
        >>
        /ClassMap <</Normal 305 0 R>>               %Class map containing one attribute class
        /ParentTree 400 0 R                         %Number tree for parent elements
        /ParentTreeNextKey 2                        %Next key to use in parent tree
        /IDTree 403 0 R                             %Name tree for element identifiers
    >>
endobj

301 0 obj                                           %Structure element for a chapter
    <</Type /StructElem
        /S /Chap
        /ID (Chap1)                                 %Element identifier
        /T (Chapter 1)                              %Human-readable title
        /P 300 0 R                                  %Parent is the structure tree root
        /K [302 0 R                                 %Two children: a section head
            303 0 R                                 %and a paragraph
        ]
    >>
endobj

302 0 obj                                           %Structure element for a section head
    <</Type /StructElem
        /S /Head1
        /ID (Sec1.1)                                %Element identifier
        /T (Section 1.1)                            %Human-readable title
        /P 301 0 R                                  %Parent is the chapter
        /Pg 101 1 R                                 %Page containing content items
        /A <</O /Layout                             %Attribute owned by Layout
            /SpaceAfter 25
            /SpaceBefore 0
            /TextIndent 12.5
        >>
        /K 0                                        %Marked-content sequence 0
    >>
endobj

303 0 obj                                           %Structure element for a paragraph
    <</Type /StructElem
        /S /Para
        /ID (Para1)                                 %Element identifier
        /P 301 0 R                                  %Parent is the chapter
        /Pg 101 1 R                                 %Page containing first content item
        /C /Normal                                  %Class containing this element’s attributes
        /K [1                                       %Marked-content sequence 1
            <</Type /MCR                            %Marked-content reference to 2nd item
                /Pg 102 0 R                         %Page containing second item
                /MCID 0                             %Marked-content sequence 0
            >>
        ]
    >>
endobj

304 0 obj                                           %Structure element for another paragraph
    << /Type /StructElem
        /S /Para
        /P 300 0 R                                  %Parent is the structure tree root
        /Pg 102 0 R                                 %Page containing content items
        /C /Normal                                  %Class containing this element’s attributes
        /A << /O /Layout
            /TextAlign /Justify                     %Overrides attribute provided by classmap
        >>
        /K [1 2]                                    %Marked-content sequences 1 and 2
    >>
endobj

305 0 obj                                           %Attribute class
    << /O /Layout                                   %Owned by Layout
        /EndIndent 0
        /StartIndent 0
        /WritingMode /LrTb
        /TextAlign /Start
    >>
endobj

400 0 obj                                           %Parent tree (number tree)
    <</Nums [0 401 0 R                              %Parent elements for first page
            1 402 0 R                               %Parent elements for second page
        ]
    >>
endobj

401 0 obj                                           %Array of parent elements for first page
    [302 0 R                                        %Parent of marked-content sequence 0
        303 0 R                                     %Parent of marked-content sequence 1
    ]
endobj

402 0 obj                                           %Array of parent elements
    [303 0 R                                        %Parent of marked-content
        304 0 R                                     %Parent of marked-content
        304 0 R                                     %Parent of marked-content
    ]
endobj

403 0 obj                                           %ID tree root node
    <</Kids [404 0 R]>>                             %Reference to leaf node
endobj

404 0 obj                                           %ID tree leaf node
    <</Limits [(Chap1) (Sec1.3)]                    %Least and greatest keys in tree
            /Names [(Chap1) 301 0 R                 %Mapping from element identifiers
            (Sec1.1) 302 0 R                        %to structure elements
            (Sec1.2) 303 0 R
            (Sec1.3) 304 0 R
        ]
    >>
endobj
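
The parent tree arrays (objects 401 and 402) can be derived from the structure elements rather than maintained by hand. The following is a rough sketch under that assumption; the data model and function names are made up for illustration, and every reference is written with generation number 0.

```javascript
// Hypothetical data model mirroring objects 302-304 of the example.
// Each kid names the page's /StructParents key and the MCID on that page.
const structElems = [
  { obj: 302, kids: [{ page: 0, mcid: 0 }] },
  { obj: 303, kids: [{ page: 0, mcid: 1 }, { page: 1, mcid: 0 }] },
  { obj: 304, kids: [{ page: 1, mcid: 1 }, { page: 1, mcid: 2 }] },
];

// Build, for every page, an array indexed by MCID whose entries are the
// object numbers of the owning structure elements.
function buildParentTree(elems) {
  const tree = new Map();
  for (const elem of elems) {
    for (const { page, mcid } of elem.kids) {
      if (!tree.has(page)) tree.set(page, []);
      tree.get(page)[mcid] = elem.obj;
    }
  }
  return tree;
}

// Serialize one per-page array as a PDF array of indirect references.
function toPdfArray(objNumbers) {
  return `[${objNumbers.map((n) => `${n} 0 R`).join(" ")}]`;
}

const parentTree = buildParentTree(structElems);
console.log(toPdfArray(parentTree.get(0))); // [302 0 R 303 0 R]
console.log(toPdfArray(parentTree.get(1))); // [303 0 R 304 0 R 304 0 R]
```

The two printed arrays match the contents of objects 401 and 402 above.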

I think special care needs to be taken with the tagging of lists and tables, detailed in 14.8.4.8.3 Table structure types (page 740-742) and 14.8.5.7 Table attributes (page 763-765), for accessibility to be implemented. Artifacts such as page headers and footers, including line numbers, need to be tagged as described in 14.8.4.8.7 Artifact structure type (page 744).

stefan123t commented 3 years ago

@samuraitruong you may want to look into the following code for an example implementing a Forms Module for jsPDF.

https://github.com/MrRio/jsPDF/blob/c44b9c1e02fd83e25a226f4e2ff55d20eee83743/src/modules/acroform.js#L187-L197

stefan123t commented 3 years ago

Reconsidering the most basic accessibility requirements, I think it might be best to first implement tagging for the <h1>, <h2>, <h3>, ... headings to build a general outline / reference for the content of a PDF, which is most important to reach individual parts of any document.

stefan123t commented 3 years ago

@HackbrettXXX how should we go about implementing this reduced requirement? I have seen the extensive use of html2canvas in html.js. Is it feasible to extract or annotate the h1, h2, h3, ... headings within the jsPDF code in order to build up such a tagged index for the PDF document?

HackbrettXXX commented 3 years ago

@stefan123t I think that's not easy, if not impossible. The HTML tags don't survive through the canvas API. Maybe we can misuse the ignoreElements callback to know the currently "active" element. I don't know when html2canvas calls this callback, though.

stefan123t commented 3 years ago

@HackbrettXXX thanks for looking into this. Maybe it is not necessary for a tagged PDF to be generated natively. Actually, I double-checked our code and we are only using auto-table and native jsPDF calls, i.e. hopefully no html2canvas is involved here. I'm keeping my fingers crossed, though, that no image conversion requires html2canvas in the background.

@samuraitruong I have dug a bit deeper into the jspdf.js code and found that API.text already does some of the native implementation for text nodes in the PDF.

So the goal would be to surround e.g. one or several text nodes with a Marked Content Sequence as in the above example:

/Span <</MCID 0>>                           %Start of marked-content sequence 0
            BDC
...
            EMC                                     %End of marked-content sequence 0

The references to those Marked Content sequences would need to be kept in two dictionaries:

a structureTree with a default roleMap, holding for each structure element the page number, the structure element type (/S /Chap, /S /Head1, /S /Para), the element ID, a human-readable title/name, any structure parent or child[] elements, and the links to the /MCID elements: e.g. /K 0 for content on the same page, or /K [ 1 <<...>> ] as in the first paragraph (303), whose content spans two pages.

See the example from above

300 0 obj                                           %Structure tree root
    <</Type /StructTreeRoot
        /K [301 0 R                                 %Two children: a chapter
            304 0 R                                 %and a paragraph
        ]
        /RoleMap <</Chap /Sect                      %Mapping to standard structure types
            /Head1 /H
            /Para /P
        >>
        /ClassMap <</Normal 305 0 R>>               %Class map containing one attribute class
        /ParentTree 400 0 R                         %Number tree for parent elements
        /ParentTreeNextKey 2                        %Next key to use in parent tree
        /IDTree 403 0 R                             %Name tree for element identifiers
    >>
endobj

301 0 obj                                           %Structure element for a chapter
    <</Type /StructElem
        /S /Chap
        /ID (Chap1)                                 %Element identifier
        /T (Chapter 1)                              %Human-readable title
        /P 300 0 R                                  %Parent is the structure tree root
        /K [302 0 R                                 %Two children: a section head
            303 0 R                                 %and a paragraph
        ]
    >>
endobj
...
303 0 obj                                           %Structure element for a paragraph
    <</Type /StructElem
        /S /Para
        /ID (Para1)                                 %Element identifier
        /P 301 0 R                                  %Parent is the chapter
        /Pg 101 1 R                                 %Page containing first content item
        /C /Normal                                  %Class containing this element’s attributes
        /K [1                                       %Marked-content sequence 1
            <</Type /MCR                            %Marked-content reference to 2nd item
                /Pg 102 0 R                         %Page containing second item
                /MCID 0                             %Marked-content sequence 0
            >>
        ]
    >>
endobj
...

and

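a parentTree (number tree) that maps the marked-content items back to their parent structure elements, plus an ID tree (name tree) that links the chapters, sections and paragraphs to those structure elements via their identifiers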

400 0 obj                                           %Parent tree (number tree)
...
404 0 obj                                           %ID tree leaf node
    <</Limits [(Chap1) (Sec1.3)]                    %Least and greatest keys in tree
            /Names [(Chap1) 301 0 R                 %Mapping from element identifiers
            (Sec1.1) 302 0 R                        %to structure elements
            (Sec1.2) 303 0 R
            (Sec1.3) 304 0 R
        ]
    >>
endobj

These structure tree and parent tree dictionaries could then be appended to the end of the PDF together with all the other deferred dictionaries, e.g. those written by putFontDict under putResourceDictionary.
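
As a rough illustration of what writing such a deferred dictionary might look like, the sketch below serializes one structure element to PDF object syntax. The function names and the input shape are hypothetical, nothing here is jsPDF API, and all references use generation number 0 (unlike 101 1 R in the example).

```javascript
// Hypothetical serializer; not part of jsPDF. Assumes generation number 0.
const ref = (objNumber) => `${objNumber} 0 R`;

// A kid is either a plain MCID on the element's own page, or a
// marked-content reference (MCR) that names another page explicitly.
function serializeKid(kid, ownPage) {
  if (kid.page === ownPage) return String(kid.mcid);
  return `<</Type /MCR /Pg ${ref(kid.page)} /MCID ${kid.mcid}>>`;
}

function serializeStructElem(elem) {
  const kids = elem.kids.map((kid) => serializeKid(kid, elem.page)).join(" ");
  return [
    `${elem.obj} 0 obj`,
    `<</Type /StructElem /S /${elem.type} /P ${ref(elem.parent)} /Pg ${ref(elem.page)} /K [${kids}]>>`,
    "endobj",
  ].join("\n");
}

// Paragraph 303 from the example: MCID 1 on its own page (object 101)
// and MCID 0 on the second page (object 102).
const para = {
  obj: 303,
  type: "Para",
  parent: 301,
  page: 101,
  kids: [
    { page: 101, mcid: 1 },
    { page: 102, mcid: 0 },
  ],
};
console.log(serializeStructElem(para));
```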

stefan123t commented 3 years ago

@HackbrettXXX Lukas, there is a non-technical German-language article about the reasons for tagging PDFs and how the previous technical example relates to users of screen readers and similar assistive technologies, just in case you are interested:

Lesen, was drinsteht — rausholen, was drinsteckt: Wie blinde Computernutzer sich PDF-Dokumente zugänglich machen 6.3 PDF mit und ohne Tags https://www.barrierefreies-webdesign.de/knowhow/pdf-screenreader/tagged-pdf.html

HackbrettXXX commented 3 years ago

@stefan123t thanks for the tip ;)

Also, your approach looks good. From an API perspective we could add two methods like this:

doc.beginMarkedContent({ title: "Chapter 1", ...})
doc.text(...)
doc.endMarkedContent()

We also need to keep in mind that (AFAIK) PDF 1.3 does not support Tagged PDF (it was introduced in PDF 1.4), so we need an option to set the PDF version.
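
To show what these two calls would have to record, here is a standalone sketch. It does not touch jsPDF at all: TaggedPage is a made-up class that only mimics the proposed method names, and the default tag "Span" is an assumption.

```javascript
// Standalone sketch: it only mimics the proposed calls and is not jsPDF code.
class TaggedPage {
  constructor() {
    this.operators = []; // content-stream operators in output order
    this.marks = []; // one entry per marked-content sequence
    this.open = false;
  }

  beginMarkedContent({ tag = "Span", title } = {}) {
    if (this.open) throw new Error("nested marked content is not handled in this sketch");
    const mcid = this.marks.length; // MCIDs are numbered per page, starting at 0
    this.marks.push({ mcid, tag, title });
    this.operators.push(`/${tag} <</MCID ${mcid}>> BDC`);
    this.open = true;
    return mcid;
  }

  // Stand-in for doc.text(); assumes the string needs no PDF escaping.
  text(string) {
    this.operators.push(`(${string}) Tj`);
  }

  endMarkedContent() {
    if (!this.open) throw new Error("no marked-content sequence is open");
    this.operators.push("EMC");
    this.open = false;
  }
}

const page = new TaggedPage();
page.beginMarkedContent({ title: "Chapter 1" });
page.text("Hello world");
page.endMarkedContent();
console.log(page.operators.join("\n"));
```

The marks array is the part that matters later: it is what a structure tree and parent tree would be built from once the page is finished.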

stefan123t commented 3 years ago

@samuraitruong we might want to reconsider the simple approach. Wrapping the marked content in those method calls is not enough: the structure of the content needs to be maintained in that separate structureTree, starting from its root, with addPart and addSection methods to fill in the structure. addText and addImage would then be extended to call addContentToCurrentSection() or addImageToCurrentSection() together with beginMarkedContent and endMarkedContent. But see for yourself: I found the following reference implementation, which is the core of an accessible PDFBox example that creates an accessible PDF document using Java. I hope that helps you get started on the implementation.

@HackbrettXXX there is also openhtmltopdf, which converts HTML to PDF with PDF accessibility and PDF tagging in mind. Though it is also written in Java, maybe some of the implementation details could be reused in either the jsPDF.html() method or html2canvas, whatever seems fit from your perspective.
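
A very reduced JavaScript sketch of that idea, to illustrate the "current section" bookkeeping. StructureBuilder is hypothetical; the method names loosely follow the ones mentioned above and are not taken from jsPDF or copied from the PDFBox example. It assumes a single page, so one MCID counter is enough.

```javascript
// Hypothetical builder; method names loosely follow the ones mentioned above.
class StructureBuilder {
  constructor() {
    this.root = { type: "Document", children: [] };
    this.current = this.root;
    this.nextMcid = 0; // single-page sketch: one MCID counter
  }

  // Open a new section under the root and make it the current one.
  addSection(title) {
    const section = { type: "Sect", title, children: [] };
    this.root.children.push(section);
    this.current = section;
    return section;
  }

  // Register a content item in the current section and return the MCID
  // that the caller would place in the matching BDC operator.
  addContentToCurrentSection(type) {
    const mcid = this.nextMcid++;
    this.current.children.push({ type, mcid });
    return mcid;
  }
}

const builder = new StructureBuilder();
builder.addSection("Chapter 1");
const headingMcid = builder.addContentToCurrentSection("H1");
const paraMcid = builder.addContentToCurrentSection("P");
console.log(JSON.stringify(builder.root, null, 2));
```

The point of the design is that the structure tree is built alongside the content stream, so each drawing call only needs the MCID handed back by the builder.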

pauldwaite commented 3 years ago

Just as a comparison, PDFKit supports some PDF accessibility stuff, and I found its documentation a decent guide to PDF accessibility in general: http://pdfkit.org/docs/accessibility.html

KurtGokhan commented 2 years ago

Related issue: #461 #3146

I investigated this a bit. I tried adding a "pseudo" element before and after the element to be tagged to effectively wrap and identify the element, but that didn't go well. The problem is that html2canvas does some kind of batching, which renders items out of order. ignoreElements is called in the clone phase, so I don't think it will work either.

So a generalized solution is very hard, or impossible unless jsPDF does its own html rendering or html2canvas source code is changed.

Uzlopak commented 2 years ago

Html2canvas is only "painting". If you have something hidden like title tag of an image, it will not be "painted" to the pdf.

The fromHTML method is deprecated, as it is too much work to implement an HTML parser with all the features of CSS etc. That's why we provide a context2d API, so that the very good html2canvas project does the heavy lifting regarding HTML.

Maybe if html2canvas provides some callback method per element where you could utilize annotations, we could implement some accessibility to jsPDF.

stefan123t commented 2 years ago

@Uzlopak I like your idea of having the support implemented across both projects, i.e. adding a callback in html2canvas, also for hidden stuff, sounds reasonable. Things written/painted through such a callback would have to be painted in invisible ink in case they are hidden or overwritten by something else, much like the text overlay in an OCRed document that can be used to search, select and copy parts of the digital version on top of a canvas showing only the image. So everything with such a callback has to be maintained as a block in html2canvas. Maybe we could hide it with a clipping/bounding box that makes it visually disappear?
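
For the "invisible ink" part, PDF already has text rendering mode 3 (neither fill nor stroke), which is what OCR text layers typically rely on. A minimal sketch of the operators involved, with a made-up helper name and assuming the string needs no escaping:

```javascript
// Hypothetical helper; emits raw content-stream operators, not jsPDF calls.
// The rendering mode is part of the graphics state, so the text object is
// wrapped in q ... Q to keep mode 3 from leaking into later text.
function invisibleText(string, x, y, fontName, fontSize) {
  return [
    "q",
    "BT",
    `/${fontName} ${fontSize} Tf`,
    "3 Tr", // text rendering mode 3: neither fill nor stroke (invisible)
    `${x} ${y} Td`,
    `(${string}) Tj`,
    "ET",
    "Q",
  ].join("\n");
}

console.log(invisibleText("Company logo", 18, 732, "F1", 12));
```

Whether such text should be used for content that is hidden in the HTML is a separate accessibility question; the sketch only shows that no clipping trick is needed to hide it.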

rnewhook586 commented 11 months ago

I would like to bump this up - As per 508 requirements, WCAG 2.1 AA compliance is required for PDFs.

After reading through the thread, I want to give a designer's perspective. Firstly, I think it is best to honor any accessibility configuration found in the HTML. This avoids doubling up on work. I have added comments on images specifically: https://github.com/parallax/jsPDF/issues/2063#issuecomment-1745528446

The idea here is to take pre-existing tags/properties and map them to the PDF export.
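
As a sketch of what such a mapping could look like: the table below is an assumption about which standard structure type fits each HTML element, the fallback to Span is a design choice of this sketch, and describeTag is a made-up name. The input is a plain object so that the example runs without a DOM.

```javascript
// Assumed mapping from HTML element names to standard structure types.
const HTML_TO_STRUCTURE_TYPE = new Map(Object.entries({
  h1: "H1", h2: "H2", h3: "H3", h4: "H4", h5: "H5", h6: "H6",
  p: "P", ul: "L", ol: "L", li: "LI",
  table: "Table", tr: "TR", th: "TH", td: "TD",
  img: "Figure",
}));

// Describe how one element would be tagged. `element` is a plain object
// here ({ tagName, alt }) rather than a real DOM node.
function describeTag(element) {
  const type = HTML_TO_STRUCTURE_TYPE.get(element.tagName.toLowerCase()) ?? "Span";
  const tag = { type };
  // An existing alt text would become the /Alt entry of the structure element.
  if (type === "Figure" && element.alt) tag.alt = element.alt;
  return tag;
}

console.log(describeTag({ tagName: "H2" }));
console.log(describeTag({ tagName: "IMG", alt: "Company logo" }));
```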

Does anyone know if adding tags is possible with this code-base? If not, I am going to move on to an alternative so any suggestions are welcome.

More general info:
WCAG for PDFs specifically: https://www.w3.org/TR/WCAG20-TECHS/pdf
508 compliance: https://www.access-board.gov/ict/#about-the-ict-accessibility-standards
WCAG 2.1: https://www.w3.org/TR/WCAG21/

parallax / jsPDF

Support marked content #3096