nvaccess / nvda

NVDA, the free and open source Screen Reader for Microsoft Windows

Enhance math support for PDF by supporting math in associated files #9288

Open NSoiffer opened 5 years ago

NSoiffer commented 5 years ago

Feature Request

Many years ago, NVDA added support for reading math in PDF documents. Unfortunately, the mechanism that PDF described for adding MathML to a PDF is difficult for software to generate, so other than test documents and sample hand tagging, there are not many PDFs around that tag the math in this manner.

In PDF v2 (ISO 32000-2), a much simpler method was added to tag math: associated files. This loses a little functionality (synchronized highlighting becomes much harder for AT that wants to do that), but it makes it much easier to add MathML. This request is for NVDA to add to its existing MathML functionality the ability to get the math from the associated file.

Because NVDA already has code to pass MathML to an application that can braille it or produce speech for it (e.g., MathPlayer), the work required here is to additionally look in the associated file for MathML. Sadly, Adobe has not updated their accessibility interface to v2, so getting that info requires diving into (I think) the PDSEdit layer. Doing so is not rocket science, but it is obviously more work than a few more PDomNode calls.
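As a rough illustration of "additionally look in the associated file", the lookup can be thought of as an ordered list of sources, with the existing tag-based path tried first and the associated file as a fallback. This is only a sketch: the node is modelled as a plain dict, and `first_mathml`, `from_structure_tags`, `from_associated_file`, and the dict keys are all hypothetical names, not NVDA or Acrobat APIs.

```python
from typing import Callable, Dict, Iterable, Optional

# A "source" takes a node and returns MathML text, or None if it has none.
MathMlSource = Callable[[Dict[str, str]], Optional[str]]


def first_mathml(node: Dict[str, str], sources: Iterable[MathMlSource]) -> Optional[str]:
    """Return the MathML from the first source that yields a non-empty result."""
    for source in sources:
        mathml = source(node)
        if mathml:
            return mathml
    return None


# Hypothetical stand-ins for the two lookup paths discussed in this issue.
def from_structure_tags(node: Dict[str, str]) -> Optional[str]:
    # Existing approach: MathML stored in the tagged structure itself.
    return node.get("tagged_mathml")


def from_associated_file(node: Dict[str, str]) -> Optional[str]:
    # Requested approach: MathML stored in an associated file.
    return node.get("af_mathml")


SOURCES = [from_structure_tags, from_associated_file]
```

With this ordering, PDFs tagged the old way keep working unchanged, and PDFs that only carry an associated file are picked up by the second source.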

PDF Details

Spec

Section 14.13 of the ISO 32000-2 spec discusses associated files. Here are some relevant quotes from the spec:

Associated files provide a means to associate content in other formats with objects of a PDF file and to identify the relationship between them. Such associated files are designated using file specification dictionaries (see 7.11.3, "File specification dictionaries"), and AF keys are used in object dictionaries to connect the associated file’s specification dictionaries with those objects. For associated files, their associated file specification dictionaries should include the AFRelationship key indicating one of several possible relationships that the file has to the associated PDF object. The file specification for an associated file represents either a file external to the PDF file or an embedded file stream (see 7.11.4, "Embedded file streams") within the PDF file.

It should always be the case that the MathML is an embedded file stream, not an external file.

...the resulting PDF document might contain the following embedded files: ...MathML version of the equation embedded with an AFRelationship value Supplement, and associated using a structure element or a form XObject depending on how the equation is rendered in the page’s content stream.

14.13.6 Associated files linked to structure elements

One or more files may be associated with structure elements (see 14.7.2, "Structure hierarchy") to accommodate content that spans pages such as in an article, section or table, in which cases logical structural elements should be used to make an association with files. This entry represents the associated files for the entire structure element. To associate files with structure elements, the structure element dictionary shall contain an AF entry which represents the associated files for that structure element. The relationship that the associated files have to the structure element is supplied by the AFRelationship key in each file specification dictionary.
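To make the quoted structure concrete, here is a minimal sketch of the lookup on an in-memory model of a structure element. It assumes the PDF objects have already been parsed into plain Python dicts, that the decoded bytes of an embedded file stream sit under a made-up `data` key, and that the stream's Subtype carries the MathML media type `application/mathml+xml`. The function name `find_mathml_in_af` is hypothetical; this is not the Acrobat, Foxit, or NVDA API.

```python
from typing import Any, Dict, Optional

# Media type expected on the embedded file stream (assumption of this sketch).
MATHML_SUBTYPES = {"application/mathml+xml"}


def find_mathml_in_af(struct_elem: Dict[str, Any]) -> Optional[str]:
    """Return MathML text from the first embedded MathML associated file, or None."""
    af = struct_elem.get("AF", [])
    if isinstance(af, dict):
        # Be lenient if a single file specification was given instead of an array.
        af = [af]
    for filespec in af:
        # EF holds embedded file streams; a file spec without it points to an external file.
        embedded = filespec.get("EF", {}).get("F")
        if embedded is None:
            continue
        if embedded.get("Subtype") not in MATHML_SUBTYPES:
            continue
        # AFRelationship (e.g. Supplement) is not checked here; a stricter
        # consumer could filter on it as well.
        return embedded.get("data", b"").decode("utf-8")
    return None


# Toy structure element with one embedded MathML associated file.
formula = {
    "S": "Formula",
    "AF": [
        {
            "Type": "Filespec",
            "F": "eq1.mml",
            "AFRelationship": "Supplement",
            "EF": {
                "F": {
                    "Subtype": "application/mathml+xml",
                    "data": b"<math><mi>x</mi></math>",
                }
            },
        }
    ],
}
```

Calling `find_mathml_in_af(formula)` returns the MathML string, while a structure element with no AF entry, or with only external files, yields `None`.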

Other potential places in the spec for info:

Acrobat API

The overview of the Acrobat API is found here. I believe the relevant interface to access is PDSElement. This provides access to the structure tree. Potentially the COS layer is involved to access the dictionary structure.

Since I was looking, it might save someone a minute to know that the MathML code for Acrobat is in NVDAObjects/IAccessible/adobeAcrobat.py.

Adriani90 commented 1 year ago

cc: @michaelDCurran and @seanbudd in case there are plans to improve math reading with NVDA, yet another use case.

NSoiffer commented 1 year ago

FYI: an update...

There is a project that Adobe has funded for the last few years to get pdftex to produce tagged PDF, including MathML in an associated file. This is being done by rewriting the core of the main TeX implementation to pass structural information through to the stage where the PDF gets generated. The latest I saw was that the math part is to be worked on in March 2024. It is among the last things that they are doing. That will produce a lot of PDFs with MathML in them. Their goal is to be nearly 100% backwards compatible, so old PDFs just need to be regenerated from unmodified TeX/LaTeX to get well-tagged PDF.

AFAIK, Adobe has yet to update their API. However, I have been talking with Foxit, and they are working on providing access to the associated file in their PDF viewer. They showed me an alpha version that handles associated files. If they implement my suggestion of changing the role from ROLE_SYSTEM_TEXT to ROLE_SYSTEM_EQUATION in that version, I think it will take 5-10 lines of code in _getNodeMathMl in adobeAcrobat.py to get NVDA to read the math.

If that works out, I hope Adobe follows their lead and does it the same way. No interface change and minimal NVDA changes and math in PDF becomes accessible. Fingers crossed...

NSoiffer commented 11 months ago

Another update: it turns out it was four lines of code to make this work. It would have been three, but there is a bug in what they did so I need to do a bit of surgery on the MathML they generated; for readability, I split it onto another line.

If Foxit decides this is an approach they like, I'll try to find some people at Adobe and see if they would be willing to expose the associated file in the same way. If so, then I'll do a PR.

In case it wasn't clear: this PR would work for MathPlayer, MathCAT, Access8Math, and any other math provider.

davidcarlisle commented 8 months ago

There are several PDF files demonstrating Associated MathML file tagging at the LaTeX Project page:

https://github.com/latex3/tagging-project/discussions/56