pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Marked content sequences in text trace dictionary #3489

Closed EluuuArcanum closed 1 month ago

EluuuArcanum commented 1 month ago

Have you considered adding information on the marked content sequences?

This could be one extra key (for example mcid) of the text trace dictionary.

Would it be possible to also add it to the TextPage rawdict for every span?

This would be really useful information for tagged PDF files created by ABBYY FineReader. With it, one could reconstruct the document structure that ABBYY adds to these PDF files, which includes information about paragraphs, lines and table cells.
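
To illustrate the request, here is a minimal sketch of how such a key could be used if it existed. The `mcid` key is hypothetical (it is exactly what this issue proposes and is not something PyMuPDF provides), the span dictionaries are hand-made and only loosely shaped like text trace spans, and `span_text` and `group_by_mcid` are illustrative helper names.

```python
from collections import defaultdict


def make_span(seqno, text, mcid=None):
    """Build a toy span. Only the first item of each "chars" tuple (the
    unicode code point) is meaningful; the rest are placeholders."""
    span = {
        "seqno": seqno,
        "chars": [(ord(c), 0, (0.0, 0.0), (0.0, 0.0, 0.0, 0.0)) for c in text],
    }
    if mcid is not None:
        span["mcid"] = mcid  # hypothetical key, not provided by PyMuPDF
    return span


def span_text(span):
    """Join the unicode code points of a span into a string."""
    return "".join(chr(ch[0]) for ch in span["chars"])


def group_by_mcid(spans):
    """Concatenate span texts per marked content ID, in seqno order.
    Spans without an "mcid" key are collected under None."""
    groups = defaultdict(list)
    for span in sorted(spans, key=lambda s: s["seqno"]):
        groups[span.get("mcid")].append(span_text(span))
    return {mcid: "".join(parts) for mcid, parts in groups.items()}


spans = [
    make_span(0, "Hello ", mcid=0),
    make_span(1, "world", mcid=0),
    make_span(2, "Cell A", mcid=1),
    make_span(3, "Page 1"),  # untagged text, e.g. a page footer
]

for mcid, text in group_by_mcid(spans).items():
    print(mcid, repr(text))
```

Grouping by ID like this would only be a first step; mapping each ID to a paragraph, line or table cell would still require the document's structure tree.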

JorjMcKie commented 1 month ago

To provide this information, it must be extractable by the base library MuPDF, which is currently not the case. So please direct an enhancement request to the MuPDF team. I suggest joining their public Discord channel to discuss this with them directly.

For the moment, I am going to close this as "won't fix" on behalf of PyMuPDF.

EluuuArcanum commented 1 month ago

I received an answer from the MuPDF team. They say that the structure data is available in the new version of MuPDF. The device now has begin_structure and end_structure properties.
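
As a rough illustration of how paired begin/end hooks could be used, here is a toy sketch in plain Python. Only the names begin_structure and end_structure come from the MuPDF team's answer; the `StructureTracker` class, the single-argument signatures and the simplified `fill_text` method are assumptions made for this example and do not reflect MuPDF's actual device interface.

```python
class StructureTracker:
    """Toy stand-in for a device: keeps a stack of open structure
    elements and labels every piece of text with the current path."""

    def __init__(self):
        self._stack = []   # currently open structure tags, outermost first
        self.tagged = []   # (structure path, text) pairs in arrival order

    def begin_structure(self, tag):
        # Assumed signature: the real callback likely takes more arguments.
        self._stack.append(tag)

    def end_structure(self):
        self._stack.pop()

    def fill_text(self, text):
        # Simplified: a real device would receive a text object, not a str.
        self.tagged.append(("/".join(self._stack), text))


tracker = StructureTracker()

# Simulated call sequence for one table cell followed by one paragraph.
tracker.begin_structure("Table")
tracker.begin_structure("TR")
tracker.begin_structure("TD")
tracker.fill_text("Cell A")
tracker.end_structure()
tracker.end_structure()
tracker.end_structure()
tracker.begin_structure("P")
tracker.fill_text("A paragraph.")
tracker.end_structure()

for path, text in tracker.tagged:
    print(path, "->", text)
```

The stack is what lets text inside nested elements, such as a cell inside a row inside a table, carry its full structural context.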

JorjMcKie commented 1 month ago

> I received the answer from the mupdf team. They say that the structure data is available in the new version of mupdf. The device now has begin_structure and end_structure properties.

That was quick! Thanks for letting me know. I looked at the device again after a long pause and can confirm that this info is now available there. However, whether we can implement this any time soon is an entirely different question ...