EluuuArcanum closed this 1 month ago
To provide this information, it must be extractable by the base library MuPDF, which is currently not the case. So please direct an enhancement request to the MuPDF team. I suggest joining their public Discord channel to discuss it with them directly.
For the moment, I am going to close this as "won't fix" on behalf of PyMuPDF.
I received an answer from the MuPDF team. They say that the structure data is available in the new version of MuPDF. The device now has begin_structure and end_structure properties.
That was quick! Thanks for letting me know. I looked at the device again after a long pause and can confirm that this info is now available there. However, whether we can implement this any time soon is an entirely different question ...
Have you considered adding information on the marked content sequences?
This could be one extra key (for example mcid) of the text trace dictionary.
Would it be possible to also add it to the TextPage rawdict for every span?
It would be really useful information for tagged PDF files created by ABBYY FineReader. This way one could reconstruct the document structure ABBYY adds to these PDF files. This contains information about paragraphs, lines and table cells as well.
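As a rough sketch of how such a key could be used: the snippet below groups span dictionaries by a hypothetical `mcid` key. The key name, the helper `group_spans_by_mcid` and the hand-written sample spans are all assumptions made for illustration; none of them is an existing PyMuPDF API.

```python
from collections import defaultdict


def group_spans_by_mcid(spans, key="mcid"):
    """Group span dictionaries by their marked-content id.

    `key` is a hypothetical dictionary key (an assumption, not part of
    PyMuPDF). Spans that lack the key are collected under None.
    """
    groups = defaultdict(list)
    for span in spans:
        groups[span.get(key)].append(span)
    return dict(groups)


# Hand-written sample data shaped like heavily simplified span dictionaries.
sample_spans = [
    {"text": "Hello", "mcid": 0},
    {"text": "world", "mcid": 0},
    {"text": "Cell A1", "mcid": 1},
    {"text": "page header"},  # no marked-content id on this span
]

for mcid, members in group_spans_by_mcid(sample_spans).items():
    print(mcid, [s["text"] for s in members])
```

With the spans grouped like this, each group could then be matched against the entries of the structure tree to recover paragraphs, lines and table cells.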