pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

page.links return all links with same xref, is it something possible ?? #3563

Closed Flint-company closed 2 weeks ago

Flint-company commented 3 weeks ago

I'm very surprised: I analyzed a PDF to get all of its links, and the result is a list of link dicts that all have the same "xref". Is there a way to delete these links even though they all have the same xref? Thanks

    import pymupdf

    doc = pymupdf.open("resume.pdf")  # placeholder file name
    text, links = "", []
    for page in doc.pages():
        print(f"links : {page.get_links()}")
        text += page.get_text().lower()
        links.extend([x['uri'].lower() for x in page.links(kinds=[pymupdf.LINK_URI])])
=>
    Links : [
        {'kind': 2, 'xref': 0, 'from': Rect(248.3167724609375, 174.28570556640625, 279.57891845703125, 183.88568115234375), 'uri': 'https://XXXXXs'},
        {'kind': 2, 'xref': 0, 'from': Rect(410.7126770019531, 37.824951171875, 496.1625671386719, 46.2249755859375), 'uri': 'mailto:XXXXXs'},
        {'kind': 2, 'xref': 0, 'from': Rect(410.7126770019531, 55.4649658203125, 456.3170166015625, 63.86492919921875), 'uri': 'https://XXXXXs/'},
        {'kind': 2, 'xref': 0, 'from': Rect(238.4034881591797, 699.2244873046875, 260.1038818359375, 708.824462890625), 'uri': 'https://XXXXXs'},
        {'kind': 2, 'xref': 0, 'from': Rect(267.0658264160156, 699.2244873046875, 284.9347229003906, 708.824462890625), 'uri': 'XXXXXs'},
        {'kind': 2, 'xref': 0, 'from': Rect(291.89666748046875, 699.2244873046875, 336.628173828125, 708.824462890625), 'uri': 'https://XXXXXsahier-des-charge'},
        {'kind': 2, 'xref': 0, 'from': Rect(343.5901184082031, 699.2244873046875, 412.5794982910156, 708.824462890625), 'uri': 'https://XXXXXsonvertir'},
        {'kind': 2, 'xref': 0, 'from': Rect(419.54144287109375, 699.2244873046875, 479.58209228515625, 708.824462890625), 'uri': 'https://XXXXXs/'},
    ]

JorjMcKie commented 3 weeks ago

Please provide a reproducing example. So far your post leads to nothing actionable.

Flint-company commented 3 weeks ago

Please provide a reproducing example. So far your post leads to nothing actionable.

That is hard to do, since it's the resume of a real person and contains personal data... Do you have a workaround that would let me provide the example?

JorjMcKie commented 2 weeks ago

Your PDF obviously has a problem which we should intercept and handle in a better way. So, no: we need a reproducer to confirm that we guessed the right cause. But you can use my private email for the submission so it won't be exposed to the public. Otherwise this post will never become a bug report ...

JorjMcKie commented 2 weeks ago

The example PDF shared with me violates the specifications for links / annotations:

Instead of giving indirect references, as it should, it provides all the links directly in the /Annots array. In other words, it should look like /Annots [4711 0 R 4712 0 R ...]. Instead we find:

/Annots [ <<
        /Type /Annot
        /Subtype /Link
        /Rect [ 248.31678 596.1143 279.57893 605.7143 ]
        /Border [ 0 0 0 ]
        /A <<
          /Type /Action
          /S /URI
          /URI (https://alexialabbe.fr/#projects)
        >>
      >> <<
        /Type /Annot
        /Subtype /Link
        /Rect [ 238.40349 71.17554 260.10389 80.775539 ]
        /Border [ 0 0 0 ]
        /A <<
          /Type /Action
          /S /URI
          /URI (https://blog.codein.fr/guide-rgpd-les-pratiques-essentielles-pour-assurer-la-conformite-de-votre-site-web)
        >>
      >> 
... ]
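
To check other files for the same layout, here is a minimal sketch that classifies the source text of an /Annots array. It assumes you obtain that text yourself, for example via doc.xref_get_key(page.xref, "Annots"), which as far as I know returns a (type, value) pair. The helper name annots_all_indirect is made up for this illustration and is not part of PyMuPDF.

```python
import re

# An array made up only of indirect references, e.g. "[4711 0 R 4712 0 R]".
_INDIRECT_ARRAY = re.compile(r"\[\s*(?:\d+\s+\d+\s+R\s*)*\]")


def annots_all_indirect(annots_source: str) -> bool:
    """Return True if the /Annots array text holds only 'N G R' references.

    Hypothetical helper: it only looks at the text of the array and does
    not parse PDF syntax in general.
    """
    return _INDIRECT_ARRAY.fullmatch(annots_source.strip()) is not None


# Usage sketch (assumes an open pymupdf document named doc):
#   kind, value = doc.xref_get_key(page.xref, "Annots")
#   if kind == "array" and not annots_all_indirect(value):
#       print("page", page.number, "has inline annotation dictionaries")
```

Note that /Annots may itself be an indirect reference to an array stored elsewhere; the sketch does not follow that case.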

So PyMuPDF does recognize the links, but cannot assign an xref to them (hence xref=0). In such a situation you cannot update or delete links using the normal PyMuPDF API (delete_link etc.) - no way. But you can edit the page's object definition source using the low-level API and remove everything by deleting the whole /Annots array. This will remove everything (!!!): links, annotations and fields that may be on the page.

doc.xref_set_key(5, "Annots", "null")  # 5 = page xref
print(doc.xref_object(5))  # show the page's object definition afterwards

<<
  /Type /Page
  /Parent 1 0 R
  /MediaBox [ 0 0 540 780 ]
  /Contents 134 0 R
  /Resources <<
    /ExtGState <<
      /Alpha0 10 0 R
      /Alpha1 11 0 R
    >>
    /Font <<
      /Font4 14 0 R
      /Font11 21 0 R
      /Font12 22 0 R
      /Font5 15 0 R
    >>
  >>
  /Annots null
  /Group <<
    /S /Transparency
    /CS /DeviceRGB
  >>
>>

All links are gone!
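
The same idea can be applied to a whole document instead of one hard-coded page xref. The following is only a sketch: the function name is made up, and it assumes that doc behaves like a pymupdf Document (iterable over pages, with xref_set_key) and that every page offers get_links() and an xref attribute. It touches only pages that report at least one link with xref == 0, but on those pages it still removes all annotations and fields, not just the links.

```python
def strip_pages_with_unaddressable_links(doc):
    """Set /Annots to null on every page reporting a link with xref == 0.

    Hypothetical helper; `doc` is expected to behave like a
    pymupdf.Document. Returns the xrefs of the pages that were modified.
    """
    # Collect the affected pages first, then modify the page objects.
    targets = [
        page.xref
        for page in doc
        if any(link.get("xref") == 0 for link in page.get_links())
    ]
    for xref in targets:
        # Removes links, annotations and form fields on that page alike.
        doc.xref_set_key(xref, "Annots", "null")
    return targets


# Usage sketch (file names are placeholders):
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   print(strip_pages_with_unaddressable_links(doc))
#   doc.ez_save("output.pdf")
```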

JorjMcKie commented 2 weeks ago

BTW the example page looks exactly the same, but all hot areas are gone. Also, the file size (when saving via ez_save()) goes down to 44KB (was 1 MB before).

Flint-company commented 2 weeks ago

Thanks Jorj !!