pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Embedded full-text search index #3476

Closed ZupoLlask closed 1 month ago

ZupoLlask commented 1 month ago

Dear Community,

As I haven't found information on this, can you let me know if PyMuPDF supports full-text search index embedding?

This is super useful for speeding up searches in large PDF documents.

Thank you!

JorjMcKie commented 1 month ago

Sorry, we do not support this - obviously paid-for - feature, and there are currently no plans to do so either.

ZupoLlask commented 1 month ago

Thanks for the quick feedback!

Just to be clear: you also don't support this in your commercial offering, right?

JorjMcKie commented 1 month ago

Consider this: That Adobe solution obviously modifies the PDF permanently by storing keyword search results somewhere inside it for later retrieval.

Upon later retrieval you again need an application that knows

  1. how to sense that extended search information is present in the PDF at all,
  2. where to find it and
  3. how to interpret these data and use them in answering user inquiries.

If I am not missing something basic here, this sounds like a solution built specifically around that one application. At least in the official PDF specification there is no such thing as an "Embedded Index".

What I mean is: open such a PDF in a non-Adobe PDF viewer and I would be amazed if it can use such an embedded index. Only at that point things may become interesting.

ZupoLlask commented 1 month ago

Dear Jorj,

I totally agree with your reasoning.

Any application that makes use of such an index has either reverse-engineered the proprietary data structure and its features, or embeds a different proprietary search index of its own under /PieceInfo, which will likewise be usable only by some specific application.
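
For illustration only, here is a minimal sketch of how one could check whether a PDF carries a /PieceInfo entry in its catalog at all, using PyMuPDF's low-level xref access. The helper name `find_piece_info` is hypothetical, and looking only at the catalog is an assumption: application-private data may also hang off individual pages. This covers just the "sense it" and "find it" steps; interpreting whatever is stored there remains specific to the application that wrote it.

```python
from typing import Any, Optional, Tuple


def find_piece_info(doc: Any) -> Optional[Tuple[str, str]]:
    """Return (type, value) of the catalog's /PieceInfo entry, or None if absent.

    `doc` is expected to be an open PyMuPDF Document (hypothetical helper,
    not part of PyMuPDF itself).
    """
    if not doc.is_pdf:
        return None  # non-PDF documents have no PDF catalog
    catalog_xref = doc.pdf_catalog()  # xref number of the catalog (root) object
    kind, value = doc.xref_get_key(catalog_xref, "PieceInfo")
    if kind == "null":  # a missing key is reported as ("null", "null")
        return None
    return kind, value


# Usage, assuming PyMuPDF is installed and "input.pdf" is a placeholder path
# (older PyMuPDF releases use `import fitz` instead):
#   import pymupdf
#   with pymupdf.open("input.pdf") as doc:
#       print(find_piece_info(doc))
```

A non-None result only says that some application stored private data in the file; it does not say that the data is a search index, nor how to read it.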

Thanks for your time and for your clarification.