pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Form Field not being parsed #2715

Closed harsha-usethread closed 9 months ago

harsha-usethread commented 9 months ago

I am unable to get the form field text. For privacy reasons I have replaced the exact numbers with random ones. Here is what the PDF looks like:

image

Here is what the output ends up looking like:

image

I am using LangChain, so the problem could be in that library, but here is the code that LangChain uses to ingest the PDF file:

def lazy_parse(self, blob: Blob) -> Iterator[Document]:
    """Lazily parse the blob."""
    import fitz

    with blob.as_bytes_io() as file_path:
        doc = fitz.open(file_path)  # open document

        # One Document per page: extracted page text plus document-level metadata
        yield from [
            Document(
                page_content=page.get_text(**self.text_kwargs),
                metadata=dict(
                    {
                        "source": blob.source,
                        "file_path": blob.source,
                        "page": page.number,
                        "total_pages": len(doc),
                    },
                    # keep only metadata entries whose value is a str or an int
                    **{
                        k: doc.metadata[k]
                        for k in doc.metadata
                        if type(doc.metadata[k]) in [str, int]
                    },
                ),
            )
            for page in doc
        ]

Happy to add more details if that would make this easier to understand.

JorjMcKie commented 9 months ago

I am afraid that without the problem file / page at hand, there is no way to say anything about the cause. When you use the word "parse", does that mean you look at the field_value property of the widget?
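For readers unsure what looking at field_value means in practice, here is a minimal sketch of reading form fields directly from the widgets of each page, separately from page.get_text(). The helper names collect_form_fields and load_form_fields and the path argument are made up for illustration. Whether get_text() would also return the field contents for the reporter's file is not established in this thread, so treat this as a way to inspect the fields, not as a diagnosis.

```python
def collect_form_fields(doc):
    """Return (page_number, field_name, field_value) for every form widget.

    `doc` is expected to be an open PyMuPDF document, or any iterable of
    page-like objects that offer `number` and `widgets()`.
    """
    fields = []
    for page in doc:
        # page.widgets() yields the form fields (widgets) found on the page
        for widget in page.widgets():
            fields.append((page.number, widget.field_name, widget.field_value))
    return fields


def load_form_fields(path):
    """Hypothetical convenience wrapper: open the file at `path` and list its fields."""
    import fitz  # PyMuPDF, imported lazily as in the snippet quoted in the issue

    with fitz.open(path) as doc:
        return collect_form_fields(doc)
```

Calling load_form_fields on the affected file and comparing the result with the get_text() output would show whether the values are present as widget data but missing from the extracted text.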

JorjMcKie commented 9 months ago

I am going to close this for now. Please re-open or submit another issue once you can provide a reproducing file.