pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Parsing complete scanned document #115

Closed SBhat2615 closed 1 month ago

SBhat2615 commented 1 month ago

I am trying to parse the scanned document attached below.

I converted the scan to a searchable PDF using Tesseract, but I still get no result. What is the recommended way to parse such documents?

I am using the latest pymupdf4llm.

scansmpl.pdf scansmpl.pdf-searchable.pdf
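
Before converting, it can help to confirm that the searchable PDF really carries a text layer. The following is only a minimal sketch: the helper names `pages_without_text` and `extract_page_texts` are made up for illustration, and it assumes a PyMuPDF version recent enough to be imported as `pymupdf`.

```python
def pages_without_text(page_texts):
    """Return 0-based indices of pages whose extracted text is empty or whitespace."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]


def extract_page_texts(pdf_path):
    """Return the plain text of every page (requires the pymupdf package)."""
    import pymupdf  # imported lazily so the helper above works without it

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Calling `pages_without_text(extract_page_texts("scansmpl.pdf-searchable.pdf"))` should return an empty list if the OCR step added text to every page. Any index it does return points to a page that still has no extractable text.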

JorjMcKie commented 1 month ago

I don't understand what you did wrong. Here is my script and its output:

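from pathlib import Path
import pymupdf4llm

filename = "scansmp.pdf-searchable.pdf"

# Convert the searchable (OCR'ed) PDF to Markdown text.
md = pymupdf4llm.to_markdown(filename)

# Write the result next to the input, changing only the final ".pdf" suffix.
Path(filename).with_suffix(".md").write_bytes(md.encode())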

        THE SLEREXE COMPANY LIMITED
                    SAPORS LANE - BOOLE - DORSET - BH 25 8ER
                         TELEPHONE BOOLE (945 13) 51617 - TELEX 123456
          Our Ref. 350/PJC/EAC 18th January, 1972.
          Dr. P.N. Cundall,
         Mining Surveys Ltd.,
          Holroyd Road,
          Reading,
          Berks.
           Dear Pete,
            Permit me to introduce you to the facility of facsimile
          transmission.
             In facsimile a photocell is caused to perform a raster scan over
          the subject copy. The variations of print density on the document
          cause the photocell to generate an analogous electrical video signal.
          This signal is used to modulate a carrier, which is transmitted to a
          remote destination over a radio or cable communications link.
            At the remote terminal, demodulation reconstructs the video
          signal, which is used to modulate the density of print produced by a
         printing device. This device is scanning in a raster scan synchronised
         with that at the transmitting terminal. As a result, a facsimile
          copy of the subject document is produced.
            Probably you have uses for this facility in your organisation.
                               Yours sincerely,
           ThA.
                                 P.J. CROSS
                              Group Leader - Facsimile Research

-----
SBhat2615 commented 1 month ago

I had this parameter set:

write_images=True

Why does this affect the output?
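
To see the difference directly, the same file can be converted once with and once without the parameter. This is a rough sketch rather than a diagnosis: `markdown_path` and `compare_write_images` are hypothetical helper names, and only `pymupdf4llm.to_markdown` and its `write_images` argument come from the library.

```python
from pathlib import Path


def markdown_path(pdf_path, tag):
    """Return the output path: final suffix replaced by '.<tag>.md'."""
    return Path(pdf_path).with_suffix(f".{tag}.md")


def compare_write_images(pdf_path):
    """Convert the PDF with both settings and return the Markdown length for each."""
    import pymupdf4llm  # imported lazily; requires the pymupdf4llm package

    lengths = {}
    for tag, flag in (("text-only", False), ("with-images", True)):
        # With write_images=True, image files are also written to disk.
        md = pymupdf4llm.to_markdown(pdf_path, write_images=flag)
        markdown_path(pdf_path, tag).write_bytes(md.encode())
        lengths[tag] = len(md)
    return lengths
```

Comparing the two Markdown files for the searchable PDF shows which text, if any, disappears when `write_images=True` is set.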