Closed SBhat2615 closed 1 month ago
Don't understand what you did wrong. Here is my script and associated output:
from pathlib import Path
import pymupdf
import pymupdf4llm
doc = pymupdf.open("scansmp.pdf-searchable.pdf")  # open the document so that doc.name is defined
md = pymupdf4llm.to_markdown(doc)
Path(doc.name.replace(".pdf", ".md")).write_bytes(md.encode())
THE SLEREXE COMPANY LIMITED
SAPORS LANE - BOOLE - DORSET - BH 25 8ER
TELEPHONE BOOLE (945 13) 51617 - TELEX 123456
Our Ref. 350/PJC/EAC 18th January, 1972.
Dr. P.N. Cundall,
Mining Surveys Ltd.,
Holroyd Road,
Reading,
Berks.
Dear Pete,
Permit me to introduce you to the facility of facsimile
transmission.
In facsimile a photocell is caused to perform a raster scan over
the subject copy. The variations of print density on the document
cause the photocell to generate an analogous electrical video signal.
This signal is used to modulate a carrier, which is transmitted to a
remote destination over a radio or cable communications link.
At the remote terminal, demodulation reconstructs the video
signal, which is used to modulate the density of print produced by a
printing device. This device is scanning in a raster scan synchronised
with that at the transmitting terminal. As a result, a facsimile
copy of the subject document is produced.
Probably you have uses for this facility in your organisation.
Yours sincerely,
ThA.
P.J. CROSS
Group Leader - Facsimile Research
-----
I had this parameter set:
write_images=True
Why does this affect the output?
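A possible explanation, as far as I understand the pymupdf4llm documentation: with write_images=True, page areas covered by images are written out as image files, and text lying inside those areas is left out of the Markdown text. On a scanned page the image covers the whole page, so the OCR text layer would be dropped. The sketch below is one way to check this on a given file. The helper name compare_write_images and its convert parameter are hypothetical and exist only for this illustration; to_markdown and write_images are the only pymupdf4llm names used.

```python
from typing import Callable, Dict, Optional


def compare_write_images(
    pdf_path: str, convert: Optional[Callable[..., str]] = None
) -> Dict[str, int]:
    """Convert the same PDF twice and report how much text each run produced.

    `convert` is a hypothetical hook for this sketch; by default it is
    pymupdf4llm.to_markdown.
    """
    if convert is None:
        # Imported lazily so the helper can also be tried with a stub converter.
        import pymupdf4llm

        convert = pymupdf4llm.to_markdown

    # Note: with write_images=True, image files are also written to disk.
    with_images = convert(pdf_path, write_images=True)
    without_images = convert(pdf_path, write_images=False)

    # Compare the amount of non-blank Markdown produced by each setting.
    return {
        "write_images=True": len(with_images.strip()),
        "write_images=False": len(without_images.strip()),
    }
```

Calling compare_write_images("scansmp.pdf-searchable.pdf") should show whether the text disappears only when write_images=True. If it does, leaving the parameter at False for scanned documents is the simplest workaround.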
I am trying to parse the scanned document below.
I tried converting the scanned document to a searchable PDF using Tesseract, but still got no result. What is the recommended way to parse such documents?
I am using the latest pymupdf4llm.
scansmpl.pdf scansmpl.pdf-searchable.pdf
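For scanned files without a text layer, one possible approach (a sketch, not an official recommendation) is to check each page for extractable text and fall back to PyMuPDF's Tesseract-based OCR only where none is found. The helper names has_text_layer and extract_page_texts, the 20-character threshold, and the 300 dpi setting are assumptions made for this illustration. The OCR call also assumes that Tesseract language data is installed and can be found by PyMuPDF.

```python
from typing import List


def has_text_layer(page_text: str, min_chars: int = 20) -> bool:
    """Return True if the extracted text has at least min_chars non-blank characters."""
    return len("".join(page_text.split())) >= min_chars


def extract_page_texts(pdf_path: str, language: str = "eng", dpi: int = 300) -> List[str]:
    """Extract text page by page, using OCR only for pages without a text layer."""
    import pymupdf  # imported lazily; requires PyMuPDF to be installed

    texts = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if not has_text_layer(text):
                # OCR the full page; this needs Tesseract language data.
                textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
                text = page.get_text(textpage=textpage)
            texts.append(text)
    return texts
```

A file that was already made searchable should take the first branch on every page, so it is never OCRed a second time.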