pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

I am not sure if this is a bug. #3788

Closed tsuhuai closed 1 month ago

tsuhuai commented 1 month ago

I have a sample PDF. I hope that the 5 lines of interest can be extracted and displayed correctly (please refer to the lines underlined in RED in the attached PNG file).

The sample PDF file can be found here. https://www.nxp.com/testreports/360000002263_CDA_194_ZHM_A_HLGN.pdf

(update sample PDF)

JorjMcKie commented 1 month ago

The attached PDF is different from the attached image! Otherwise, text extraction seems ok - although a lot of weird stuff is extracted too.

This is clearly not an error, and I also see no basis for any "enhancement".

tsuhuai commented 1 month ago

I am talking about text extraction. You will find that 'A194/C194 Cu Alloy' and 'Sample Name' are not extracted on the same line if you look at RED line 2 of the reference image.

JorjMcKie commented 1 month ago

> I am talking about text extraction. You will find that 'A194/C194 Cu Alloy' and 'Sample Name' are not extracted on the same line if you look at RED line 2 of the reference image.

That too is not a bug but a technical peculiarity of MuPDF. You need your own code to recover lines that roughly look like the ones visible on the page.

But there is example code that can be used for this:

import pymupdf
# import a helper method from sister package
from pymupdf4llm.helpers.get_text_lines import get_text_lines

doc = pymupdf.open("test.pdf")
page = doc[0]
text = get_text_lines(page)
print(text)

This produces the following output:

% DSUnknownq1 G1 g0.1 0 0 0.1 9 0 cm0 J 0 j 4 M []0 d1 i0 g313 292 m313 404 325 453 432 529 c478 561 504 597 504 645 c504 736 440 760 391 760 c286 760 271 681 265 626 c265 625 l100 625 l100 828 253 898 381 898 c451 898 679 878 679 650 c679 555 628 499 538 435 c488 399 467 376 467 292 c313 292 lh308 214 170 -164 ref0.44 G1.2 w1 1 0.4 rg287 318 m287 430 299 479 406 555 c451 587 478 623 478 671 c478 762 414 786 365 786 c260 786 245 707 239 652 c239 651 l74 651 l74 854 227 924 355 924 c425 924 653 904 653 676 c653 581 602 525 512 461 c462 425 441 402 441 318 c287 318 lh282 240 170 -164 reBQ

For Question Please
Contact with SGS
www.sgs.com.tw

測試報告

號碼(No.): ETR20A01145   日期(Date): 15-Oct-2020頁數(Page): 1 of 37

Test Report

復盛精密工業股份有限公司 (FUSHENG ELECTRONICS CORPORATION)
新竹縣寶山鄉新竹科學工業園區工業東九路17號 (NO. 17, INDUSTRY E. 9TH RD., SCIENCE PARK HSIN-CHU, TAIWAN, R.O.C.)

以下測試樣品係由申請廠商所提供及確認 (The following sample(s) was/were submitted and identified by/on behalf of
the applicant as):

樣品名稱(Sample Name):COPPER LEAD FRAME (銅合金導線架)
樣品型號(Style/Item No.):A194(C194)

================================================================================

收件日(Sample Receiving Date):08-Oct-2020
測試期間(Testing Period):08-Oct-2020 to 15-Oct-2020

測試需求(Test Requested):       (1)     依據客戶指定,參考RoHS 2011/65/EU Annex II及其修訂指令(EU) 2015/863測試

鎘、鉛、汞、六價鉻、多溴聯苯、多溴聯苯醚, DBP, BBP, DEHP, DIBP。 (As
specified by client, with reference to RoHS 2011/65/EU Annex II and amending
Directive (EU) 2015/863 to determine Cadmium, Lead, Mercury, Cr(VI), PBBs,
PBDEs, DBP, BBP, DEHP, DIBP contents in the submitted sample(s).)

(2)其他測試項目請見下一頁。 (Please refer to next pages for the other item(s).)

測試結果(Test Results):請參閱下一頁 (Please refer to following pages.)
結  論(Conclusion):     (1)     根據客戶所選擇的部位測試,其鎘、鉛、汞、六價鉻、多溴聯苯、多溴聯苯醚, DBP,

BBP, DEHP, DIBP的測試結果符合RoHS 2011/65/EU Annex II暨其修訂指令(EU)
2015/863之限值要求。 (Based on the performed tests on selected part of
submitted sample(s), the test results of Cadmium, Lead, Mercury, Cr(VI), PBBs,
PBDEs, DBP, BBP, DEHP, DIBP comply with the limits as set by RoHS Directive
(EU) 2015/863 amending Annex II to Directive 2011/65/EU.)

PIN CODE: 38B4FE48

This document is issued by the Company subject to its General Conditions of Service printed overleaf, available on request or accessible at https://www.sgs.com.tw/terms-of-service
and, for electronic format documents, subject to Terms and Conditions for Electronic Documents at https://www.sgs.com.tw/terms-of-service. Attention is drawn to the limitation of
liability, indemnification and jurisdiction issues defined therein. Any holder of this document is advised that information contained hereon reflects the Company’s findings at the time of its
intervention only and within the limits of client’s instruction, if any. The Company’s sole responsibility is to its Client and this document does not exonerate parties to a transaction from
exercising all their rights and obligations under the transaction documents. This document cannot be reproduced, except in full, without prior written approval of the Company. Any
unauthorized alteration, forgery or falsification of the content or appearance of this document is unlawful and offenders may be prosecuted to the fullest extent of the law. Unless otherwise
stated the results shown in this test report refer only to the sample(s) tested.

新北市五股區新北產業園區五權七路25 號 t+886(02)2299 3939        f+886(02)2299 3237

SGS Taiwan Ltd. 台灣檢驗科技股份有限公司  25, Wu Chyuan 7th Road, New Taipei Industrial Park, Wu Ku District, New Taipei City, Taiwan

Member of the SGS Group Group
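
For anyone who wants to see what "your own code" could look like without pulling in pymupdf4llm, here is a minimal sketch. It is not how get_text_lines is implemented; it only illustrates the general idea of regrouping words by their vertical position. It relies on page.get_text("words"), which returns tuples starting with (x0, y0, x1, y1, word, ...). The function names and the 3-point tolerance are assumptions made up for this example and would probably need tuning per document.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into visual lines.

    A word joins the current line when its bottom coordinate (y1) is
    within `tolerance` points of the first word that started that line.
    The tolerance value is an assumption, not a PyMuPDF default.
    """
    lines = []  # each entry: (y1 of the line's first word, [word tuples])
    # Walk the words top to bottom, then left to right.
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(w[3] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(w)
        else:
            lines.append((w[3], [w]))
    # Within each line, order words by their left edge and join them.
    return [
        " ".join(w[4] for w in sorted(ws, key=lambda w: w[0]))
        for _, ws in lines
    ]


def page_text_lines(page, tolerance=3.0):
    """Hypothetical convenience wrapper; `page` is expected to be a pymupdf.Page."""
    return group_words_into_lines(page.get_text("words"), tolerance)
```

With a page opened as in the snippet above, print("\n".join(page_text_lines(page))) would print one string per reconstructed line. Words that sit at nearly the same height, such as a label and its value in different text blocks, end up in the same string regardless of the order in which MuPDF emitted them.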