py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/
Other
8.43k stars 1.42k forks source link

BUG: Issue in text extraction (spaces) (#1153) #2882

Closed ssjkamei closed 1 month ago

ssjkamei commented 1 month ago

Closes #1153

The change if abs(moved_height) > 0.8 * f: at line 129 of the crlf_space_check function changes the function to look at both the bottom and top misalignment, but if you prefer not to change it here, I commit the reverted code.

The test for hello-world.pdf fails. I think a hidden bug has surfaced, but I don't seem to have read enough of the documentation to determine how to address it from within the PDF 1.7 specification.

I am judging by the position specified by Td and Tm, but perhaps I am judging both as if they were absolute coordinates, causing a misalignment, but I don't know how to correct it correctly.

Target String: สวัสดีชาวโลก iss1153_2024-09-28

ssjkamei commented 1 month ago

I forgot one more thing. I don't understand what the 15 specified here means after all.

https://github.com/py-pdf/pypdf/blob/c8220c6443ff94955b2cef6bf3f9411e51c2018e/pypdf/_text_extraction/__init__.py#L136

pubpub-zz commented 1 month ago

I forgot one more thing. I don't understand what the 15 specified here means after all.

https://github.com/py-pdf/pypdf/blob/c8220c6443ff94955b2cef6bf3f9411e51c2018e/pypdf/_text_extraction/__init__.py#L136

this is a long time ago and I must admit that I cannot remember why this value. will try to gather my memories

ssjkamei commented 1 month ago

Sorry, I added the code for CIDFont, but it seems to need mapping to cmap[1].

Also, I am calculating the font each time, which is very inefficient.

It doesn't seem to support vertical type characters, but since it seems to be from the original, I don't intend to include that in this PR.

ssjkamei commented 1 month ago

CIDFont is supported based on the PDF in the following Issue.

Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336 I think we can close #2336 as well.

The following will remain, but the spaces are moving correctly (because the letters are shaved, the distance traveled is greater than the letters, and the spaces are showing)

Expected: R. Ottokar Doerffel, 1489, Atiradores Actual: R. O okar Doerff el, 1489, Ar adores

This is probably analogous to the following problem. Ligature issue when converting PDF to text #1351

Dealing with the calculation of fonts each time will be revised later.

ssjkamei commented 1 month ago

Dealing with the calculation of fonts each time will be revised later.

I have divided the functions and put them together. The code still calculates each time. Do you have any good ideas on how to store the data? I have created a function called build_font_width_map, which should only need to be called once when checking space. However, it is recalculated every time Tj in _page.py is called.

I thought about adding it to the cmap, but I thought this would be a lot of changes and you all would not like it.

Also, I don't know how to fix the following two points.

  1. The number of spaces seems to be partially different in layout mode. Is this acceptable?

  2. I get the following error, but I don't know exactly how to fix it.

pypdf/_cmap.py:464: error: No overload variant of “list” matches argument type “PdfObject” [call-overload].
pypdf/_cmap.py:464: note: Possible overload variants: “PdfObject” [call-overload].
pypdf/_cmap.py:464: note: def [_T] __init__(self) -> list[_T]
pypdf/_cmap.py:464: note: def [_T] __init__(self, Iterable[_T], /) -> list[_T]
Found 2 errors in 1 file (checked 51 source files)

I have not yet been able to see the following properly, but it seems to have been fixed.

I am judging by the position specified by Td and Tm, but perhaps I am judging both as if they were absolute coordinates, > > > causing a misalignment, but I don't know how to correct it correctly.

Target String: สวัสดีชาวโลก

ssjkamei commented 1 month ago

Sorry for the continuous posting. There is another test case that had an error, which is failing due to a space after the 1. I would like the file for this one to determine if the original is correct, etc. How can I get it? filename: unixxx_glyphs.pdf error: 1′′.2 != 1 ′′.2

    @pytest.mark.enable_socket()
    def test_unixxx_glyphs():
        reader = PdfReader(BytesIO(get_data_from_url(name="unixxx_glyphs.pdf")))
        txt = reader.pages[0].extract_text()  # no error
        for pat in ("闫耀庭", "龚龑", "张江水", "1′′.2"):
>           assert pat in txt
ssjkamei commented 1 month ago

I have divided the functions and put them together. The code still calculates each time. Do you have any good ideas on how to store the data? I have created a function called build_font_width_map, which should only need to be called once when checking space. However, it is recalculated every time Tj in _page.py is called.

I don't know if it's good, but I made the correction. As for the other issues, they seem difficult for me to fix.

I think I can fix the Check code style issues part if you give me some help.

I think the corrected part has been changed to a process closer to the PDF 1.7 specification.

stefan6419846 commented 1 month ago

Thanks for the PR.

As for the other issues, they seem difficult for me to fix.

I think I can fix the Check code style issues part if you give me some help.

I have added a corresponding review comment for the code style part.

The tests itself are failing due to your changes to the whitespace handling and thus differing output. It the new output looks good/better, you can update the expected outputs, otherwise we might have to review where your changes introduce an unwanted behavior.

ssjkamei commented 1 month ago

I don't know how to get the PDF. The corresponding PDF could not be downloaded when cloning the repository.

test:test_unixxx_glyphs

reader = PdfReader(BytesIO(get_data_from_url(name="unixxx_glyphs.pdf")))

test:test_text_extraction_layout_mode

SAMPLE_ROOT / "026-latex-multicolumn/multicolumn.pdf", SAMPLE_ROOT / "010-pdflatex-forms/pdflatex-forms.pdf",

ssjkamei commented 1 month ago

Perhaps the same file was found on the Internet. https://www.aanda.org/articles/aa/pdf/2022/03/aa42891-21.pdf

I am assuming that this is the same file, but in my local environment, it appears to be displayed as 1.′′2 (the dots are reversed). The read distance is also not nearly wrong, and there are no spaces in this. I'm not sure if the files are different or if it depends on the environment, but would you be able to reproduce this? If anyone can reproduce it, I would ask for confirmation from someone who can.

Also, is the output of 1′′.2 correct, although it is output like a ligature in the first place? However, when I checked the original HTML literature that is not PDF, it seemed to be described as 1′′.2.

Any information about other files would be appreciated.

stefan6419846 commented 1 month ago

pypdf uses four different approaches for test files:

  1. Files from the resources directory. Should be obvious from the path inside the test.
  2. Files from the sample-files directory/repository. Should be obvious from the path inside the test.
  3. Files downloaded on the fly during the test. These tests contain a download URL.
  4. Files downloaded before running the tests. They are referenced by their name, for example get_data_from_url(name="iss2077.pdf"). Their URLs are available at https://github.com/py-pdf/pypdf/blob/main/tests/example_files.yaml

In theory, you should not have to care about this however, when you did a regular local development setup for pypdf, including pulling the submodule and downloading the required file set beforehand. This is documented at https://pypdf.readthedocs.io/en/stable/dev/intro.html and https://pypdf.readthedocs.io/en/stable/dev/testing.html

ssjkamei commented 1 month ago

Thanks, that's very helpful. I will download the file and check it out.

ssjkamei commented 1 month ago

I was able to fix it! I think it fixed the whitespace problem in many PDFs.

@stefan6419846 Thanks a lot! @pubpub-zz Thanks for checking! There is a problem with the mysterious number that multiplies by 15, but the code has been changed to assume even individual font sizes that were not seen before. So I don`t feel there is much to worry about.

Please review the code!

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 99.13793% with 1 line in your changes missing coverage. Please review.

Project coverage is 96.36%. Comparing base (8e1799e) to head (03eb1cb). Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pypdf/_text_extraction/__init__.py 95.65% 0 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2882 +/- ## ========================================== + Coverage 96.27% 96.36% +0.09% ========================================== Files 52 52 Lines 8689 8721 +32 Branches 1733 1723 -10 ========================================== + Hits 8365 8404 +39 + Misses 187 184 -3 + Partials 137 133 -4 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

stefan6419846 commented 1 month ago

Thanks for your PR. I have left some remarks I stumbled upon, but will wait for @pubpub-zz as well to have a look at this.

ssjkamei commented 1 month ago

I don't know how to handle PDFObject, but I would like some help. Most of the old code is not taken into account and can be fixed if you get one pattern.

Are there any rules that should take over for casting PdfObjects? If I refer to _data_structures.py, does it have all the types listed and do I just check all the types that can be cast? Or is there a document that can determine what types /W and /Widths can have?

I would be grateful for tips on how to get started.

ssjkamei commented 1 month ago

I noticed that when creating a combination of character map and width length, there is a pattern where the value after character map conversion is not represented by a single character. Current tests pass, but this is probably a bug.

To fix this, we need to change from checking the length of the string length after conversion, which is what we are doing in this fix, to checking the length of the string length before conversion. In conjunction with this, I think including width in the contents of the cmap will speed up the process and save memory.

We do not plan to include it in this modification along with the vertical character support (for /W2 and /DW2).

stefan6419846 commented 1 month ago

Or is there a document that can determine what types /W and /Widths can have?

This is documented inside the PDF reference and mapped into actual classes by pypdf. As a general rule of thumb (at least for new code): Besides the expected types, we might have additional IndirectObject references as well.

ssjkamei commented 1 month ago

@stefan6419846 Thanks a lot!

ssjkamei commented 1 month ago

@stefan6419846 I have completed the corrections you indicated. Other omissions regarding single point heights have been added.

if abs(moved_height) > 0.8 str_height scale_prev_y:

Consider the case of a move for a new font: if abs(moved_height) > 0.8 * min(str_height * scale_prev_y, font_size * scale_y):

I have confirmed that the following five issues have been fixed in this bug fix.

1153, #1362, #1974, #2336, #2777

ssjkamei commented 1 month ago

Sorry, there was a coverage error, so I will delete the unnecessary lines.

ssjkamei commented 1 month ago

I fixed it.

ssjkamei commented 1 month ago

@stefan6419846 @pubpub-zz I am very sorry that you reviewed the code, but we have come up with a form of support for the following without including it in the cmap. The code is easier to understand and the process is more efficient. Would it be better to include this as well?

To fix this, we need to change from checking the length of the string length after conversion, which is what we are doing in this fix, to checking the length of the string length before conversion. In conjunction with this, I think including width in the contents of the cmap will speed up the process and save memory.

stefan6419846 commented 1 month ago

How much of the existing code from this PR would stay the same? If it is just an extension of the code from this PR, I would propose to merge this PR first (which already is an improvement) and do the further improvements in a separate PR to make reviewing easier.

ssjkamei commented 1 month ago

@stefan6419846 Thank you! I even tried to test the changes, but there were too many changes. I will issue a PR once the merge is complete.

ssjkamei commented 1 month ago

It looked like #234 could be closed too.