Closed ssjkamei closed 1 month ago
I forgot one more thing. I don't understand what the 15
specified here means after all.
I forgot one more thing. I don't understand what the
15
specified here means after all.
this is a long time ago and I must admit that I cannot remember why this value. will try to gather my memories
Sorry, I added the code for CIDFont, but it seems to need mapping to cmap[1]
.
Also, I am calculating the font each time, which is very inefficient.
It doesn't seem to support vertical type characters, but since it seems to be from the original, I don't intend to include that in this PR.
CIDFont is supported based on the PDF in the following Issue.
Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336 I think we can close #2336 as well.
The following will remain, but the spaces are moving correctly (because the letters are shaved, the distance traveled is greater than the letters, and the spaces are showing)
Expected: R. Ottokar Doerffel, 1489, Atiradores
Actual: R. O okar Doerff el, 1489, Ar adores
This is probably analogous to the following problem. Ligature issue when converting PDF to text #1351
Dealing with the calculation of fonts each time will be revised later.
Dealing with the calculation of fonts each time will be revised later.
I have divided the functions and put them together. The code still calculates each time. Do you have any good ideas on how to store the data?
I have created a function called build_font_width_map
, which should only need to be called once when checking space. However, it is recalculated every time Tj
in _page.py
is called.
I thought about adding it to the cmap, but I thought this would be a lot of changes and you all would not like it.
Also, I don't know how to fix the following two points.
The number of spaces seems to be partially different in layout mode. Is this acceptable?
I get the following error, but I don't know exactly how to fix it.
pypdf/_cmap.py:464: error: No overload variant of “list” matches argument type “PdfObject” [call-overload].
pypdf/_cmap.py:464: note: Possible overload variants: “PdfObject” [call-overload].
pypdf/_cmap.py:464: note: def [_T] __init__(self) -> list[_T]
pypdf/_cmap.py:464: note: def [_T] __init__(self, Iterable[_T], /) -> list[_T]
Found 2 errors in 1 file (checked 51 source files)
I have not yet been able to see the following properly, but it seems to have been fixed.
I am judging by the position specified by Td and Tm, but perhaps I am judging both as if they were absolute coordinates, > > > causing a misalignment, but I don't know how to correct it correctly.
Target String: สวัสดีชาวโลก
Sorry for the continuous posting.
There is another test case that had an error, which is failing due to a space after the 1. I would like the file for this one to determine if the original is correct, etc. How can I get it?
filename: unixxx_glyphs.pdf
error: 1′′.2
!= 1 ′′.2
@pytest.mark.enable_socket()
def test_unixxx_glyphs():
reader = PdfReader(BytesIO(get_data_from_url(name="unixxx_glyphs.pdf")))
txt = reader.pages[0].extract_text() # no error
for pat in ("闫耀庭", "龚龑", "张江水", "1′′.2"):
> assert pat in txt
I have divided the functions and put them together. The code still calculates each time. Do you have any good ideas on how to store the data? I have created a function called build_font_width_map, which should only need to be called once when checking space. However, it is recalculated every time
Tj
in_page.py
is called.
I don't know if it's good, but I made the correction. As for the other issues, they seem difficult for me to fix.
I think I can fix the Check code style issues
part if you give me some help.
I think the corrected part has been changed to a process closer to the PDF 1.7 specification.
Thanks for the PR.
As for the other issues, they seem difficult for me to fix.
I think I can fix the Check code style issues part if you give me some help.
I have added a corresponding review comment for the code style part.
The tests itself are failing due to your changes to the whitespace handling and thus differing output. It the new output looks good/better, you can update the expected outputs, otherwise we might have to review where your changes introduce an unwanted behavior.
I don't know how to get the PDF. The corresponding PDF could not be downloaded when cloning the repository.
test:test_unixxx_glyphs
reader = PdfReader(BytesIO(get_data_from_url(name="unixxx_glyphs.pdf")))
test:test_text_extraction_layout_mode
SAMPLE_ROOT / "026-latex-multicolumn/multicolumn.pdf", SAMPLE_ROOT / "010-pdflatex-forms/pdflatex-forms.pdf",
Perhaps the same file was found on the Internet. https://www.aanda.org/articles/aa/pdf/2022/03/aa42891-21.pdf
I am assuming that this is the same file, but in my local environment, it appears to be displayed as 1.′′2
(the dots are reversed).
The read distance is also not nearly wrong, and there are no spaces in this.
I'm not sure if the files are different or if it depends on the environment, but would you be able to reproduce this?
If anyone can reproduce it, I would ask for confirmation from someone who can.
Also, is the output of 1′′.2
correct, although it is output like a ligature in the first place?
However, when I checked the original HTML literature that is not PDF, it seemed to be described as 1′′.2
.
Any information about other files would be appreciated.
pypdf uses four different approaches for test files:
resources
directory. Should be obvious from the path inside the test.sample-files
directory/repository. Should be obvious from the path inside the test.get_data_from_url(name="iss2077.pdf")
. Their URLs are available at https://github.com/py-pdf/pypdf/blob/main/tests/example_files.yamlIn theory, you should not have to care about this however, when you did a regular local development setup for pypdf, including pulling the submodule and downloading the required file set beforehand. This is documented at https://pypdf.readthedocs.io/en/stable/dev/intro.html and https://pypdf.readthedocs.io/en/stable/dev/testing.html
Thanks, that's very helpful. I will download the file and check it out.
I was able to fix it! I think it fixed the whitespace problem in many PDFs.
@stefan6419846 Thanks a lot! @pubpub-zz Thanks for checking! There is a problem with the mysterious number that multiplies by 15, but the code has been changed to assume even individual font sizes that were not seen before. So I don`t feel there is much to worry about.
Please review the code!
Attention: Patch coverage is 99.13793%
with 1 line
in your changes missing coverage. Please review.
Project coverage is 96.36%. Comparing base (
8e1799e
) to head (03eb1cb
). Report is 1 commits behind head on main.
Files with missing lines | Patch % | Lines |
---|---|---|
pypdf/_text_extraction/__init__.py | 95.65% | 0 Missing and 1 partial :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Thanks for your PR. I have left some remarks I stumbled upon, but will wait for @pubpub-zz as well to have a look at this.
I don't know how to handle PDFObject, but I would like some help. Most of the old code is not taken into account and can be fixed if you get one pattern.
Are there any rules that should take over for casting PdfObjects? If I refer to _data_structures.py, does it have all the types listed and do I just check all the types that can be cast? Or is there a document that can determine what types /W and /Widths can have?
I would be grateful for tips on how to get started.
I noticed that when creating a combination of character map and width length, there is a pattern where the value after character map conversion is not represented by a single character. Current tests pass, but this is probably a bug.
To fix this, we need to change from checking the length of the string length after conversion, which is what we are doing in this fix, to checking the length of the string length before conversion. In conjunction with this, I think including width in the contents of the cmap will speed up the process and save memory.
We do not plan to include it in this modification along with the vertical character support (for /W2
and /DW2
).
Or is there a document that can determine what types /W and /Widths can have?
This is documented inside the PDF reference and mapped into actual classes by pypdf. As a general rule of thumb (at least for new code): Besides the expected types, we might have additional IndirectObject
references as well.
@stefan6419846 Thanks a lot!
@stefan6419846 I have completed the corrections you indicated. Other omissions regarding single point heights have been added.
if abs(moved_height) > 0.8 str_height scale_prev_y:
Consider the case of a move for a new font: if abs(moved_height) > 0.8 * min(str_height * scale_prev_y, font_size * scale_y):
I have confirmed that the following five issues have been fixed in this bug fix.
Sorry, there was a coverage error, so I will delete the unnecessary lines.
I fixed it.
@stefan6419846 @pubpub-zz I am very sorry that you reviewed the code, but we have come up with a form of support for the following without including it in the cmap. The code is easier to understand and the process is more efficient. Would it be better to include this as well?
To fix this, we need to change from checking the length of the string length after conversion, which is what we are doing in this fix, to checking the length of the string length before conversion. In conjunction with this, I think including width in the contents of the cmap will speed up the process and save memory.
How much of the existing code from this PR would stay the same? If it is just an extension of the code from this PR, I would propose to merge this PR first (which already is an improvement) and do the further improvements in a separate PR to make reviewing easier.
@stefan6419846 Thank you! I even tried to test the changes, but there were too many changes. I will issue a PR once the merge is complete.
It looked like #234 could be closed too.
Closes #1153
The change
if abs(moved_height) > 0.8 * f:
at line 129 of thecrlf_space_check
function changes the function to look at both the bottom and top misalignment, but if you prefer not to change it here, I commit the reverted code.The test for hello-world.pdf fails. I think a hidden bug has surfaced, but I don't seem to have read enough of the documentation to determine how to address it from within the PDF 1.7 specification.
I am judging by the position specified by
Td
andTm
, but perhaps I am judging both as if they were absolute coordinates, causing a misalignment, but I don't know how to correct it correctly.Target String:
สวัสดีชาวโลก