py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/
Other
8.41k stars 1.42k forks source link

MAINT: Generalize the method of obtaining space_code #2891

Closed ssjkamei closed 1 month ago

ssjkamei commented 1 month ago

I made changes to separate the part that gets the space code from the part that analyzes it for clarity. I hope you like it.

I was wondering about the space_code determination for type1_alternative. Perhaps the comparison of the if statement itself is not being used, as I think the content is inadequate.

            if words[2].decode() == b" ":
                space_code = i
codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 96.43%. Comparing base (fcb103a) to head (d6a6346). Report is 1 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2891 +/- ## ========================================== + Coverage 96.36% 96.43% +0.07% ========================================== Files 52 52 Lines 8739 8724 -15 Branches 1727 1721 -6 ========================================== - Hits 8421 8413 -8 + Misses 186 182 -4 + Partials 132 129 -3 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ssjkamei commented 1 month ago

At the point where space_code is returned, there is logic that is incorrectly returning Width, and when this part of the code is erased, it no longer enters the code below. I was not sure if I should delete it, but I did because it would cause an error in code coverage. I don't think I will put back no encoding on the code.

The part that was wrong: return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]

def parse_encoding(
    ft: DictionaryObject, space_code: int
) -> Tuple[Union[str, Dict[int, str]], int]:
    encoding: Union[str, List[str], Dict[int, str]] = []
    if "/Encoding" not in ft:
        try:
            if "/BaseFont" in ft and cast(str, ft["/BaseFont"]) in charset_encoding:
                encoding = dict(
                    zip(range(256), charset_encoding[cast(str, ft["/BaseFont"])])
                )
            else:
                encoding = "charmap"
            return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]
        except Exception:
            if cast(str, ft["/Subtype"]) == "/Type1":
                return "charmap", space_code
            else:
                return "", space_code
.....

Codes that were erased:

    # encoding can be either a string for decode
    # (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)
    # if empty string, it means it is than encoding field is not present and
    # we have to select the good encoding from cmap input data
    if encoding == "":
        if -1 not in map_dict or map_dict[-1] == 1:
            # I have not been able to find any rule for no /Encoding nor /ToUnicode
            # One example shows /Symbol,bold I consider 8 bits encoding default
            encoding = "charmap"
        else:
            encoding = "utf-16-be"
ssjkamei commented 1 month ago

Please review the code check as it has passed.