py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/
Other
7.73k stars 1.36k forks source link

Functionality of b_ #2726

Open j-t-1 opened 1 week ago

j-t-1 commented 1 week ago
B_CACHE: Dict[Union[str, bytes], bytes] = {}

def b_(s: Union[str, bytes]) -> bytes:
    if isinstance(s, bytes):
        return s
    bc = B_CACHE
    if s in bc:
        return bc[s]
    try:
        r = s.encode("latin-1")
        if len(s) < 2:
            bc[s] = r
        return r
    except Exception:
        r = s.encode("utf-8")
        if len(s) < 2:
            bc[s] = r
        return r

It is non-obvious why we are only caching strings of length zero (the empty string) and of length one.

There is a need to add comments to this function.

pubpub-zz commented 1 week ago

I'm not sure this is at will. you may found a way to improve performances...

stefan6419846 commented 1 week ago

This has been added/changed in #124 by @mozbugbox.

j-t-1 commented 1 week ago

This has been added/changed in #124 by @mozbugbox.

The first of the commit messages is "Speedup b_() for short string for Python3", which suggests the current functionality is deliberate.

The code tries latin-1 and if there is an exception uses utf-8. Given the rise of utf-8, I think we should just use utf-8.

stefan6419846 commented 1 week ago

@j-t-1 I recommend you to further dig into the commit history if unsure about specific cases - the latin-1 workaround has been required for some files. Doing a "blame" shows you that the corresponding change is from #463 and there already has been a request for an example file about two months ago: https://github.com/py-pdf/pypdf/pull/463#issuecomment-2067587535

j-t-1 commented 3 days ago

@pubpub-zz
Line 226 of test_generic.py: # to test latin-1 aka stdencoding

Is this equivalent to this (and if so we could change it)? # to test PDFDocEncoding (latin-1)

pubpub-zz commented 3 days ago

Aka =also known as Your rewording is correct