py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

ENH: accepts ETen-B5 and UniCNS-UTF16 encodings #2721

Closed pubpub-zz closed 1 week ago

pubpub-zz commented 2 weeks ago

Related to #2356

stefan6419846 commented 2 weeks ago

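There are three aspects I am not sure about:

* Do we really need the `TBC` comments inside the mapping?
* If we have possibly public PDF files, shouldn't we add at least a basic test.
* We should not close [PdfReader - Extract images from specific pages #2536](https://github.com/py-pdf/pypdf/discussions/2536) with this - there are still unsupported encodings left, as indicated by the "TBC" comments as well.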

pubpub-zz commented 2 weeks ago

> There are three aspects I am not sure about:
>
> * Do we really need the `TBC` comments inside the mapping?

The `TBC` comments are only there while waiting for feedback from @actuary-chen.

> * If we have possibly public PDF files, shouldn't we add at least a basic test.

I did not focus on this, as it should not be easily subject to regression, but I agree a test would be better.

> * We should not close [PdfReader - Extract images from specific pages #2536](https://github.com/py-pdf/pypdf/discussions/2536) with this - there are still unsupported encodings left, as indicated by the "TBC" comments as well.

I dislike the idea of having a catch-all issue on this subject: we need a test file to confirm the proper encoding in each case, so I would prefer a new issue to be raised case by case.
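For readers following along, here is a minimal sketch of the kind of mapping being discussed: predefined CMap names from the PR title paired with Python codecs. The dictionary name `CMAP_TO_CODEC`, the helper `decode_with_cmap`, and the exact codec choices are assumptions made for illustration; this is not a copy of pypdf's internal table. `cp950` is used as an approximation of Big5 with ETen extensions, and `utf-16-be` for the UniCNS-UTF16 CMaps.

```python
# Hypothetical illustration only: the names and codec choices below are
# assumptions, not pypdf's actual internal mapping.
CMAP_TO_CODEC = {
    "/ETen-B5-H": "cp950",           # Big5 with ETen extensions, horizontal
    "/ETen-B5-V": "cp950",           # same encoding, vertical writing mode
    "/UniCNS-UTF16-H": "utf-16-be",  # UTF-16BE for the Adobe-CNS1 collection
    "/UniCNS-UTF16-V": "utf-16-be",
}


def decode_with_cmap(raw: bytes, cmap_name: str) -> str:
    """Decode raw string bytes with the codec assumed for a predefined CMap."""
    try:
        codec = CMAP_TO_CODEC[cmap_name]
    except KeyError:
        raise ValueError(f"no codec known for CMap {cmap_name!r}") from None
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return raw.decode(codec, errors="replace")


# Round-trip check with a short Traditional Chinese sample.
sample = "中文測試"
assert decode_with_cmap(sample.encode("cp950"), "/ETen-B5-H") == sample
assert decode_with_cmap(sample.encode("utf-16-be"), "/UniCNS-UTF16-H") == sample
```

A round trip like this only shows that the codec itself works; confirming that a given CMap name really corresponds to that codec still needs a real PDF file, which is the point raised above.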

pubpub-zz commented 2 weeks ago

I've removed all `TBC` comments. Let's wait a little for feedback from @actuary-chen on the last entries.

codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.14%. Comparing base (a512408) to head (fdbf37c). Report is 1 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #2721   +/-   ##
=======================================
  Coverage   95.14%   95.14%
=======================================
  Files          51       51
  Lines        8547     8547
  Branches     1703     1703
=======================================
  Hits         8132     8132
  Misses        261      261
  Partials      154      154
```


stefan6419846 commented 2 weeks ago

> I dislike the idea of having a catch-all issue on this subject: we need a test file to confirm the proper encoding in each case, so I would prefer a new issue to be raised case by case.

I initially opened the corresponding issue to discuss how this could be done in general or whether there might be any official test documents which would allow us to cover all cases without having lots of small commits for it.

actuary-chen commented 1 week ago

I can only confirm that the message "pypdf._cmap: implementation of advance cmap ...." no longer appears. However, I cannot verify whether the text is decoded correctly, because I pass it through an embedding model into a vector database.
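Checking that the message no longer appears can be automated. Below is a minimal sketch that collects warnings from a named logger while text is extracted. The logger name `pypdf._cmap` is an assumption taken from the message prefix quoted above, and the helper names `capture_warnings_from` and `extract_text_and_warnings` are hypothetical. Only `PdfReader`, `reader.pages`, and `page.extract_text()` are pypdf calls; the rest is standard library.

```python
import logging
from contextlib import contextmanager


class _ListHandler(logging.Handler):
    """Logging handler that stores formatted messages in a list."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


@contextmanager
def capture_warnings_from(logger_name):
    """Yield a list that fills with warnings logged under logger_name."""
    logger = logging.getLogger(logger_name)
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        yield handler.messages
    finally:
        logger.removeHandler(handler)


def extract_text_and_warnings(pdf_path):
    """Extract all page text and return it with any captured warnings."""
    from pypdf import PdfReader  # imported lazily; requires pypdf installed

    # Assumption: the CMap warnings are logged under "pypdf._cmap".
    with capture_warnings_from("pypdf._cmap") as messages:
        reader = PdfReader(pdf_path)
        text = "\n".join(page.extract_text() for page in reader.pages)
    return text, list(messages)
```

An empty warning list only shows that no unsupported-encoding message was logged; it does not prove that the decoded text is correct.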


actuary-chen commented 1 week ago

It looks good after retrieving some text from the database.
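A spot check like this can be complemented by a rough numeric sanity check on the extracted text. The sketch below assumes the source documents are mostly Chinese, so a low share of CJK ideographs or any U+FFFD replacement characters would hint at a decoding problem. The function name `cjk_sanity_report` is hypothetical, and the heuristic cannot confirm correctness; it can only flag obviously wrong output.

```python
def cjk_sanity_report(text):
    """Return simple statistics that hint at whether CJK text decoded sanely."""
    non_space = sum(1 for ch in text if not ch.isspace())
    # CJK Unified Ideographs basic block only; extension blocks are ignored.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return {
        "non_space": non_space,
        "cjk_ratio": cjk / non_space if non_space else 0.0,
        "replacement_chars": text.count("\ufffd"),
    }


report = cjk_sanity_report("中文測試 PDF")
print(report)
```

What counts as a healthy ratio depends on the documents, so any threshold should be chosen by comparing against a file whose text is known to be correct.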
