opendocument-app / OpenDocument.droid

It's Android's first OpenOffice Document Reader!
https://opendocument.app/
Other
266 stars 36 forks source link

Text is Not Rendered in Certain PDFs #348

Closed TechD123 closed 4 days ago

TechD123 commented 5 months ago

Since updating from 3.19 to the recently published 3.26 (via F-Droid), certain PDFs don't render any text. As my main use case for this app is its PDF reading feature, I have reverted to the previous version.

I'll share a link to a sample PDF once I find one that's affected. So far I've only seen this on PDFs that would leak PII if I shared them here... :P

andiwand commented 5 months ago

Thank you for reporting this @TechD123 ! If you don't want to share the files publicly you can send them to us via mail if that is an option. Otherwise we can wait for another file which has this issue

TomTasche commented 5 months ago

I've seen the same behavior, but can't share those PDFs publicly either. @andiwand @ViliusSutkus89 let me know if you are interested in taking a look.

andiwand commented 5 months ago

can you put it into the private testing repo? otherwise you can mail it to me @TomTasche

ViliusSutkus89 commented 5 months ago

If I could reproduce it, I could tell if the problem is in upstream pdf2htmlEX or if it's on our side

ViliusSutkus89 commented 4 months ago

@TomTasche tried those pdf's on pdf2htmlEX docker image available on docker hub and it gives the same result. Although that official pdf2htmlEX release docker image is 4 years old and using outdated Poppler and FontForge.

I've inspected the HTML DOM, all the text is there, but hidden by CSS. I assume it's a font issue, will try to debug

TomTasche commented 4 months ago

Since PDF is quite important for our users, I'd actually propose to set a bounty for this issue. @andiwand @ViliusSutkus89

I'd love to have a fix for this soon.

ViliusSutkus89 commented 4 months ago

pdf2htmlEX-Android conan is nearly there. Then I can focus on debugging upstream pdf2htmlEX

ViliusSutkus89 commented 4 months ago

Did some debugging, usually this error happens when fontforge errors out while trying to save a "malformed" font. Don't know how much of that malform is actually bad font and how much of that is bad parsing. Will see what I can do to workaround

TomTasche commented 3 months ago

A fix for that is rolling out now, thanks a lot @ViliusSutkus89 ! :tada:

TechD123 commented 2 weeks ago

After the aforementioned 3.26, the next/latest version on F-Droid is now 3.31 which I just gave a spin (was on 3.19 the entire time before).

Unfortunately, a similar issue has appeared, where certain PDFs will either open on a blank screen or render only seemingly random words or letters. I have attached a few for convenience, but the error rate seems far higher than 3.26 based on my gut feeling. verylong.pdf long.pdf short.pdf

I hope you'll give it another look :)

ViliusSutkus89 commented 2 weeks ago

True. Can reproduce on all three files

ViliusSutkus89 commented 1 week ago

OK, I've debugged the issue. When opening documents we first try to load them through OdfLoader, if that fails, only then we try pdf2htmlEX-Android or other loaders. For most of the pdf documents OdfLoader fails. But for these documents, OdfLoader doesn't fail, and since it doesn't fail, pdf2htmlEX isn't called. Yes, the output is totally broken when parsed by OdfLoader, but OdfLoader thinks that it's ok.

I've also noticed that we haven't updated pdf2htmlEX and wvWare versions here, but it's not the cause of the issue.

This issue will automagically go away once we solve #369 , which is blocked by https://github.com/opendocument-app/OpenDocument.core/pull/387 . Will try to get on it this week

andiwand commented 1 week ago

@TomTasche should we just surround this by detecting if it is a PDF on the platform side?

TomTasche commented 1 week ago

OK, I've debugged the issue. When opening documents we first try to load them through OdfLoader, if that fails, only then we try pdf2htmlEX-Android or other loaders. For most of the pdf documents OdfLoader fails. But for these documents, OdfLoader doesn't fail, and since it doesn't fail, pdf2htmlEX isn't called. Yes, the output is totally broken when parsed by OdfLoader, but OdfLoader thinks that it's ok.

Thanks for the investigation! I found a (dirty) workaround for now in https://github.com/opendocument-app/OpenDocument.droid/pull/377 and am rolling it out as we speak.

@andiwand can you check why the core does not fail for these PDFs? I can't remember exactly, but I don't think that's "working as intended".

TomTasche commented 1 week ago

(Note that my change does not fix the original issue reported here. I think that's an actual bug in the PDF parser? @ViliusSutkus89)

ViliusSutkus89 commented 1 week ago

@TomTasche , which one looks weird? Just tried v3.32 app-pro-release.apk from GitHub releases and all 4 files (verylong.pdf, long.pdf, short.pdf and the emailed yourdocument.pdf) look decent to me.

TomTasche commented 1 week ago

Nice! I haven't tested the documents in this thread, because I assumed they were caused by a different issue.

@TechD123 can you confirm this is resolved with the latest version?

andiwand commented 1 week ago

@andiwand can you check why the core does not fail for these PDFs? I can't remember exactly, but I don't think that's "working as intended".

I think the current version of the core will try to open and translate all PDFs. We have no way of telling it not to try. I think the only way is to catch this upfront.

TechD123 commented 5 days ago

Can confirm that PDFs open flawlessly with 3.32-pro. Really appreciate the quick support and hoping it'll be available in F-Droid soon.

The only real gripe I have now is how it takes a couple seconds to load PDFs, but this certainly is a minor issue in comparison.

TomTasche commented 4 days ago

The only real gripe I have now is how it takes a couple seconds to load PDFs, but this certainly is a minor issue in comparison.

That should be somewhat addressed by https://github.com/opendocument-app/OpenDocument.droid/pull/369 and all the improvements that would follow after that.