lublak / pdfdataextract

Extract data from a pdf with pure javascript
MIT License

Support for (not parsing) strikethrough #11

Open lkaniak opened 7 months ago

lkaniak commented 7 months ago

Hello,

does this package support a parse option to skip text with strikethrough? Is that feasible?

lublak commented 5 months ago

@lkaniak hi :) Theoretically, it would be possible with version 4.0

https://github.com/lublak/pdfdataextract/issues/10

Unfortunately, my free time is very limited, so I can hardly make any progress.

lublak commented 3 months ago

For documentation purposes only: PDF itself has no information about strikethroughs. A strikethrough is just path data that is drawn over the text. To recognise whether text is crossed out, the coordinates of the text and of the paths have to be compared. I don't think pdfdataextract will offer a function for this. But the complete data, i.e. where each piece of text is and where each path is, can be extracted with the future 4.0 version.
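
As a rough illustration of the coordinate check described above, here is a minimal sketch. It is not part of pdfdataextract: the `TextBox` and `Segment` shapes, the function names, and the thresholds are all hypothetical, and it assumes the text positions and path segments have already been extracted in the same coordinate space (PDF user space, y growing upwards, `y` being the bottom edge of the text box).

```typescript
// Hypothetical shapes; pdfdataextract does not expose these types.
interface TextBox {
  text: string;
  x: number;      // left edge
  y: number;      // bottom edge (assumption: y grows upwards)
  width: number;
  height: number;
}

interface Segment {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Returns true if the segment looks like a strikethrough for the box:
// roughly horizontal, inside the vertical middle band of the text,
// and covering at least `minCoverage` of the box width.
function isStruckThrough(box: TextBox, seg: Segment, minCoverage = 0.8): boolean {
  // Reject segments that are not roughly horizontal.
  if (Math.abs(seg.y1 - seg.y2) > box.height * 0.1) return false;

  // The line must run through the middle of the glyphs, not below (underline) or above.
  const lineY = (seg.y1 + seg.y2) / 2;
  if (lineY < box.y + box.height * 0.25 || lineY > box.y + box.height * 0.75) return false;

  // Measure how much of the box width the segment covers.
  const left = Math.max(Math.min(seg.x1, seg.x2), box.x);
  const right = Math.min(Math.max(seg.x1, seg.x2), box.x + box.width);
  const overlap = Math.max(0, right - left);
  return box.width > 0 && overlap / box.width >= minCoverage;
}

// Keeps only the text boxes that no segment strikes through.
function removeStruckText(boxes: TextBox[], segments: Segment[]): TextBox[] {
  return boxes.filter((box) => !segments.some((seg) => isStruckThrough(box, seg)));
}
```

The thresholds (middle 50% of the box height, 80% coverage) are guesses and would need tuning per document. Some PDF producers draw the strike as a thin filled rectangle rather than a stroked line, so such rectangles would have to be reduced to their centre line first.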