Open hallsten opened 2 years ago
Hi @hallsten !
Thank you for the issue.
Yes, the soft hyphens should not be omitted, but normally they should not be visible except at the end of a line. I might be wrong, but your example bitmap looks like it includes a normal hyphen?
Would it be possible to share an error case to work on the problem?
I think the goal would be for the @CONTENT attribute to have a string value that includes the soft hyphens, even though these soft hyphens would not be visible in a text editor.
Replacing a soft hyphen with a regular hyphen would really change the string (soft hyphens are just an indication of where a line may be broken), so if we really can't manage soft hyphens, I suppose it's better to remove them entirely.
Thanks for your reply! Here is an example of the problem. soft-hyphens.pdf
pdf2json (https://www.npmjs.com/package/pdf2json), for example, parses this string as 23%C2%AD3%C2%AD2022, and pdftotext would replace the character with <0xad>. It would be great to have some kind of delimiter instead of removing the character; I'm trying to standardize a date, and that will be impossible without one.
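To illustrate why keeping the character matters: the sequence %C2%AD is the percent-encoded UTF-8 form of U+00AD (SOFT HYPHEN), so once it is preserved it can serve as the delimiter. Below is a minimal Python sketch of that post-processing step. The function name `normalize_date` and the day-month-year order are assumptions made for this example; neither comes from pdf2json or from this tool.

```python
from datetime import datetime
from urllib.parse import unquote

SOFT_HYPHEN = "\u00ad"  # U+00AD, encoded in UTF-8 as the bytes C2 AD


def normalize_date(raw: str) -> str:
    """Turn a percent-encoded, soft-hyphen-delimited date into ISO 8601.

    Hypothetical helper: assumes the fields are day, month, year.
    """
    text = unquote(raw)                      # "%C2%AD" becomes U+00AD
    text = text.replace(SOFT_HYPHEN, "-")    # use a visible delimiter
    return datetime.strptime(text, "%d-%m-%Y").date().isoformat()


result = normalize_date("23%C2%AD3%C2%AD2022")
print(result)
```

If the soft hyphens are dropped during extraction, the same input collapses to 2332022, and the day and month boundaries can no longer be recovered reliably.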
Thank you for this great tool.
I am looking for a solution to this as well. In my use case it would be better to have the soft hyphens all replaced by real (hard) hyphens at the exact same position. If that is an option too, just add a parameter that does exactly that; some people would be very happy.
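Until such a parameter exists, and assuming the soft hyphens are kept in the output, the replacement could be done as a post-processing step on the XML. The following is only a sketch: `harden_soft_hyphens` is a hypothetical helper, not an option of this tool, and it simply rewrites every attribute named CONTENT regardless of element name or namespace.

```python
import xml.etree.ElementTree as ET

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def harden_soft_hyphens(alto_xml: str, replacement: str = "-") -> str:
    """Replace soft hyphens in every CONTENT attribute (hypothetical helper).

    Pass replacement="" to strip the soft hyphens instead.
    """
    root = ET.fromstring(alto_xml)
    for element in root.iter():
        content = element.get("CONTENT")
        if content is not None and SOFT_HYPHEN in content:
            element.set("CONTENT", content.replace(SOFT_HYPHEN, replacement))
    return ET.tostring(root, encoding="unicode")


# Made-up fragment for illustration; real output has more structure.
sample = '<TextLine><String CONTENT="23\u00ad3\u00ad2022"/></TextLine>'
hardened = harden_soft_hyphens(sample)
print(hardened)
```

Note that this changes the string value, as discussed above: a word that was only hyphenated for line breaking would gain a literal hyphen, so the replacement is best applied only where the soft hyphen is known to act as a separator.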
First, thanks for a great tool!
I have problems with soft hyphens being omitted:
Resulting in:
Is this intentional? Or would it be possible to replace soft hyphens with regular hyphens?