kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

Soft hyphens omitted #149

Open hallsten opened 2 years ago

hallsten commented 2 years ago

First, thanks for a great tool!

I have problems with soft hyphens being omitted:

[image]

Resulting in:

<TextLine HEIGHT="8.3970" HPOS="88.4409" ID="p1_t24" VPOS="235.408" WIDTH="40.9320">
    <String CONTENT="2332022" HEIGHT="8.3970" HPOS="88.4409" ID="p1_w51" STYLEREFS="font4" VPOS="235.408" WIDTH="40.9320"/>
</TextLine>

Is this intentional? Or would it be possible to replace soft hyphens with regular hyphens?

kermitt2 commented 2 years ago

Hi @hallsten !

Thank you for the issue.

Soft hyphens should indeed not be omitted, but normally they are not visible except at the end of a line. I might be wrong, but the bitmap in your example looks like it contains regular hyphens?

Would it be possible to share an error case to work on the problem?

I think the goal would be for the @CONTENT attribute to hold a string value that includes the soft hyphens, although these soft hyphens would not be visible in a text editor.

Replacing soft hyphens with regular hyphens would really change the string (soft hyphens only indicate where a line may be broken), so if we really can't handle soft hyphens, I suppose it's better to remove them entirely.
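To make the three options discussed above concrete, here is a minimal Python sketch that applies each one to the date string from this issue. The function name `handle_soft_hyphens` and the policy names are hypothetical and only illustrate the trade-off; they are not part of pdfalto.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, usually invisible when rendered


def handle_soft_hyphens(text: str, policy: str = "keep") -> str:
    """Apply one of three hypothetical policies to soft hyphens in text."""
    if policy == "keep":
        # Leave the string unchanged; the soft hyphens stay in the data.
        return text
    if policy == "replace":
        # Turn each soft hyphen into a visible hyphen-minus.
        return text.replace(SOFT_HYPHEN, "-")
    if policy == "remove":
        # Drop the soft hyphens entirely.
        return text.replace(SOFT_HYPHEN, "")
    raise ValueError(f"unknown policy: {policy!r}")


raw = "23" + SOFT_HYPHEN + "3" + SOFT_HYPHEN + "2022"
for policy in ("keep", "replace", "remove"):
    result = handle_soft_hyphens(raw, policy)
    # repr() makes the otherwise invisible characters show up as \xad.
    print(policy, repr(result), len(result))
```

With "keep" the separators survive in the data even though most editors will not display them; "remove" yields the same digits-only string as the output reported in this issue.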

hallsten commented 2 years ago

Thanks for your reply! Here is an example of the problem: soft-hyphens.pdf

pdf2json (https://www.npmjs.com/package/pdf2json), for example, parses this string as 23%C2%AD3%C2%AD2022, and pdftotext replaces the character with <0xad>. It would be great to have some kind of delimiter instead of removing the character. I'm trying to standardize a date, and that will be impossible without one.
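As an illustration of why the delimiter matters, the sketch below decodes the percent-encoded string quoted above and normalizes it to an ISO date. The helper name `parse_soft_hyphen_date` is hypothetical, and the day-month-year field order is an assumption based on this one example.

```python
from datetime import date, datetime
from urllib.parse import unquote


def parse_soft_hyphen_date(encoded: str) -> date:
    """Decode a percent-encoded string and parse it as a day-month-year date."""
    # %C2%AD is the UTF-8 encoding of U+00AD (soft hyphen).
    decoded = unquote(encoded)
    # Use the soft hyphens as field separators by making them visible.
    normalized = decoded.replace("\u00ad", "-")
    # Assumption: fields are ordered day, month, year.
    return datetime.strptime(normalized, "%d-%m-%Y").date()


print(parse_soft_hyphen_date("23%C2%AD3%C2%AD2022").isoformat())  # 2022-03-23
```

Without the separators, the string 2332022 cannot be split into day, month, and year unambiguously.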

Seehafengepard commented 1 year ago

Thank you for this great tool.

I am looking for a solution to this as well. In my use case it would be better to have all soft hyphens replaced by real (hard) hyphens at the exact same position. If that is an option too, just add a parameter that does exactly that; some people would be very happy.
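Until such a parameter exists, one possible workaround is a post-processing step over the ALTO output. The sketch below is only an assumption-laden illustration: it presumes the converter has kept the soft hyphens in the CONTENT attribute (which, per this issue, is not the case today), and the function name `harden_soft_hyphens` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def harden_soft_hyphens(alto_xml: str) -> str:
    """Replace soft hyphens with hyphen-minus in every String/@CONTENT."""
    root = ET.fromstring(alto_xml)
    for element in root.iter():
        # Compare the local name so the check works with or without a namespace.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        content = element.get("CONTENT")
        if content and SOFT_HYPHEN in content:
            element.set("CONTENT", content.replace(SOFT_HYPHEN, "-"))
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: the String from this issue with soft hyphens preserved.
sample = (
    '<TextLine ID="p1_t24">'
    '<String CONTENT="23&#xAD;3&#xAD;2022" ID="p1_w51"/>'
    '</TextLine>'
)
print(harden_soft_hyphens(sample))
```

Note that the geometry attributes (WIDTH, HPOS) are left untouched here, so they would still describe the original glyphs rather than the replaced ones.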