kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

Error case, missing digits #163

Open kermitt2 opened 6 months ago

kermitt2 commented 6 months ago

The ALTO file generated from the attached PDF does not include digits! The stock xpdf library works fine.

1909.13722.pdf

Aazhar commented 6 months ago

Here are additional examples: 2006.09734.pdf, 2001.04340.pdf

clason commented 5 months ago

(Managing editor of that overlay journal here, and the one responsible for the style file 👋 )

Note that copying them from a PDF viewer works fine -- how do you extract these numbers?

clason commented 5 months ago

It seems you're using Xpdf's pdftotext, which fails to read the oldstyle figures. Not sure there's anything that can be done on your side, except replacing it...

clason commented 5 months ago

Aha! It seems that the pdftotext from poppler (https://gitlab.freedesktop.org/poppler/poppler, which is forked from Xpdf 3) can extract oldstyle figures just fine! Maybe you can switch to that (or allow users to provide their own pdftotext)?

1909.13722.txt

clason commented 3 weeks ago

@kermitt2 Any news on this? Is there anything I can help? This is a big issue for us, so I would love to see this resolved.

lfoppiano commented 3 weeks ago

@clason If poppler is a fork of xpdf 3, it might be tough to integrate it directly or to give users the ability to plug it in. Are you familiar with C++ programming? I wonder whether xpdf 4.05 could provide a solution instead.

PS: I'm slowly trying to help maintain this package, but my time is limited and I'm not a C++ developer, so any help would be much appreciated ;-)

clason commented 3 weeks ago

Hard to tell; I just looked at the CLI tools, not the underlying library. (And I tested with xpdf 4.05, that makes no difference.) I'm not a C++ programmer myself.

Maybe it would be possible to allow people to provide a manually converted txt file so users could simply use the correct pdftotext CLI tool as a pre-step in their workflow?
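A minimal sketch of what such a pre-step could look like, assuming a poppler build of `pdftotext` is installed. The function names and the example binary path are hypothetical, and pdfalto itself has no option to consume the resulting text file; only the conversion step and a quick digit sanity check are shown.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, binary="pdftotext"):
    """Build the command line for a user-chosen pdftotext binary."""
    # -enc UTF-8 is accepted by both the Xpdf and the poppler pdftotext.
    return [binary, "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def convert(pdf_path, txt_path, binary="pdftotext"):
    """Run pdftotext and return the extracted text."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"pdftotext binary not found: {binary}")
    subprocess.run(pdftotext_command(pdf_path, txt_path, binary), check=True)
    return Path(txt_path).read_text(encoding="utf-8")


def count_digits(text):
    """Count ASCII digits, as a quick check that figures survived extraction."""
    return sum(ch in "0123456789" for ch in text)
```

Calling `count_digits(convert("1909.13722.pdf", "1909.13722.txt", binary=...))` once with each `pdftotext` build would make the difference described above directly comparable.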

clason commented 3 weeks ago

(If that sweetens the deal: poppler is actually hosted as a public repository so is easier to work with and doesn't need to be vendored.)

kermitt2 commented 3 weeks ago

Hello!

These digit characters correspond to font glyphs that are not mapped correctly to Unicode. So what needs to be done is to examine the PDF, identify the font used for these digit characters, and look at the Unicode mapping for these "digit" values (the ToUnicode CMap for this font might be problematic or missing). We could look at why those values were correctly mapped in Xpdf version 3 and no longer in version 4.0 (there are extra mappings and heuristics for this in xpdf).
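As a rough illustration of that inspection step, the sketch below parses an already decoded ToUnicode CMap and lists the digits that no character code maps to. It is an assumption-laden simplification: the CMap stream is assumed to have been extracted and decompressed with some other tool, only the simple `bfchar` and non-array `bfrange` forms are handled, destinations are assumed to be UTF-16BE hex strings (single BMP code points for ranges), and `SAMPLE_CMAP` is made-up data, not taken from the PDFs in this issue.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Return {char_code: unicode_string} from a decoded ToUnicode CMap."""
    mapping = {}
    # bfchar sections hold <source code> <destination> pairs.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange sections hold <low> <high> <first destination> triples.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            first = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(first + offset)
    return mapping


def unmapped_digits(mapping):
    """List the ASCII digits that no character code maps to."""
    mapped = set("".join(mapping.values()))
    return [d for d in "0123456789" if d not in mapped]


# Hypothetical CMap fragment: two letters, and only the digits 0-4.
SAMPLE_CMAP = """
2 beginbfchar
<01> <0041>
<02> <0042>
endbfchar
1 beginbfrange
<10> <14> <0030>
endbfrange
"""

print(unmapped_digits(parse_tounicode(SAMPLE_CMAP)))
```

If a font's CMap turned out to cover all ten digits, the problem would more likely lie elsewhere, for example in how glyph names are resolved when no ToUnicode entry is used.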

Moving back to an unmaintained, 10-year-old version of Xpdf is not a solution ;) ... nor is using Poppler, I think, given that Xpdf has been very well maintained since version 4.0 (so Poppler is no longer a clearly relevant replacement), and it would mean more or less rewriting pdfalto entirely.

@clason This will certainly be fixed in pdfalto at some point, but I am wondering whether the font used for these digits in your LaTeX package is something particular and could be replaced by a more standard font?

clason commented 3 weeks ago

> We could look at why those values were correctly mapped in Xpdf version 3 and no longer in version 4.0 (there are extra mappings and heuristics for this in xpdf).

I don't think this is a regression but rather part of the better support that the poppler fork has received. And for the record: the suggestion was never to revert to Xpdf 3. (I will take your word for it that Xpdf 4 is much better maintained and responsive to issues. Personally, I'm a bit concerned about the commercial entity behind Xpdf and its motivation for open source development -- it would probably also affect the possibility of backporting patches from poppler. But it's your project, and I certainly understand the effort argument.)

> This will certainly be fixed in pdfalto at some point, but I am wondering whether the font used for these digits in your LaTeX package is something particular and could be replaced by a more standard font?

This is a standard font (Linux Libertine), and changing it is unfortunately not an option for us (as it's part of the visual identity and chosen deliberately for typographic reasons). Switching now wouldn't help with the already published articles either. Again, the issue is the use of oldstyle figures, not the font itself.

I am more than willing to help dig into the font mappings, but the fact that poppler extracts them correctly and Xpdf doesn't indicates that this is something the latter should backport from the former. I am not sufficiently familiar with either project (or C++ in general) to do that, though -- someone else would have to take care of that part. The CharCodeToUnicode.cc has diverged quite a bit...