kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

Need DPI option. #94

Open eighttails opened 4 years ago

eighttails commented 4 years ago

I want to generate annotated image files to train OCR.

wget https://ia800902.us.archive.org/14/items/arxiv-0704.0646/0704.0646.pdf
pdfalto 0704.0646.pdf 0704.0646.xml

The generated ALTO file reports the page WIDTH and HEIGHT as 612 and 792, so it assumes the DPI is always 72.

The PDF is vector-based and can be rendered at any DPI. I generated 300 dpi images from the PDF and I want the ALTO file at 300 dpi as well. Please consider adding a --dpi option to set the DPI manually.
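For reference, the conversion being requested is plain arithmetic: a PDF point is 1/72 inch, so a value in points maps to pixels as points / 72 * dpi. A minimal sketch follows. The helper name `points_to_pixels` is made up for illustration and is not part of pdfalto.

```python
def points_to_pixels(value_pt: float, dpi: float) -> float:
    """Convert a length in PDF points (1/72 inch) to pixels at the given DPI.

    Hypothetical helper for illustration; not a pdfalto function.
    """
    return value_pt / 72.0 * dpi


# Page size reported in the ALTO file above (US Letter, in points).
width_px = points_to_pixels(612, 300)
height_px = points_to_pixels(792, 300)
print(width_px, height_px)  # 2550.0 3300.0
```

At 300 dpi the 612 x 792 point page would therefore correspond to a 2550 x 3300 pixel image.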

kermitt2 commented 4 years ago

Hello @eighttails !

Thank you for the feature request.

Yes, for now we keep the PDF point values, which are "independent" of any resolution, in the ALTO file. My thinking was that, as with a PDF, the values could be scaled to any resolution, assuming that the tool consuming the ALTO file would scale them according to its needs.

But an ALTO file does normally have a "physical" measurement unit, so adding a --dpi option makes a lot of sense. We will try to add it in a future version.

giancarlobi commented 3 years ago

@kermitt2 I'd like to add a +1 for the --dpi option. In our deployment (see https://github.com/esmero/strawberryfield) we are using IAB with Solr highlighting (a great module, https://github.com/dbmdz/solr-ocrhighlighting). It would be really useful to have ALTO with dimensions scaled to pixels (i.e. points / 72 * dpi) to avoid a lot of calculation overhead when rendering. Anyway, thanks a lot again for this great pdfalto command!!
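Until such an option exists, one possible workaround is to rescale the ALTO output after the fact. The sketch below is not pdfalto functionality. It assumes that the geometry is carried by the standard ALTO attributes HPOS, VPOS, WIDTH and HEIGHT, and that every such value is in points. The function name `scale_alto` and the inline sample document are made up for illustration, and other size-like attributes (font sizes, for example) are left untouched.

```python
import xml.etree.ElementTree as ET

# Standard ALTO geometry attributes; assumed here to hold values in points.
GEOMETRY_ATTRS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")


def scale_alto(root: ET.Element, dpi: float) -> ET.Element:
    """Rescale geometry attributes in place from points to pixels at `dpi`.

    Hypothetical post-processing helper; not part of pdfalto.
    """
    factor = dpi / 72.0
    # Attributes are not namespaced, so this also works on namespaced ALTO files.
    for elem in root.iter():
        for name in GEOMETRY_ATTRS:
            if name in elem.attrib:
                elem.set(name, f"{float(elem.get(name)) * factor:.2f}")
    return root


# Tiny made-up sample standing in for a real pdfalto output file.
sample = (
    '<alto><Layout><Page WIDTH="612" HEIGHT="792">'
    '<String HPOS="72" VPOS="36" WIDTH="144" HEIGHT="12" CONTENT="x"/>'
    "</Page></Layout></alto>"
)
root = scale_alto(ET.fromstring(sample), 300)
page = root.find(".//Page")
print(page.get("WIDTH"), page.get("HEIGHT"))
```

For a real file, `ET.parse(path).getroot()` could replace the inline sample, and the result could be written back with `ET.ElementTree(root).write(out_path)`. If the file declares a measurement unit, that declaration would need to be checked and updated separately.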