Open eighttails opened 4 years ago
Hello @eighttails !
Thank you for the feature request.
Yes, for now we keep the PDF point values in the ALTO file, which are "independent" of any resolution. My thinking was that, as with a PDF, the values could be scaled to any resolution, assuming the tool consuming the ALTO file scales them according to its needs.
But an ALTO file normally does have a "physical" measurement unit, and adding a --dpi option makes a lot of sense. We will try to add it in a future version.
@kermitt2 I'd like to add a +1 for the -dpi option. In our deployment (see https://github.com/esmero/strawberryfield) we are using IAB with Solr highlighting (a great module, https://github.com/dbmdz/solr-ocrhighlighting), and it would be really useful to have ALTO with dimensions scaled to pixels (i.e. points / 72 * dpi) to avoid a lot of calculation overhead when rendering. Anyway, thanks a lot again for this great pdfalto command!!
I want to generate annotated image files to train OCR.
The generated ALTO file shows a page WIDTH and HEIGHT of 612 and 792, i.e. it assumes the DPI is always 72.
The PDF is vector-based and can be rendered at any DPI. I generated 300 dpi images from the PDF and I want the ALTO file to match them at 300 dpi. Please consider adding a --dpi option to set the DPI manually.
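
Until such an option exists, the points / 72 * dpi conversion mentioned in this thread could be applied as a post-processing step. The following is only a rough sketch, not part of pdfalto: the helper names `points_to_pixels` and `scale_alto` are hypothetical, and it assumes the geometry lives in the standard ALTO attributes HPOS, VPOS, WIDTH and HEIGHT, expressed in PDF points. Other point-based values (font sizes, for example) and the measurement unit declared in the file are left untouched.

```python
import xml.etree.ElementTree as ET

# ALTO geometry attributes assumed to hold PDF point values (1 pt = 1/72 inch).
GEOMETRY_ATTRS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")


def points_to_pixels(value, dpi):
    """Convert a length in PDF points to pixels at the given DPI."""
    return value / 72.0 * dpi


def scale_alto(alto_xml, dpi):
    """Return a copy of the ALTO XML string with geometry scaled to pixels.

    Hypothetical helper: walks every element, whatever its namespace, and
    rescales only the attributes listed in GEOMETRY_ATTRS.
    """
    root = ET.fromstring(alto_xml)
    for elem in root.iter():
        for name in GEOMETRY_ATTRS:
            if name in elem.attrib:
                scaled = points_to_pixels(float(elem.attrib[name]), dpi)
                elem.set(name, f"{scaled:.2f}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # A US Letter page of 612 x 792 points rendered at 300 dpi.
    print(points_to_pixels(612, 300), points_to_pixels(792, 300))
```

For the page in this issue, 612 x 792 points corresponds to 2550 x 3300 pixels at 300 dpi, which should match an image rendered from the same PDF at that resolution.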