kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0
206 stars, 67 forks

Is there an option to output ALTO XML to STDOUT? #143

Open Sukii opened 2 years ago

Sukii commented 2 years ago

I need it for a downstream XSLT pipeline: https://gitlab.coko.foundation/XSweet/XSweet/-/tree/pdf2html/applications/pdf2html

kermitt2 commented 2 years ago

Hello @Sukii !

There is currently no such option. The normal use case produces several files in addition to the ALTO document, to cover information in the PDF that cannot be encoded in ALTO (annotations, outline, ...), so I haven't planned to add it so far. I guess working with files is not a problem; would the interest of using pipes with stdout/stdin be to speed up the XSLT transformation a bit?

Sukii commented 2 years ago

Yes, not only for the speed improvement, but also because Linux pipes help send the output directly to web services, avoiding possible collisions, race conditions, etc. Of course, images and similar assets are better kept outside as binary files, so it may be necessary to write those to disk anyway.
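
Until such an option exists, one possible workaround is a small wrapper that lets the converter write into a private temporary directory and then copies the XML to stdout. This is only a minimal sketch, not part of pdfalto: the function name `alto_to_stream` and the `converter` parameter are made up for illustration, and it assumes the converter is invoked as `pdfalto <PDF-file> <xml-file>`.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def alto_to_stream(pdf_path, out, converter=("pdfalto",)):
    """Convert in a private temp directory, then copy the ALTO XML to `out`.

    `converter` is a hypothetical hook: the command prefix to run, assumed to
    accept `<PDF-file> <xml-file>` as its last two arguments.
    """
    # A fresh directory per call, so concurrent runs cannot overwrite each other.
    with tempfile.TemporaryDirectory(prefix="pdfalto-") as workdir:
        xml_path = Path(workdir) / "out.xml"
        subprocess.run(
            [*converter, str(pdf_path), str(xml_path)],
            check=True,                  # raise if the converter fails
            stdout=subprocess.DEVNULL,   # keep converter messages out of the stream
        )
        # Stream the XML as bytes to the binary output stream.
        with open(xml_path, "rb") as xml:
            shutil.copyfileobj(xml, out)
    # The temp directory, including any side files written there, is deleted here.


if __name__ == "__main__" and len(sys.argv) == 2:
    alto_to_stream(sys.argv[1], sys.stdout.buffer)
```

Saved under a hypothetical name such as `alto_stdout.py`, it could be used as `python alto_stdout.py input.pdf | <xslt step>`. The XML still touches the disk once, so this mainly addresses the collision concern rather than speed, and any images or other side files would need to be copied out before the temporary directory is removed.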

Sukii commented 2 years ago

https://gitlab.coko.foundation/XSweet/XSweet/-/tree/pdf2html/applications/pdf2html