invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License
1.79k stars 476 forks

Can we define the layout to be used by pdftotext while converting the pdf to text file? #205

Open gtambi143 opened 5 years ago

gtambi143 commented 5 years ago

If I use plain pdftotext, it has an option "-simple" that converts a PDF on the assumption that it has a single column. But if I use invoice2data, it converts the PDF assuming it has multiple columns. Is there any option for this? Basically, I am looking for the invoice2data equivalent of the following command: pdftotext -simple file.pdf

m3nu commented 5 years ago

You can add a new argument to the pdf2text wrapper here.

sudeepjd commented 4 years ago

I found this to be a problem for me as well. The pdftotext conversion currently hardcodes -layout as the conversion method. In some cases, I find -table to be better suited for the conversion. I think this is also related to #108.

m3nu commented 4 years ago

Yes. It's always useful to have this as an option. Just keep the current setting as the default.
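
A minimal sketch of how such an option might look, keeping -layout as the default. The helper names `build_pdftotext_cmd` and `to_text_with_layout` and the `layout` parameter are hypothetical and not part of invoice2data's actual API; the `-enc UTF-8` flag is also an assumption added so the output can be decoded reliably.

```python
import shutil
import subprocess

# Layout names mapped to pdftotext flags. "-layout" is the setting the
# thread says is currently hardcoded; "-simple" and "-table" are the
# flags requested in this issue and may not exist in every pdftotext build.
LAYOUT_FLAGS = {
    "layout": "-layout",
    "raw": "-raw",
    "simple": "-simple",
    "table": "-table",
}


def build_pdftotext_cmd(path, layout="layout"):
    """Build the pdftotext argument list (hypothetical helper)."""
    if layout not in LAYOUT_FLAGS:
        raise ValueError("unknown layout: %r" % (layout,))
    # The trailing "-" asks pdftotext to write the text to stdout.
    return ["pdftotext", LAYOUT_FLAGS[layout], "-enc", "UTF-8", path, "-"]


def to_text_with_layout(path, layout="layout"):
    """Run pdftotext with the chosen layout and return the decoded text."""
    if shutil.which("pdftotext") is None:
        raise EnvironmentError("pdftotext is not installed or not on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(path, layout),
        stdout=subprocess.PIPE,
        check=True,  # raises CalledProcessError if the flag is unsupported
    )
    return result.stdout.decode("utf-8")
```

Calling `to_text_with_layout("file.pdf", layout="simple")` would then correspond to `pdftotext -simple file.pdf`, while callers that pass no `layout` keep the existing behaviour.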

rmilecki commented 1 year ago

The problem I see is that poppler's pdftotext doesn't support -simple or -table. Support for those layouts was added to Xpdf after the poppler project forked it.

Today most Linux distributions have switched to poppler to provide pdftotext and similar tools. Ideally, support for those extra layouts should be ported from Xpdf to poppler, but that seems like a huge task. I created a request issue in the poppler project to see if there is any idea or interest around that: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1419
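
Since the available layout flags depend on which pdftotext is installed, a wrapper could check the tool's version banner before offering -simple or -table. The sketch below is only an illustration: the function names are hypothetical, and the substrings it looks for ("poppler", "xpdf", "glyph & cog") are assumptions about what `pdftotext -v` prints, so they should be verified against the builds you actually target.

```python
import subprocess


def classify_pdftotext(banner):
    """Guess the pdftotext flavor from its version banner text.

    The marker substrings are assumptions, not a documented contract.
    """
    text = banner.lower()
    if "poppler" in text:
        return "poppler"
    if "xpdf" in text or "glyph & cog" in text:
        return "xpdf"
    return "unknown"


def installed_pdftotext_flavor():
    """Run `pdftotext -v` and classify whatever it prints."""
    try:
        proc = subprocess.run(
            ["pdftotext", "-v"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return "missing"
    # The banner may go to stdout or stderr, so inspect both.
    return classify_pdftotext(proc.stdout + proc.stderr)
```

A caller could then allow the extra layouts only when `installed_pdftotext_flavor()` returns `"xpdf"` and fall back to -layout otherwise.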