pdfreader as a new file format reader

First, thank you so much for your blog, your importers, and everything else. I've been using beancount for 3 years now, and it's just gotten to be too much for me to input everything by hand. Your blog inspired me to try again and to be a lot smarter about it this time.

I had an idea for a pdfreader module. My thought was to use the pdfplumber package to extract tables from pdfs, and something like this filter to capture all the data outside of the tables as well. I'd also like to search the outside-of-table text for everything that looks like a date and return those in a list, and likewise return a list of everything that looks like a currency amount. Perhaps we can put all that outside-of-table information into a metadata table and pass all of the data on in a multi-table format to downstream formatters.
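To make the outside-of-table scan concrete, here is a rough sketch of the date and amount search using only the standard library. The two patterns are assumptions (ISO or US-style dates, amounts with two decimals and an optional `$`), and `scan_metadata` is a hypothetical helper name, not something that exists in the importers today.

```python
import re

# Assumed date formats: ISO (2024-01-19) and US-style (01/19/2024).
DATE_RE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")

# Assumed amount format: optional sign and "$", optional thousands
# separators, exactly two decimal places.
AMOUNT_RE = re.compile(
    r"(?<![\d.,/])-?\$?(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}(?!\d)"
)


def scan_metadata(text):
    """Return everything in `text` that looks like a date or an amount."""
    return {
        "dates": DATE_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Pay period: 01/01/2024 - 01/15/2024  Net pay: $1,234.56"
    print(scan_metadata(sample))
```

The text passed in would be whatever is left on the page once the table regions are excluded. Real statements will likely need more patterns than these two.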

My primary concern right now is reading my paystub pdf, but I think it could also be useful for parsing pdf statements. Thoughts?

Hi there, much appreciate the kind words, thank you! Glad it's inspiring you. Since I wrote all that up, I actually spend no more than a few seconds to do my imports :).

A pdfreader sounds great. However, I'd caution that extracting clean, usable, structured data from pdfs has been an elusive goal in my experience. In particular, though it's not too hard to do it for one or two specific institutions, it's very difficult to write a generic reader that works across any pdf thrown at it, if that's what you're suggesting. Take all this with a grain of salt though, as it's been a while since I looked into extracting from pdfs.

Also in my experience, pdftotext is surprisingly good at converting a pdf to text, which can then be parsed with standard tools (grep, awk, or a python function). As a practical matter, the time spent writing, say, five individual importers that use pdftotext might be far less than writing one generic pdfreader and getting it to work for each case. This is exactly what I used to do for my paystubs, and it worked reasonably well.
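As an illustration of the pdftotext route, here is a minimal sketch, not the code I actually used. It assumes pdftotext (from poppler-utils) is on the PATH, and that the paystub has lines of the form "label, two or more spaces, amount"; that layout, and the names `pdf_to_text`, `parse_paystub`, and `LINE_RE`, are made up for the example.

```python
import re
import subprocess
from decimal import Decimal


def pdf_to_text(path):
    """Run pdftotext and return its output; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Assumed line layout: a label, at least two spaces, then an amount.
LINE_RE = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z ]*?)\s{2,}(?P<amount>-?[\d,]+\.\d{2})\s*$"
)


def parse_paystub(text):
    """Map each matching label to its amount; other lines are skipped."""
    items = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            amount = match.group("amount").replace(",", "")
            items[match.group("label")] = Decimal(amount)
    return items
```

An importer for one institution would call `parse_paystub(pdf_to_text("paystub.pdf"))` and build its postings from the resulting dictionary. The regular expression is the part that would differ from one institution to the next.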

A few other thoughts:

The amount of effort one puts into pdf extractors is only justified if one has a lot of institutions that provide data exclusively via pdfs, and not via csv, ofx, etc. Hopefully that is changing, and most institutions around the world should be providing at least csvs, if not xml, json, ofx, etc.
Generative AI might get extremely good at extracting data out of pdfs in the near future, and may already be. Leveraging that in your importer might be an idea.

All that said, pdfplumber seems interesting, and possibly a better way to extract than pdftotext; it might well turn out to be the optimal solution for pdfs. I hadn't heard of it until you mentioned it. Feel free to experiment with it, and if it is indeed able to extract data easily, that could work well. Having a pdfreader extract just the tables would be a great start. Extracting things outside the tables can be done in individual importers, since that would vary widely from one importer to the next.
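For the tables-only starting point, something along these lines might be enough to experiment with. It is a sketch under assumptions: pdfplumber is installed, the first row of each table is its header, and `read_tables` and `rows_to_records` are hypothetical names rather than part of any existing reader.

```python
def rows_to_records(table):
    """Convert one table (list of rows, header first) into a list of dicts."""
    header, *rows = table
    # pdfplumber returns None for empty cells, so normalize to "".
    keys = [(cell or "").strip() for cell in header]
    return [
        dict(zip(keys, [(cell or "").strip() for cell in row]))
        for row in rows
    ]


def read_tables(path):
    """Return every table found in the pdf, each as a list of row dicts."""
    import pdfplumber  # third-party; imported here so the helper above has no dependency

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                if table:  # skip empty results
                    tables.append(rows_to_records(table))
    return tables
```

Each importer could then pick the tables it cares about by header names, and handle anything outside the tables on its own.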

Nice suggestion overall, I look forward to anything you might come up with, and do let me know if you have further thoughts!

redstreet / beancount_reds_importers

pdfreader as a new file format reader #93