redstreet / beancount_reds_importers

Simple ingesting tools for Beancount (plain text, double entry accounting software). More importantly, a framework to allow you to easily write your own importers.
GNU General Public License v3.0
115 stars, 39 forks

pdfreader as a new file format reader #93

Closed: a0js closed 7 months ago

a0js commented 8 months ago

First, thank you so much for your blog, your importers, and everything else. I've been using beancount for 3 years now, and it's just gotten to be too much for me to input everything by hand. Your blog inspired me to try again and to be a lot smarter about it this time.

I had an idea for a pdfreader module. My thought was to use the pdfplumber package to extract tables from pdfs, and something like this filter to capture all the data outside of the tables as well. I'd also like to search the outside-of-table text for everything that looks like a date and return those matches in a list, and do the same for everything that looks like a currency amount. Perhaps we can put all that outside-of-table information into a metadata table and pass all of the data on, in a multi-table format, to downstream formatters.
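A minimal sketch of the outside-of-table idea described above, not an existing beancount_reds_importers module. It assumes pdfplumber's `find_tables()`, `Table.bbox`, `Page.filter()` and `extract_text()`; the function names (`is_outside`, `find_dates`, `find_amounts`, `read_metadata`) and the date and amount patterns are hypothetical and would need adjusting for each institution.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed date formats: 2024-01-31 or 01/31/2024. Adjust per institution.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")
# Assumed amount format: optional sign and $, optional thousands commas, two decimals.
AMOUNT_RE = re.compile(r"-?\$?(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}\b")


def is_outside(obj, bboxes):
    """True if the midpoint of a page object lies inside none of the table boxes.

    Each bbox is (x0, top, x1, bottom), the order pdfplumber uses.
    """
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return not any(
        x0 <= h_mid < x1 and top <= v_mid < bottom
        for (x0, top, x1, bottom) in bboxes
    )


def find_dates(text):
    """Return every substring that parses as a date in one of the assumed formats."""
    found = []
    for raw in DATE_RE.findall(text):
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                found.append(datetime.strptime(raw, fmt).date())
                break
            except ValueError:
                continue
    return found


def find_amounts(text):
    """Return every substring that looks like a currency amount, as Decimals."""
    return [
        Decimal(raw.replace("$", "").replace(",", ""))
        for raw in AMOUNT_RE.findall(text)
    ]


def read_metadata(pdf_path):
    """Collect dates and amounts from the text outside all tables in a PDF."""
    import pdfplumber  # third-party; imported lazily so the helpers above work without it

    dates, amounts = [], []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            bboxes = [table.bbox for table in page.find_tables()]
            outside = page.filter(lambda obj: is_outside(obj, bboxes))
            text = outside.extract_text() or ""
            dates.extend(find_dates(text))
            amounts.extend(find_amounts(text))
    return {"dates": dates, "amounts": amounts}
```

Keeping the regex helpers separate from the pdfplumber call means they can be exercised on plain strings before any real statement is involved.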

My primary concern right now is reading my paystub pdf, but I think it could also be useful for parsing pdf statements. Thoughts?

redstreet commented 8 months ago

Hi there, much appreciate the kind words, thank you! Glad it's inspiring you. Since I wrote all that up, I actually spend no more than a few seconds to do my imports :).

A pdfreader sounds great. However, I'd caution that extracting clean, usable, structured data from pdfs has been an elusive goal in my experience. In particular, though it's not too hard to do it for one or two specific institutions, it's very difficult to write a generic reader that works across any pdf thrown at it, if that's what you're suggesting. Take all this with a grain of salt though, as it's been a while since I looked into extracting from pdfs.

Also in my experience, pdftotext is surprisingly good at converting a pdf to text, which can then be parsed with standard tools (grep, awk, or a python function). As a practical matter, the amount of time spent writing, say, five individual importers that use pdftotext might be far less than writing one generic pdfreader and getting it to work for each case. This is exactly what I used to do for my paystubs, and it worked reasonably well.
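As a rough illustration of the pdftotext approach, here is a sketch under stated assumptions, not the code actually used for those paystubs. It assumes poppler's `pdftotext` is on the PATH (`-layout` keeps the column layout, and `-` as the output file writes to stdout). The "label, two or more spaces, amount" line format and the names `pdf_to_text` and `parse_paystub` are hypothetical.

```python
import re
import subprocess
from decimal import Decimal


def pdf_to_text(pdf_path):
    """Convert a PDF to text with poppler's pdftotext, preserving layout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical line format: a label, at least two spaces, then an amount
# such as 5,000.00 or -750.00 at the end of the line.
LINE_RE = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z .&/-]*?)\s{2,}(?P<amount>-?[\d,]+\.\d{2})\s*$"
)


def parse_paystub(text):
    """Map each 'label   amount' line to a Decimal; other lines are ignored."""
    items = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            amount = match.group("amount").replace(",", "")
            items[match.group("label")] = Decimal(amount)
    return items
```

An importer for one institution would then call `parse_paystub(pdf_to_text(path))` and map the labels it cares about to accounts.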

A few other thoughts:

All that said, pdfplumber seems interesting, and possibly a better way to extract than pdftotext, which might well be the optimal solution for pdfs. I hadn't heard of it until you mentioned it. I'd say feel free to experiment with it, and if it is able to extract data easily, that might indeed work well. Having a pdfreader extract just the tables would be a great start. Extracting things outside the tables can be done in individual importers, since that would vary widely from one importer to the next.
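A tables-only starting point might look like the sketch below. It is not part of beancount_reds_importers; it assumes pdfplumber's `Page.extract_tables()`, which returns each table as a list of rows of cell strings (or `None` for empty cells), and it assumes the first row of each table is a header row. The names `table_to_records` and `read_tables` are hypothetical.

```python
def table_to_records(rows):
    """Treat the first row as headers and return one dict per remaining row.

    Cells that are None (empty in the PDF) become empty strings.
    """
    if not rows:
        return []
    headers = [(cell or "").strip() for cell in rows[0]]
    return [
        dict(zip(headers, [(cell or "").strip() for cell in row]))
        for row in rows[1:]
    ]


def read_tables(pdf_path):
    """Return every table in the PDF as a list of row dicts, in page order."""
    import pdfplumber  # third-party; imported lazily so table_to_records works without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                tables.append(table_to_records(rows))
    return tables
```

Each importer could then pick the tables it needs by their header names and handle any outside-of-table text itself.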

Nice suggestion overall, I look forward to anything you might come up with, and do let me know if you have further thoughts!