Closed a0js closed 7 months ago
Hi there, much appreciate the kind words, thank you! Glad it's inspiring you. Since I wrote all that up, I actually spend no more than a few seconds on my imports :).
A `pdfreader` sounds great. However, I'd caution that extracting clean, usable, structured data from PDFs has been an elusive goal in my experience. In particular, though it's not too hard to do for one or two specific institutions, it's very difficult to write a generic reader that works on any PDF thrown at it, if that's what you're suggesting. Take all this with a grain of salt though, as it's been a while since I looked into extracting from PDFs.
Also in my experience, `pdftotext` is surprisingly good at converting a PDF to text, which can then be parsed with standard tools (grep, awk, or a Python function). As a practical matter, the time spent writing, say, five individual importers that use `pdftotext` might be far less than writing one generic `pdfreader` and getting it to work for each case. This is exactly what I used to do for my paystubs, and it worked reasonably well.
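To make that concrete, here's a minimal sketch of the `pdftotext`-plus-Python approach. The `Gross Pay    5,000.00` line format and the regex are hypothetical examples of a single institution's layout, not a general solution:

```python
import re
import subprocess

def pdf_to_text(path):
    """Convert a PDF to plain text with pdftotext (from poppler-utils).

    The -layout flag preserves the visual column layout, which keeps
    table rows on single lines so they can be matched with a regex.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical paystub line format: "Gross Pay    5,000.00"
LINE_RE = re.compile(r"^(?P<label>[A-Za-z ]+?)\s{2,}(?P<amount>[\d,]+\.\d{2})\s*$")

def parse_amounts(text):
    """Pull (label, amount) pairs out of pdftotext output."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m.group("label"), m.group("amount").replace(",", "")))
    return rows
```

Each institution would get its own small regex like this, which is tedious but predictable.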
A few other thoughts:
All that said, pdfplumber seems interesting and possibly a better extractor than `pdftotext`; it might well be the optimal solution for PDFs. I hadn't heard of it until you mentioned it. Feel free to experiment with it, and if it can indeed extract data easily, that could work well. Having a `pdfreader` extract just the tables would be a great start. Extracting things outside the tables can be left to individual importers, since that would vary widely among them.
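The table-extraction half of that might look something like the sketch below. The `pdfplumber.open` / `page.extract_tables()` calls are pdfplumber's documented API; the `clean_table` normalization is my own assumption about what downstream importers would want:

```python
def clean_table(table):
    """Normalize one pdfplumber table: pdfplumber returns rows of
    cells where empty cells come back as None, so replace None with "",
    strip whitespace, and drop fully empty rows."""
    cleaned = []
    for row in table:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned

def extract_tables(path):
    """Pull every table from every page of a PDF.

    Requires the third-party pdfplumber package (pip install pdfplumber).
    """
    import pdfplumber
    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(clean_table(table))
    return tables
```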
Nice suggestion overall, I look forward to anything you might come up with, and do let me know if you have further thoughts!
First, thank you so much for your blog, your importers, and everything. I've been using beancount for 3 years now, and it's just gotten to be too much for me to input everything manually. Your blog inspired me to try again and to be a lot smarter about it this time.
I had an idea for a `pdfreader` module. My thought was to use the pdfplumber package to extract tables from PDFs, and something like this filter to capture all the data outside of the tables as well. I'd also like to search the outside-of-table text for everything that looks like a date and return those in a list, and likewise everything that looks like a currency amount. Perhaps we can put all that outside-of-table information into a metadata table and pass all of the data on in a multi-table format to downstream formatters. My primary concern right now is reading my paystub PDF, but I think it could also be useful for parsing PDF statements. Thoughts?