pdfreader only works with the paycheck transaction builder

gary-roach commented 1 week ago

I have a need to get the pdfreader to work as a reader to read in transactions. I have an off brand credit card that only exports the transactions on the credit card statement as a pdf and there's no other export option.

The pdfreader was built around reading paychecks, but not investment, banking or other import modules. This seems counter intuitive to the pattern of the readers. The rest of the framework is built around having a reader which is transaction agnostic to read the data and combining it with a transaction builder to process the specific transaction type.

Python isn't my daily driver, but I'm trying to take a stab at correcting this. I'm not entirely sure how this should be structured to correct this. Is there an example of a reader that works with all of the transaction builders?

redstreet commented 1 week ago

The pdfreader should be transaction agnostic like the other readers. I think there are perhaps a few things that might be confusing:

its output is self.alltables, though is it based on csvreader. It should really be based on csv_multitable_reader
there is a debug section which uses paycheck_template, but this is just for debugging
other readers implement get_transactions(). Since pdfreader is similar to csv_multitable_reader in that it parses and reads multiple tables, get_transactions() doesn't make sense for it, which means it is best implemented inside the specific importer

You should be able to use it as it is right now just like the other readers apart from the exceptions above. Does that help?

@a0js, feel free to weigh in.

gary-roach commented 1 week ago

Thanks for you reply @redstreet . For my specific case, I actually only have one final table that needs to be imported as it's a list of transactions from a pdf.

The main problem is that the banking transaction builder makes a call to get_transactions() and the pdfreader never overwrote that which is fine, but read_file() also never set the value of self.rdr (which expects a single table) so it fails when the banking transaction builder calls it. If the intention is for the importer to override this class the pdfreader should probably define an empty version that throws a not implemented exception.

The other issue I had is that I needed a dynamic crop position for each page to captures the tables that I needed (There was multiple transaction tables in the pdf but they all had the same headers). This means I had to override read_file() which is really the bulk of the pdfreader class. Since I was overriding this anyways, I also took the opportunity to convert the multiple tables into a single table and set the value of self.rdr. This worked, but I had to rewrite the entire pdfreader - I ended up just making a reader specific to my institution, but I don't love that it's not reusable .

To be able to reuse the pdfreader effectively maybe it could benefit from these change:

The original version is dealing with multiple tables, so it should implement csv_multitable_reader like you suggested. Should this be renamed to pdf_multitable_reader?
Does that mean we need a separate single table version pdfreader that is based on csvreader?
The read_file() method is doing too much in a single method which means if something needs to change the entire thing has to be overwritten. This should probably be broken up into smaller more single purpose methods that can each be overwritten individually based on the institution specific needs.
- get_adjsuted_crop(page_index, page)# returns the adjusted crop coordinate's for the page
- extract_tables(page_index, page) # extracts the tables from the page
- extract_metadata(page_index, page) # extracts the data outside of tables
- attach_section_headers(tables) # attaches section headers to each table
- find_and_fix_broken_tables(tables) # fixes any tables broken up by page breaks
- prepare_tables() # this already exists- make any changes to tables before passing to transaction builder
- debug() # creates all of the debug files and images

Let me know what you think. I'd be happy to help implement these changes.

Also, this may be unrelated to the pdfreader itself, but the importer that I wrote doesn't seem to work with Fava. It works fine with Bean-Identify and Bean-Extract. I created an issue for it on the Fava project, but it may be related to something missing in the importer/reader. I'm not sure how to go about troubleshooting that in Fava. Any advice there would be appreciated.

redstreet commented 1 week ago

The main problem is that the banking transaction builder makes a call to get_transactions() and the pdfreader never overwrote that which is fine, but read_file() also never set the value of self.rdr (which expects a single table) so it fails when the banking transaction builder calls it. If the intention is for the importer to override this class the pdfreader should probably define an empty version that throws a not implemented exception.

Agreed, done in 3cab851

The original version is dealing with multiple tables, so it should implement csv_multitable_reader like you suggested. Should this be renamed to pdf_multitable_reader?

Makes sense. Reg the multiple vs single tables issue: single_table_readers (like csvreader) are simply instances of multitable readers with a single table. Eventually, csvreader and csv_multitable_reader should me merged. So IMO, it'd be best for the pdfreader continue to support multiple tables, and in your case, you'll simply read the first and only table in a list of length one.

The read_file() method is doing too much in a single method which means if something needs to change the entire thing has to be overwritten. This should probably be broken up into smaller more single purpose methods that can each be overwritten individually based on the institution specific needs.

Agree, and I like your refactoring plan, and I'd be happy to review a PR for it.

Refactoring like you broke it down is just good software engineering, and that'd be great. However, the genericpdf reader's read_file() should support single tables equally well as multiple tables. Pushing the reading complexity into the reader and making one awesome pdfreader is what would allow the rest of the importers to remain simple, not have to worry about reimplementing reading from a pdf or overriding read_file(). So I'd see this as the important part to address first. Thoughts?

Also, this may be unrelated to the pdfreader itself, but the importer that I wrote doesn't seem to work with Fava. It works fine with Bean-Identify and Bean-Extract. I created an issue for it on the Fava project, but it may be related to something missing in the importer/reader. I'm not sure how to go about troubleshooting that in Fava. Any advice there would be appreciated.

Hmm, unfortunately, I've never used importers with Fava and don't know much about it. However, I'd start debugging by git-grepping for that error you saw "Loading import failed with error" in the Fava code base, and seeing how it calls the importer and what it expects. Happy to help as you continue to debug if needed.

gary-roach commented 6 days ago

Refactoring like you broke it down is just good software engineering, and that'd be great. However, the genericpdf reader's read_file() should support single tables equally well as multiple tables. Pushing the reading complexity into the reader and making one awesome pdfreader is what would allow the rest of the importers to remain simple, not have to worry about reimplementing reading from a pdf or overriding read_file(). So I'd see this as the important part to address first. Thoughts?

That makes sense. Multiple tables in which the headers are the same (or a split because of a page break) can be combined into one table and tables that have different headers can be treated as extra tables and processed as such by csv_multitable_reader. Let me take a stab at this.

a0js commented 6 days ago

I'd be happy to review the refactor also. It certainly could be decomposed to make it more extensible.

gary-roach commented 6 days ago

@redstreet @a0js Do either of you have a preference on where the importer example / tests should be placed? I would like to add an example pdf importer that works for transactions.

My first thought was to place it under the importers\genericpdf folder, but since the underlying __int__.py is a paycheck importer this is really a generic_pdf_paycheck importer which makes the transaction one feel a bit out of place.

Would it make more sense to put the example importer and tests under the institution that spawned this use case: importers\mercurycards?

redstreet commented 6 days ago

@redstreet @a0js Do either of you have a preference on where the importer example / tests should be placed? I would like to add an example pdf importer that works for transactions.

My first thought was to place it under the importers\genericpdf folder, but since the underlying __int__.py is a paycheck importer this is really a generic_pdf_paycheck importer which makes the transaction one feel a bit out of place.

Would it make more sense to put the example importer and tests under the institution that spawned this use case: importers\mercurycards?

Yes, that would be the preferred place for importer specific tests.

genericpdf is an exception, as it is a template-importer for users to build their own importer from. Thanks for bringing this up, this made me realize genericpdf is probably best renamed to genericpdfpaycheck. Will do in a minute.

gary-roach commented 1 day ago

@redstreet @a0js I've submitted a PR for these changes in #113 . Please let me know your thoughts.

Also, sorry about the multiple PRs. I had to run the formatters from the quality gate, but the gates didn't re-run when I pushed the additional commit. Not sure what I was doing wrong there.

redstreet / beancount_reds_importers

pdfreader only works with the paycheck transaction builder #109