Open gary-roach opened 1 week ago
The pdfreader should be transaction agnostic like the other readers. I think there are perhaps a few things that might be confusing:
self.alltables
, though is it based on csvreader
. It should really be based on csv_multitable_reader
debug
section which uses paycheck_template
, but this is just for debuggingget_transactions()
. Since pdfreader is similar to csv_multitable_reader
in that it parses and reads multiple tables, get_transactions()
doesn't make sense for it, which means it is best implemented inside the specific importerYou should be able to use it as it is right now just like the other readers apart from the exceptions above. Does that help?
@a0js, feel free to weigh in.
Thanks for you reply @redstreet . For my specific case, I actually only have one final table that needs to be imported as it's a list of transactions from a pdf.
The main problem is that the banking transaction builder makes a call to get_transactions()
and the pdfreader
never overwrote that which is fine, but read_file()
also never set the value of self.rdr
(which expects a single table) so it fails when the banking transaction builder calls it. If the intention is for the importer to override this class the pdfreader
should probably define an empty version that throws a not implemented exception.
The other issue I had is that I needed a dynamic crop position for each page to captures the tables that I needed (There was multiple transaction tables in the pdf but they all had the same headers). This means I had to override read_file()
which is really the bulk of the pdfreader
class. Since I was overriding this anyways, I also took the opportunity to convert the multiple tables into a single table and set the value of self.rdr
. This worked, but I had to rewrite the entire pdfreader
- I ended up just making a reader specific to my institution, but I don't love that it's not reusable .
To be able to reuse the pdfreader
effectively maybe it could benefit from these change:
csv_multitable_reader
like you suggested. Should this be renamed to pdf_multitable_reader
?pdfreader
that is based on csvreader
?read_file()
method is doing too much in a single method which means if something needs to change the entire thing has to be overwritten. This should probably be broken up into smaller more single purpose methods that can each be overwritten individually based on the institution specific needs.
get_adjsuted_crop(page_index, page)
# returns the adjusted crop coordinate's for the pageextract_tables(page_index, page)
# extracts the tables from the pageextract_metadata(page_index, page)
# extracts the data outside of tables attach_section_headers(tables)
# attaches section headers to each tablefind_and_fix_broken_tables(tables)
# fixes any tables broken up by page breaksprepare_tables()
# this already exists- make any changes to tables before passing to transaction builderdebug()
# creates all of the debug files and imagesLet me know what you think. I'd be happy to help implement these changes.
Also, this may be unrelated to the pdfreader
itself, but the importer that I wrote doesn't seem to work with Fava. It works fine with Bean-Identify and Bean-Extract. I created an issue for it on the Fava project, but it may be related to something missing in the importer/reader. I'm not sure how to go about troubleshooting that in Fava. Any advice there would be appreciated.
The main problem is that the banking transaction builder makes a call to
get_transactions()
and thepdfreader
never overwrote that which is fine, butread_file()
also never set the value ofself.rdr
(which expects a single table) so it fails when the banking transaction builder calls it. If the intention is for the importer to override this class thepdfreader
should probably define an empty version that throws a not implemented exception.
Agreed, done in 3cab851
- The original version is dealing with multiple tables, so it should implement
csv_multitable_reader
like you suggested. Should this be renamed topdf_multitable_reader
?
Makes sense. Reg the multiple vs single tables issue: single_table_readers (like csvreader) are simply instances of multitable readers with a single table. Eventually, csvreader and csv_multitable_reader should me merged. So IMO, it'd be best for the pdfreader continue to support multiple tables, and in your case, you'll simply read the first and only table in a list of length one.
- The
read_file()
method is doing too much in a single method which means if something needs to change the entire thing has to be overwritten. This should probably be broken up into smaller more single purpose methods that can each be overwritten individually based on the institution specific needs.
Agree, and I like your refactoring plan, and I'd be happy to review a PR for it.
Refactoring like you broke it down is just good software engineering, and that'd be great. However, the genericpdf reader's read_file()
should support single tables equally well as multiple tables. Pushing the reading complexity into the reader and making one awesome pdfreader is what would allow the rest of the importers to remain simple, not have to worry about reimplementing reading from a pdf or overriding read_file()
. So I'd see this as the important part to address first. Thoughts?
Also, this may be unrelated to the
pdfreader
itself, but the importer that I wrote doesn't seem to work with Fava. It works fine with Bean-Identify and Bean-Extract. I created an issue for it on the Fava project, but it may be related to something missing in the importer/reader. I'm not sure how to go about troubleshooting that in Fava. Any advice there would be appreciated.
Hmm, unfortunately, I've never used importers with Fava and don't know much about it. However, I'd start debugging by git-grepping for that error you saw "Loading import failed with error" in the Fava code base, and seeing how it calls the importer and what it expects. Happy to help as you continue to debug if needed.
Refactoring like you broke it down is just good software engineering, and that'd be great. However, the genericpdf reader's
read_file()
should support single tables equally well as multiple tables. Pushing the reading complexity into the reader and making one awesome pdfreader is what would allow the rest of the importers to remain simple, not have to worry about reimplementing reading from a pdf or overridingread_file()
. So I'd see this as the important part to address first. Thoughts?
That makes sense. Multiple tables in which the headers are the same (or a split because of a page break) can be combined into one table and tables that have different headers can be treated as extra tables and processed as such by csv_multitable_reader
.
Let me take a stab at this.
I'd be happy to review the refactor also. It certainly could be decomposed to make it more extensible.
@redstreet @a0js Do either of you have a preference on where the importer example / tests should be placed? I would like to add an example pdf importer that works for transactions.
My first thought was to place it under the importers\genericpdf
folder, but since the underlying __int__.py
is a paycheck importer this is really a generic_pdf_paycheck
importer which makes the transaction one feel a bit out of place.
Would it make more sense to put the example importer and tests under the institution that spawned this use case: importers\mercurycards
?
@redstreet @a0js Do either of you have a preference on where the importer example / tests should be placed? I would like to add an example pdf importer that works for transactions.
My first thought was to place it under the
importers\genericpdf
folder, but since the underlying__int__.py
is a paycheck importer this is really ageneric_pdf_paycheck
importer which makes the transaction one feel a bit out of place.Would it make more sense to put the example importer and tests under the institution that spawned this use case:
importers\mercurycards
?
Yes, that would be the preferred place for importer specific tests.
genericpdf
is an exception, as it is a template-importer for users to build their own importer from. Thanks for bringing this up, this made me realize genericpdf
is probably best renamed to genericpdfpaycheck
. Will do in a minute.
@redstreet @a0js I've submitted a PR for these changes in #113 . Please let me know your thoughts.
Also, sorry about the multiple PRs. I had to run the formatters from the quality gate, but the gates didn't re-run when I pushed the additional commit. Not sure what I was doing wrong there.
I have a need to get the pdfreader to work as a reader to read in transactions. I have an off brand credit card that only exports the transactions on the credit card statement as a pdf and there's no other export option.
The pdfreader was built around reading paychecks, but not investment, banking or other import modules. This seems counter intuitive to the pattern of the readers. The rest of the framework is built around having a reader which is transaction agnostic to read the data and combining it with a transaction builder to process the specific transaction type.
Python isn't my daily driver, but I'm trying to take a stab at correcting this. I'm not entirely sure how this should be structured to correct this. Is there an example of a reader that works with all of the transaction builders?