jbms / beancount-import

Web UI for semi-automatically importing external data into beancount
GNU General Public License v2.0
393 stars 103 forks source link

Import from bean-extract / beancount.ingest? #23

Closed wadabum closed 4 years ago

wadabum commented 5 years ago

So, I´ve spend hours writing a bean-extract compatible parser using beancount.ingest.importer.ImporterProtocol

Can one somehow string those things together?


something along the lines of the usual:

# beancount.config
from importers import my_importer
configured_my_importer = my_importer.Importer(
  currency='USD',
)
CONFIG = [
  configured_my_importer,
]
# ....
# ....
# ....
    data_sources = [
        dict(
            module='beancount_import.source.beancount-ingest',
            class=configured_my_importer,
        )
# or
    data_sources = [
        dict(
            module='beancount_import.source.beancount-extract',
            config='path/to/beancount.config', # parse&read CONFIG from there
            dirs='documents/',
        )

or however that´s gonna map out... ;-)

(think that could proxy-close #18 as well...)

jbms commented 5 years ago

There isn't any existing way to do this, but some sort of convergence with beancount.ingest is something I'm very interested in exploring. However, there are some challenges to overcome:

  1. beancount.ingest is based on generating partial transactions (that don't necessarily balance) with missing postings. beancount-import requires full transactions, with Expenses:FIXME postings to indicate unknown accounts (a single Expenses:FIXME posting is allowed to correspond to multiple unknown postings). This is simple to handle, though --- just generate Expenses:FIXME postings to balance the transaction.

  2. With just step 1, you could have a partial integration with beancount-import, but to get the automatic tracking of which transactions have already been imported, and automatic classification of unknown accounts, the beancount-import data source needs to provide more information than is available in the beancount.ingest model.

dumbPy commented 4 years ago

How about exporting this required info in meta? I am currently exporting transaction ref/ cheque No. and jumbled description into meta. ML models that can use this to predict postings and use your awesome UI for manual corrections would be great. currently exploring smart_importer but it doesn't have a way to fix interactively like this project. I too want this to work.

jbms commented 4 years ago

If you specified which metadata fields indicate a unique transaction id, and which metadata fields to use for automatic classification, then I think it would work.

dumbPy commented 4 years ago

Do you have any plans for defining API for the UI part?

I am currently using smart_importer and I think having an interactive UI makes a lot of difference. Having a clearly defined API would mean making UI compatible with all the future ML models. My statements have a lot of numbers and plain text and comments and I would surely love to try some simple NLP models like Spacy's classifier.

I personally loved the idea of being able to make some manual entries and later merge them effortlessly with the statement.

jbms commented 4 years ago

I guess you want to use the beancount-import UI with the account prediction provided by smart_importer and ingesters defined using the benacount.ingest framework?

I don't have plans to try to make that work myself, but if you're interested in making that happen, it does sound like it could be very useful, and I'm happy to offer advice and would be interested in merging it into beancount-import, for example.

The datasource API provided by beancount-import is already sufficient, I think, for integrating with beancount.ingest. Currently there is no API for plugging in a different account prediction mechanism, but that could be added.

Or maybe it would make sense for beancount-import to merge into smart_importer entirely, but not sure if the smart_importer authors are interested.

Zburatorul commented 4 years ago

Correct me if I'm wrong, but it sounds like the stumbling block of this program (ingester->smart_importer->beancount-import UI) is that the prediction/ML part needs to be factored out in beancount-import? Is there any other issue related to matching existing and currently imported transactions? It looks to me like smart_importer is capable of detecting duplication.

Is it possible then to get the smart_importer to do all the deduplication and account prediction, while simply turning off this functionality in beancount-import?

Your posts from two years ago seem relevant: https://www.mail-archive.com/beancount@googlegroups.com/msg01628.html https://www.mail-archive.com/beancount@googlegroups.com/msg01630.html

jbms commented 4 years ago

Currently beancount-import relies on an account name starting with Expenses:FIXME to indicate an "unknown" account, which has 3 effects:

  1. when matching to existing transactions in the journal, such an account can match any account name
  2. automatic prediction may be done
  3. the user may manually enter the correct account name in the UI

Individual beancount-import datasources are already responsible for doing deduplication, so it is no problem for smart_importer to do that. To let smart_importer perform the automatic prediction, while still preserving effects (1) and (3), beancount-import could be modified to support an extra metadata key on postings that would indicate the predicted account while still allowing the account to be considered "unknown".

This would be a pretty simple change to make, e.g. in _get_unknown_account_predictions in reconcile.py.

Zburatorul commented 4 years ago

@dumbPy, are you interested in trying to get your smart_importer to work with beancount-import along the lines of #60 ? You'd have to make it add two meta field to each predicted posting.

addisonklinke commented 1 year ago

@jbms Individual beancount-import datasources are already responsible for doing deduplication

Is this still accurate?

If I understand correctly, that means an ImporterProtocol subclass needs to confirm each transaction it generates isn't a duplicate of an existing entry in the transactions.beancount file?

jbms commented 1 year ago

The generic importer handles deduplication based on (account, date, units, narration), so individual ImporterProtocol subclasses should not need to do that.

https://github.com/jbms/beancount-import/blob/b8a78069f515519e27b42450cb525278a50b12bb/beancount_import/source/generic_importer_source.py#L122

addisonklinke commented 1 year ago

Hmm is there an option (aside from subclassing the source) to avoid requiring the narration match? Usually I tidy these up in the import UI, so what's in the transactions file is more concise and/or better formatted than the raw importer output

jbms commented 1 year ago

There isn't currently any such option, but I agree that matching on narration is problematic. Potentially you could match on just date, account, and amount alone. But it would be better to store the original narration as a metadata field, so that it can be used for training and for deduplication

addisonklinke commented 1 year ago

Yes currently I'm storing the original in the source_desc metadata field. I think it'd make more sense to use that as part of the deduplication key than the final narration