railslove / cmxl

your friendly MT940 SWIFT file parser for bank statements
http://railslove.com
MIT License
46 stars 25 forks source link

Question/Discussion: Use of sha value for transactions #37

Open TobiTobiM opened 5 years ago

TobiTobiM commented 5 years ago

I used the value given by the sha method in statement and transaction to find them in a database. This works fine for me but some weeks before i thought i lost some transactions in my database. In fact they were all there but the sha hashes were the same. 2 cases

First case: Debit transfer with same day, same amount, same receiver account only transaction information differs (invoice number) Sha hash is the same because all fields in :61 are identical. The difference is in :86 My quick fix is i build my own sha from :61 and information from :86.

Second case: Credit transfer all values identical. Sender made accidently same transaction twice the same day. This case is really rare but happend in real world. My fix for this i built also my own hash and add a increment to raw transaction data (source).

So my questions to discuss:

Are the hashes meant to be used to identify transactions?

If yes should cmxl handle these rare cases or should the piece of software which uses cmxl handle this?

bumi commented 5 years ago

Thanks for raising this. And yes, that is confusing... A hash indicates that it is an unique identifier and can be reliably used to identify a transaction and that no different transactions have the same hash. And I have also been using it similarly.

But this is more of an indication and can not reliably be used for that. - it is more a line identifier I guess..

I am not sure what's the best option is. Your first case could be solved by not only creating the sha of the actual line content but also the details

https://github.com/railslove/cmxl/blob/master/lib/cmxl/fields/transaction.rb#L10 https://github.com/railslove/cmxl/blob/master/lib/cmxl/fields/transaction.rb#L14

what are you thoughts?

On the your second case: this is probably a harder issue as we do not have any state to have a counter/nonce or something there. Sadly the MT format does not have reliable unique identifiers. hmm?

TobiTobiM commented 5 years ago

I think your solution to fix first case is fine. This is more or less what I did in my application. The fix should be easy because like you mentioned transaction already has details included.

If you want i can do a PR for that issue.

Indeed second case is hard to solve in cmxl. Maybe it is ok adding a warning to readme that this case has to be handled by the application which uses cmxl. If the application is reading through transactions it's easy to add a counter there. But the developer has to know the case.

grncdr commented 4 years ago

what are you thoughts?

Even though I wasn't asked, I would like to contribute some input on this topic, as it's something I've been spending quite some time on in the last couple of days.

First, as @TobiTobiM (and myself, and I'm sure others...) has learned the hard way, the only unique ID for a transaction is it's position within a statement. Everything else can (and therefore eventually will) have a duplicate.

IMO, this library should:

  1. Generate a reliable ID for each line in Statement#parse! and either pass it to Field.parse or assign it to an attribute of the parsed field (e.g. field.cmxl_id = "#{sha}_#{line_idx}").
  2. remove the Transaction#sha method, as it misleads users into thinking it can be used as a unique identifier.

the trade-off

My impression is that Cmxl has tried to avoid having Cmxl::Field instances depend on the Cmxl::Statement that contains them. This is good software design: it allows fields to be constructed from an isolated line, which helps testing and maybe other use cases I don't know about.

But for the primary use-case of the library (parsing statements) it leads to everybody rolling their own transaction identifiers and eventually running into the problems encountered above.

grncdr commented 4 years ago

Your first case could be solved by not only creating the sha of the actual line content but also the details

I just want to mention that this is what we were doing, and upgrading cmxl (admittedly from a very old version) changed the parsing of 'details' slightly, so we ended up with a load of duplicated transactions. Hence my recent interest in the topic :wink:

bumi commented 4 years ago

thanks @grncdr for your input. very valuable.

So I think we for sure should add the details to generate the hash. this should be a simple change to the 61 field parser

Then we should add the line index to the Field and use it as part of the sha.

And the third step is to deprecate the wording sha and introduce some method name that does not indicate global uniqueness but is a statement uniqueness indicator. I also would not use cmxl_id as _id also suggests uniqueness. do you have any suggestions?

Any help implementing this would be helpful as I might be slow currently due to limited time.

bumi commented 4 years ago

would using that sha of the whole statement + field sha + line index help?

or as we try to somehow generate a unique identifier for the field we could allow passing in some identifier value to CMXL.parse which could be a filename, db id, or some other global counter.

I am super sorry that that method and confusion caused you problems and wasted your time.

Uepsilon commented 4 years ago

Wow, this is super interesting to hear since I've been tinkering with a similar issue.

There is another thing to consider which makes the issue a lot more difficult.

Since we added MT942 (Vormerkposten aka VMK) to the library, I would expect that a statement passed via MT942 would have the same SHA as the matching statement passed via MT940 so you can match those together.

Since the order of statements would not necessarily be the same in both format this would rule out the line index influence. I have not yet figured out how to resolve that issue since it seems there is no real transaction ID passed with MT94X-format unlike CAMT which has an transaction ID matching between VMK and regular statements.

@bumi maybe I've overlooked some real transaction ID within the documentation, mind taking a look for yourself, please?

bumi commented 4 years ago

no there is no real transaction ID in MT9XX. Thus anything we try to do on our side will always be some kind of a hack. - for that reason cmxl also does not provide such an ID though the method sha indicated something wrong.

we could make an id generator configurable, that gets the the statement field object and line index. So everybody can configure it for custom needs with global input from outside.

something like: -> (statement, field, index) { "#{Time.now.strftime('%Y-%m-%d')}-#{stement.sha}-#{field.sha}-#{index}" }

grncdr commented 4 years ago

would using that sha of the whole statement + field sha + line index help?

Yes, I meant for sha in my suggestion to refer to the SHA of the whole statement.

In that case you don't actually need the field SHA, since only one thing can be at a given line index. The field SHA might still be useful for reconciling VMK/STA data though. See the last section for that.

I am super sorry that that method and confusion caused you problems and wasted your time.

Not even close to the amount of time the library has saved us! So please don't read me wrong: we :heart: this library! :smile:

And the third step is to deprecate the wording sha and introduce some method name that does not indicate global uniqueness but is a statement uniqueness indicator. I also would not use cmxl_id as _id also suggests uniqueness. do you have any suggestions?

I think the combination of statement SHA + line index should be globally unique though! Hence my suggestion to call it cmxl_id. But, see below for the caveats.

The problem with using the statement hash

Since we added MT942 (Vormerkposten aka VMK) to the library, I would expect that a statement passed via MT942 would have the same SHA as the matching statement passed via MT940 so you can match those together.

Since the order of statements would not necessarily be the same in both format this would rule out the line index influence.

I can guarantee they're not the same. In fact, transactions aren't even grouped together in the same statements in each format. Unfortunately, I also can't leave out the line index entirely (it's needed to handle the pathological duplicate transaction case) which leads me to...

if hashes aren't working, you aren't using enough of them :wink:

If you want to build a system that reliably stores and deduplicates transaction data from both MT942 and MT940 (referred to as VMK and STA below). A transaction needs the following IDs:

The transaction would then have 2 unique composite ID's: (vmk_sha, vmk_index)and (sta_sha, sta_index), while the own_sha would only be used to speed up reconciliation of VMK and STA.

... banks 😅

grncdr commented 4 years ago

A further note about own_hash above: it would be a really good idea if the various Field types worked off of frozen strings representing unaltered field lines. Then they could produce a consistent sha value no matter what changes are made to the parsers.