Open TobiTobiM opened 5 years ago
Thanks for raising this. And yes, that is confusing... A hash suggests that it is a unique identifier that can reliably identify a transaction, and that no two different transactions have the same hash. I have also been using it that way.
But it is more of an indication and cannot reliably be used for that. It is more of a line identifier, I guess.
I am not sure what the best option is. Your first case could be solved by not only creating the sha of the actual line content but also the details
https://github.com/railslove/cmxl/blob/master/lib/cmxl/fields/transaction.rb#L10 https://github.com/railslove/cmxl/blob/master/lib/cmxl/fields/transaction.rb#L14
What are your thoughts?
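A minimal sketch of that idea, assuming the raw :61 line and the raw :86 line are both available as strings. The helper name `line_and_details_sha` and the sample lines are made up for illustration; this is not cmxl's actual implementation.

```ruby
require 'digest'

# Hypothetical helper: combine the raw :61 line and the raw :86 line into one
# digest. Joining with a newline keeps the two parts separated, since a single
# MT940 line cannot itself contain a newline.
def line_and_details_sha(transaction_source, details_source)
  Digest::SHA256.hexdigest([transaction_source, details_source].join("\n"))
end

# Made-up sample data: identical :61 lines, different invoice numbers in :86.
line  = ':61:1409010902DR000000000001,62NTRF0000549855700010//025498557/000001'
sha_a = line_and_details_sha(line, ':86:Invoice 1001')
sha_b = line_and_details_sha(line, ':86:Invoice 1002')

puts sha_a == sha_b # false
```

This only separates transactions whose details differ. Two fully identical transactions still produce the same digest.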
On your second case: this is probably a harder issue, as we do not have any state in which to keep a counter/nonce or something similar. Sadly, the MT format does not have reliable unique identifiers. Hmm?
I think your solution for the first case is fine. This is more or less what I did in my application. The fix should be easy because, as you mentioned, transaction already has details included.
If you want, I can do a PR for that issue.
Indeed, the second case is hard to solve in cmxl. Maybe it is OK to add a warning to the README that this case has to be handled by the application that uses cmxl. If the application is reading through the transactions, it's easy to add a counter there. But the developer has to know about the case.
What are your thoughts?
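An application-side counter could look roughly like the sketch below. It assumes the application walks the transactions of one statement in order and has each raw source string at hand; `keys_with_counter` is a hypothetical name, not something cmxl provides.

```ruby
require 'digest'

# Hypothetical application-side helper: takes the raw source strings of the
# transactions in statement order and returns one key per transaction.
# Identical sources get an increasing occurrence counter appended.
def keys_with_counter(sources)
  seen = Hash.new(0)
  sources.map do |source|
    sha = Digest::SHA256.hexdigest(source)
    seen[sha] += 1
    "#{sha}-#{seen[sha]}"
  end
end

keys = keys_with_counter(['credit 100,00 from A', 'credit 100,00 from A', 'debit 5,00 to B'])
puts keys.uniq.size # 3
```

The counter is only stable as long as the same statement is always read in the same order.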
Even though I wasn't asked, I would like to contribute some input on this topic, as it's something I've been spending quite some time on in the last couple of days.
First, as @TobiTobiM (and myself, and I'm sure others...) has learned the hard way, the only unique ID for a transaction is its position within a statement. Everything else can (and therefore eventually will) have a duplicate.
IMO, this library should:
- Track the line index in Statement#parse! and either pass it to Field.parse or assign it to an attribute of the parsed field (e.g. field.cmxl_id = "#{sha}_#{line_idx}").
- Deprecate the Transaction#sha method, as it misleads users into thinking it can be used as a unique identifier.

My impression is that Cmxl has tried to avoid having Cmxl::Field instances depend on the Cmxl::Statement that contains them. This is good software design: it allows fields to be constructed from an isolated line, which helps testing and maybe other use cases I don't know about.
But for the primary use-case of the library (parsing statements) it leads to everybody rolling their own transaction identifiers and eventually running into the problems encountered above.
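The suggestion can be sketched as follows, under the assumption that the statement hashes its own full source once and stamps each parsed field with that digest plus the line index. `SketchField` and `parse_fields` are hypothetical stand-ins, not Cmxl::Field or Statement#parse!.

```ruby
require 'digest'

# Hypothetical stand-in for a parsed field. It can still be built from an
# isolated line (cmxl_id is then nil), so fields stay independent of statements.
SketchField = Struct.new(:source, :cmxl_id)

# Hypothetical parse step: assign "<statement sha>_<line index>" to every field.
def parse_fields(statement_source)
  statement_sha = Digest::SHA256.hexdigest(statement_source)
  statement_source.lines(chomp: true).each_with_index.map do |line, line_idx|
    SketchField.new(line, "#{statement_sha}_#{line_idx}")
  end
end

fields = parse_fields(":61:same line\n:61:same line\n")
puts fields.map(&:cmxl_id).uniq.size # 2
```

Two identical lines in one statement get different IDs because only one field can sit at a given index.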
> Your first case could be solved by not only creating the sha of the actual line content but also the details

I just want to mention that this is what we were doing, and upgrading cmxl (admittedly from a very old version) changed the parsing of 'details' slightly, so we ended up with a load of duplicated transactions. Hence my recent interest in the topic :wink:
thanks @grncdr for your input. very valuable.
So I think we should definitely add the details when generating the hash. This should be a simple change to the :61 field parser.
Then we should add the line index to the Field and use it as part of the sha.
And the third step is to deprecate the wording sha and introduce some method name that does not indicate global uniqueness but is a statement uniqueness indicator.
I also would not use cmxl_id, as _id also suggests uniqueness. Do you have any suggestions?
Any help implementing this would be appreciated, as I might be slow currently due to limited time.
would using that sha of the whole statement + field sha + line index help?
Or, as we are trying to generate a unique identifier for the field, we could allow passing some identifier value to CMXL.parse, which could be a filename, a DB id, or some other global counter.
I am super sorry that that method and confusion caused you problems and wasted your time.
Wow, this is super interesting to hear since I've been tinkering with a similar issue.
There is another thing to consider which makes the issue a lot more difficult.
Since we added MT942 (Vormerkposten aka VMK) to the library, I would expect that a statement passed via MT942 would have the same SHA as the matching statement passed via MT940 so you can match those together.
Since the order of statements would not necessarily be the same in both formats, this would rule out the line index influence. I have not yet figured out how to resolve that issue, since it seems there is no real transaction ID passed with the MT94X format, unlike CAMT, which has a transaction ID matching between VMK and regular statements.
@bumi maybe I've overlooked some real transaction ID within the documentation, mind taking a look for yourself, please?
No, there is no real transaction ID in MT9XX. Thus anything we try to do on our side will always be some kind of hack. For that reason cmxl also does not provide such an ID, though the method name sha suggested otherwise.
We could make an ID generator configurable that receives the statement, the field object, and the line index. That way everybody can configure it for their own needs, with global input from outside.
something like:
->(statement, field, index) { "#{Time.now.strftime('%Y-%m-%d')}-#{statement.sha}-#{field.sha}-#{index}" }
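A rough sketch of how such a configurable generator could be wired up. cmxl does not ship this setting; the `IdConfig` module and its `id_generator` accessor are hypothetical. To stay self-contained, the lambdas here take raw source strings rather than statement and field objects.

```ruby
require 'digest'

# Hypothetical configuration holder; not part of cmxl.
module IdConfig
  class << self
    attr_writer :id_generator

    # Default generator: digest of the whole statement plus the line index.
    def id_generator
      @id_generator ||= lambda do |statement_source, _field_source, index|
        "#{Digest::SHA256.hexdigest(statement_source)}-#{index}"
      end
    end
  end
end

# An application can plug in outside context, here a made-up file name.
IdConfig.id_generator = lambda do |_statement_source, field_source, index|
  "import-42.sta-#{index}-#{Digest::SHA256.hexdigest(field_source)[0, 8]}"
end

puts IdConfig.id_generator.call('full statement', ':61:line', 3)
```

Whether the result is globally unique then depends entirely on what the application feeds in from outside.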
> would using that sha of the whole statement + field sha + line index help?

Yes, I meant for sha in my suggestion to refer to the SHA of the whole statement.
In that case you don't actually need the field SHA, since only one thing can be at a given line index. The field SHA might still be useful for reconciling VMK/STA data though. See the last section for that.
> I am super sorry that that method and confusion caused you problems and wasted your time.
Not even close to the amount of time the library has saved us! So please don't read me wrong: we :heart: this library! :smile:
> And the third step is to deprecate the wording sha and introduce some method name that does not indicate global uniqueness but is a statement uniqueness indicator. I also would not use cmxl_id as _id also suggests uniqueness. do you have any suggestions?

I think the combination of statement SHA + line index should be globally unique though! Hence my suggestion to call it cmxl_id. But, see below for the caveats.
> Since we added MT942 (Vormerkposten aka VMK) to the library, I would expect that a statement passed via MT942 would have the same SHA as the matching statement passed via MT940 so you can match those together.
> Since the order of statements would not necessarily be the same in both formats this would rule out the line index influence.
I can guarantee they're not the same. In fact, transactions aren't even grouped together in the same statements in each format. Unfortunately, I also can't leave out the line index entirely (it's needed to handle the pathological duplicate transaction case) which leads me to...
If you want to build a system that reliably stores and deduplicates transaction data from both MT942 and MT940 (referred to as VMK and STA below), a transaction needs the following IDs:
- vmk_sha: hash of full MT942 statement
- vmk_index: position of transaction in MT942 statement
- sta_sha: hash of full MT940 statement
- sta_index: position of transaction in MT940 statement
- own_sha: hash of transaction (and its details/:86 lines), not unique

The transaction would then have 2 unique composite IDs: (vmk_sha, vmk_index) and (sta_sha, sta_index), while the own_sha would only be used to speed up reconciliation of VMK and STA.
... banks 😅
A further note about own_sha above: it would be a really good idea if the various Field types worked off of frozen strings representing unaltered field lines. Then they could produce a consistent sha value no matter what changes are made to the parsers.
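A sketch of what that could mean in practice, assuming each field keeps its unaltered input line. `RawLineField` is a hypothetical class, not the real Cmxl::Field.

```ruby
require 'digest'

# Hypothetical field class that keeps the unaltered line as a frozen string.
class RawLineField
  attr_reader :source

  def initialize(line)
    @source = line.dup.freeze
  end

  # Digest of the raw line only, so changes to the parsing below cannot shift it.
  def sha
    Digest::SHA256.hexdigest(source)
  end

  # Parsed view of the line; free to change between library versions.
  def details
    source.sub(/\A:86:/, '').strip
  end
end

field = RawLineField.new(':86:Invoice 1001 ')
puts field.details # Invoice 1001
```

With this split, an upgrade that changes how details are parsed would leave previously stored sha values valid.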
I used the value given by the sha method in statement and transaction to find them in a database. This works fine for me, but some weeks ago I thought I had lost some transactions in my database. In fact they were all there, but the sha hashes were the same. There were 2 cases:
First case: Debit transfers on the same day, with the same amount and the same receiver account; only the transaction information differs (invoice number). The sha hash is the same because all fields in :61 are identical. The difference is in :86. My quick fix is to build my own sha from :61 and the information from :86.
Second case: Credit transfer with all values identical. The sender accidentally made the same transaction twice on the same day. This case is really rare but happened in the real world. My fix for this: I also build my own hash and add an increment to the raw transaction data (source).
So my questions to discuss:
Are the hashes meant to be used to identify transactions?
If yes, should cmxl handle these rare cases, or should the software that uses cmxl handle this?