Doesn't detect OTP in a French bank message

jd1378 / otphelper

open-source application that can copy OTP and codes from notifications automatically for you

GNU Affero General Public License v3.0

Doesn't detect OTP in a French bank message #42

Closed Cwpute closed 1 year ago

Cwpute commented 1 year ago

The message goes like this:

CMB Le code pour la confirmation du virement SEPA du 31/12 à 12:12 est le 123456. Il est valide 5 min. NE DONNEZ JAMAIS CE CODE PAR TELEPHONE.

The code here is 123456. The French phrase that translates to OTP here is "code de confirmation", or simply "code".

jd1378 commented 1 year ago

Well, sadly the first number encountered after the "code" keyword is not the code, and there are the special characters ":" and "/" before the actual code as well. I don't think I can update the pattern to capture this code.

Recently I got the idea of replacing the regex pattern with a small AI model that can detect the code. Making one requires many examples, but it could probably capture codes like your example better than a regex pattern.

for now I can't do anything about this one, sorry

Cwpute commented 1 year ago

Well, regarding the special characters, I can't say anything as I don't know much about regex. Does that mean any message that contains either : or / won't ever be supported? That seems… strange.

But regarding the actual OTP code: the other runs of digits in this message don't (and won't ever) exceed 3 digits, because 12:12 is the time and 31/12 is the date. In fact, I think all of the OTP codes I've encountered have been at least 4 digits long. So if it's not already the case, I think you could try to detect OTPs as being at least 4 digits long.

Oh but maybe you have to take into account OTPs that also have letters in them… hmm.

Cwpute commented 1 year ago

When OTPs have letters, do they also always have digits? I think they do. If so, you could try to detect OTPs as a run of at least 4 characters (with no spaces between them) containing at least one digit…
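As a rough illustration of this rule, here is a minimal Python sketch. It is not otphelper's actual code; the function name `candidate_codes` and the `min_len` parameter are made up for the example, and it only considers ASCII letters and digits.

```python
import re


def candidate_codes(message: str, min_len: int = 4) -> list[str]:
    """Return runs of ASCII letters/digits that are at least `min_len`
    characters long and contain at least one digit."""
    runs = re.findall(r"[A-Za-z0-9]+", message)
    return [r for r in runs if len(r) >= min_len and any(c.isdigit() for c in r)]


message = (
    "CMB Le code pour la confirmation du virement SEPA du 31/12 à 12:12 "
    "est le 123456. Il est valide 5 min. NE DONNEZ JAMAIS CE CODE PAR TELEPHONE."
)

# The date (31/12) and time (12:12) split into 2-digit runs and are rejected.
print(candidate_codes(message))  # ['123456']
```

By construction, this rule cannot find a code made only of letters, or one that contains spaces.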

Cwpute commented 1 year ago

Well, browsing other closed issues, it seems this is already what the code does… I can't help much more than this, I guess…

> well, sadly the first number encountered after the "code" keyword is not the code

Yeah, but if it doesn't match the 4-character rule, doesn't it still search further in the message? Does it stop right there? If so, could it be made to look for a matching pattern in the rest of the message?

jd1378 commented 1 year ago

> When OTPs have letters, do they also always have digits? I think they do.

No, I've seen OTPs made entirely of letters, with no digits (from Yahoo).

> If so, you could try to detect OTPs as a run of at least 4 characters (with no spaces between them) containing at least one digit

And there are OTPs that mix letters and numbers with spaces between them lol

> Yeah, but if it doesn't match the 4-character rule, doesn't it still search further in the message? Does it stop right there? If so, could it be made to look for a matching pattern in the rest of the message?

Yes, it stops there. I have made a link to visualize what's happening: https://regex101.com/r/67OToo/1

I'm not sure if I can improve it further, I'll give it a try

jd1378 commented 1 year ago

Nope, I saw a little hope, but even that would not get past the ":" problem.

Cwpute commented 1 year ago

What could be done is to have the regex search for the word "code" first (as it does now) and, if it finds it, run the usual matching. But if nothing that looks like an OTP follows that word (as is the case right now), search for any instance of "est" or "est le" and try to find an OTP right after that. As I understand it, : and / are looked for after a trigger word like "code" to detect a possible OTP, right? What I suggest is essentially a second, more focused OTP search once these characters are detected and do not match. I can't try to reproduce it in the regex editor; it's the first time I've used it and I'm a bit lost haha.

This would solve my issue. I guess this search pattern could be extended to other languages; for example, in English it could similarly search for "is" or "is the"...

What do you think ?

jd1378 commented 1 year ago

It already has a fallback, and that fallback is to run another regex that is specialized for codes that appear at the beginning of the message (like Instagram, Google, etc.). Besides that, this would be too specialized, and I don't want to specialize just to cover one use case.

And we can't use "is" or "is the", because this is a multi-language app, and in many cases the message does not include "is" or something similar pointing to the code after it.

The special ":" rule is there to support alphanumeric codes (a-z0-9); without it, how would you distinguish a 4-letter word from an OTP?

I'm sorry, but I don't think I'm going to put effort into handling this one single case.

Maybe if I get more issues that are similar, I'll try to go for the AI approach, but for now I'm not going to handle these cases where the message format is very unexpected.