moov-io / watchman

AML/CTF/KYC/OFAC Search of global watchlist and sanctions
https://moov-io.github.io/watchman/
Apache License 2.0

feature: handle multiple alphabets #543

Open adamdecaf opened 5 months ago

adamdecaf commented 5 months ago

Slack: https://moov-io.slack.com/archives/CFUCEBGH2/p1710500854485369

I get some results with curl 'http://localhost:8084/search?q=wiam+wahhab' and with curl 'http://localhost:8084/search?q=الخليلي+سيف'. It's the same person, and even if the results aren't the same, it means that you handle other alphabets.

The first link is a study of the phonetization rules of the Arabic language, and the second is just a table of the different English spellings used in phonetic transcription.

https://ccc.inaoep.mx/~villasen/bib/reglas%20de%20fonetizacion%20Arabe.pdf
http://www.aurint.de/phonetic_transcription.htm

The goal is not a 100% trusted translation; that is impossible with phonetic transcription. But luckily, there is a Jaro-Winkler pass. The majority of the list data is in Latin script, so I suppose it would be too big to transcribe each person being checked. BUT if we do one big transcription, only once, over all of the list data, so that every entry has a phonetic transcription in each alphabet, it would not be too big.

The execution flow would be:

1. Get the list data.
2. Transcribe it into the different alphabets.
3. STORE the transcriptions in the database as tables "arabic", "latin", "mandarin", etc., and mark whether each record is original data or a transcription.
4. Get the person to check.
5. Detect the alphabet/language of the person's data (you already do that with the "stopwords" package).
6. Search only the tables of the same alphabet, AND lower the minimum score if the table's alphabet is not the original one from the list.

Of course, it will be a lot of work to transcribe into all the alphabets, AND each alphabet can have different phonetizations (like English vs. French). But after a lot of thinking and research, I concluded that this is the best solution: it does not become too big and does not lose too much trust.
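The lookup part of the proposal above (detect the alphabet, search only that alphabet's table, lower the minimum score for transcribed records) could look roughly like the sketch below. Everything in it is an assumption for illustration: `Script`, `Entry`, `Index`, `DetectScript`, and `Search` are hypothetical names, not watchman's API; script detection uses Unicode ranges as a stand-in for the "stopwords"-based detection mentioned above; the similarity function is passed in as a parameter where watchman would apply Jaro-Winkler; and the sample names are made up.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Script is a hypothetical label for the alphabet a name is written in.
type Script string

const (
	Latin   Script = "latin"
	Arabic  Script = "arabic"
	Han     Script = "mandarin"
	Unknown Script = "unknown"
)

// Entry is one stored name. Original is false when the name was
// produced by transcription rather than taken from the source list.
type Entry struct {
	EntityID string
	Name     string
	Original bool
}

// Index holds one "table" of entries per script.
type Index map[Script][]Entry

// Similarity scores two names in [0, 1]; a Jaro-Winkler
// implementation would be plugged in here.
type Similarity func(a, b string) float64

// Match is an entry that passed its threshold, with its score.
type Match struct {
	Entry Entry
	Score float64
}

// DetectScript returns the script that most letters of s belong to
// (a Unicode-range stand-in for language detection).
func DetectScript(s string) Script {
	counts := map[Script]int{}
	for _, r := range s {
		switch {
		case unicode.Is(unicode.Latin, r):
			counts[Latin]++
		case unicode.Is(unicode.Arabic, r):
			counts[Arabic]++
		case unicode.Is(unicode.Han, r):
			counts[Han]++
		}
	}
	best, bestCount := Unknown, 0
	for _, sc := range []Script{Latin, Arabic, Han} {
		if counts[sc] > bestCount {
			best, bestCount = sc, counts[sc]
		}
	}
	return best
}

// Search looks only at the table matching the query's script and
// applies the lower threshold to transcribed (non-original) entries.
func Search(idx Index, query string, minScore, transcribedMinScore float64, sim Similarity) []Match {
	var out []Match
	q := strings.ToLower(query)
	for _, e := range idx[DetectScript(query)] {
		threshold := minScore
		if !e.Original {
			threshold = transcribedMinScore
		}
		if score := sim(q, strings.ToLower(e.Name)); score >= threshold {
			out = append(out, Match{Entry: e, Score: score})
		}
	}
	return out
}

func main() {
	// Illustrative data only: one entity, original in Latin script,
	// with a precomputed Arabic transcription.
	idx := Index{
		Latin:  {{EntityID: "1", Name: "Muhammad", Original: true}},
		Arabic: {{EntityID: "1", Name: "محمد", Original: false}},
	}
	// Placeholder similarity: exact match only.
	exact := func(a, b string) float64 {
		if a == b {
			return 1
		}
		return 0
	}
	for _, q := range []string{"muhammad", "محمد"} {
		fmt.Println(q, DetectScript(q), Search(idx, q, 0.9, 0.8, exact))
	}
}
```

Keeping the original/transcription flag on each record is what allows a single search path to apply two thresholds, so the one-time transcription step only has to fill the per-script tables.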

Projects:

Arabic Phonetic Mapping Algorithm.pdf
Arabic Phonetization .pdf

Related: https://github.com/moov-io/watchman/issues/150