selva221724 / pypostalwin

libpostal wrapper python package for windows
MIT License

error when using german specific letters #5

Open smartini87 opened 2 years ago

smartini87 commented 2 years ago

The script aborts when the input contains German umlauts (ä, ö, ü) or the sharp s (ß).

selva221724 commented 2 years ago

Hi @smartini87,

I checked with libpostal's address_parser.exe directly: it does not accept input containing German umlauts or other accented characters, which is why pypostalwin was not able to parse it. I have added a screenshot from address_parser.exe below.

This issue was also raised in libpostal, where it is noted that libpostal needs a UTF-8 encoded string.

image

This is why I have added a layer in pypostalwin that removes special characters with non-ASCII values.

image

Also, you have to normalize non-English or special-character addresses with expandAddress before passing them to the parser; this is mentioned in libpostal's README.

I think you can use the function below to remove the accents before passing the address to the parser.

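import unicodedata

def remove_accents(input_str):
    # Decompose accented characters (e.g. "ü" -> "u" + combining diaeresis),
    # then drop the combining marks. Letters without a decomposition,
    # such as "ß", are left unchanged.
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))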

Let me know if that works, or whether we need to wait for a newer libpostal version that allows different character encodings.
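As a quick illustration of what the NFKD approach does and does not change, here is a minimal sketch using only the standard library. The final `.replace("ß", "ss")` step is an extra assumption added for illustration and is not part of the function above.

```python
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

# Umlauts decompose into a base letter plus a combining mark, so they are folded.
print(remove_accents("Mühldorf"))  # Muhldorf

# "ß" has no Unicode decomposition, so it passes through unchanged.
print(remove_accents("Straße"))    # Straße

# Assumption for illustration: replace the sharp s explicitly afterwards.
print(remove_accents("Straße").replace("ß", "ss"))  # Strasse
```

So for German input, the sharp s would still need separate handling before the string is pure ASCII.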

smartini87 commented 2 years ago

Thanks for trying this out. The function you proposed works, but the output is not what I would like it to be, because the city is technically misspelled: [image] My idea would be to check whether a replacement of those characters is possible; if so, I would save that information and, once parsing is done, change the letter back to the original one. I cannot simply convert every string back to umlauts, because a city or street may originally be written the same way as the converted form.
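The save-and-restore idea could be sketched roughly as follows. This is only an illustration under assumptions: `fold_with_map` and `restore_tokens` are hypothetical helper names, the mapping is kept per address (so unrelated records are never rewritten), and the restore step works on a plain string value rather than on any particular pypostalwin output structure.

```python
import unicodedata

def fold_char(ch):
    # "ß" has no Unicode decomposition, so it is handled explicitly.
    if ch == "ß":
        return "ss"
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold_with_map(address):
    """Return the folded address and a {folded_token: original_token} map.

    Only tokens that actually changed are recorded, keyed in lowercase.
    """
    restore = {}
    folded_tokens = []
    for token in address.split():
        folded = "".join(fold_char(c) for c in token)
        folded_tokens.append(folded)
        if folded != token:
            restore[folded.strip(",.;").lower()] = token.strip(",.;")
    return " ".join(folded_tokens), restore

def restore_tokens(value, restore):
    """Put back the original spelling for tokens recorded in the map."""
    return " ".join(restore.get(tok.lower(), tok) for tok in value.split())

folded, restore = fold_with_map("Weißgerber Str. 10, 84453 Mühldorf am Inn")
print(folded)
print(restore_tokens("muhldorf am inn", restore))
```

A limitation of this sketch: if one address contains both a folded and an unfolded spelling of the same word, the map cannot tell them apart.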

Also, while facing this issue I was stuck in an infinite loop with no error shown, so I was not able to skip that particular data record with exception handling. Is there any way to at least get an error when the parser cannot return a result?
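One generic way to turn a silent hang into an error is to run the call in a daemon thread and stop waiting after a timeout. The sketch below is not a pypostalwin feature; `call_with_timeout` and `ParseTimeout` are hypothetical names, and the hung call itself is not stopped, it is only abandoned.

```python
import threading

class ParseTimeout(Exception):
    """Raised when the wrapped call does not finish in time."""

def call_with_timeout(func, arg, timeout_s=5.0):
    result = {}

    def worker():
        try:
            result["value"] = func(arg)
        except Exception as exc:  # pass errors on to the caller
            result["error"] = exc

    # Daemon thread: a hung call will not keep the interpreter alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        raise ParseTimeout(f"no result after {timeout_s} seconds")
    if "error" in result:
        raise result["error"]
    return result["value"]

# Intended use (assumption): call_with_timeout(parser.runParser, address)
```

After a timeout the parser object may be left in an unknown state, so it is probably safer to treat it as unusable for further records.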

import pypostalwin
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

parser = pypostalwin.AddressParser()
try:
    parsedAddress = parser.runParser(remove_accents("Weissgerber Str. 10, 84453 Mühldorf am Inn"))
    print(parsedAddress)
except Exception as exc:
    print('Error:', exc)

selva221724 commented 2 years ago

Thanks for the reply @smartini87. I just ran the same code in my Python environment and pasted the result below.

image

It does not give me an infinite loop. Please make sure you are using the latest version:

pip install pypostalwin==0.0.3 

pypostalwin may get stuck if the input contains non-ASCII characters such as ®, ± or Æ, but I have added handling that removes many of these characters. Basically, the best practice is to convert your address into a UTF-8 encoded string before sending it to the parser.
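A pre-check along these lines could flag records that still contain non-ASCII characters before they reach the parser. This is a minimal sketch; `non_ascii_chars` and `check_address` are hypothetical helpers, not part of pypostalwin.

```python
def non_ascii_chars(text):
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})

def check_address(address):
    """Raise ValueError if the address still contains non-ASCII characters."""
    leftovers = non_ascii_chars(address)
    if leftovers:
        raise ValueError(f"non-ASCII characters in address: {leftovers}")
    return address

print(non_ascii_chars("Mühldorf ® ±"))  # ['®', '±', 'ü']
```

Raising early like this lets a batch job skip or log the affected record instead of waiting on the parser.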