Open smartini87 opened 2 years ago
Hi @smartini87,
I checked with the libpostal executable itself: it does not accept input containing German umlauts or accented characters, which is why pypostalwin was not able to parse it. I added a screenshot below from address_parser.exe.
This is the issue that was raised in libpostal; it says libpostal needs a UTF-8 encoded string.
This is why I added a layer in pypostalwin that removes non-ASCII special characters.
Also, you have to normalize a non-English or special-character address using expandAddress before passing it to the parser; this is mentioned in libpostal's README.
I think you can use the function below to remove the accents before passing the address to the parser.
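import unicodedata

def remove_accents(input_str):
    # NFKD splits an accented letter into its base letter plus combining marks
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Drop the combining marks and keep only the base characters
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))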
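To make the effect of this function concrete, here is a small sketch of what NFKD-based stripping does to German input. One thing to note: `ß` has no Unicode decomposition, so it passes through unchanged. The `GERMAN_MAP` table and the `transliterate_german` helper are illustrative assumptions, not part of pypostalwin or libpostal; they show one possible alternative that spells umlauts out instead of dropping the dots.

```python
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

# Hypothetical transliteration table (an assumption, not a library feature).
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def transliterate_german(text):
    # Normalize to NFC first so that "u" + combining diaeresis becomes "ü"
    # and matches the single-character keys in GERMAN_MAP.
    return unicodedata.normalize("NFC", text).translate(GERMAN_MAP)

print(remove_accents("Mühldorf"))   # Muhldorf
print(remove_accents("Straße"))     # Straße (ß is not a combining sequence)
print(transliterate_german("Weißgerber Straße, Mühldorf"))
# Weissgerber Strasse, Muehldorf
```

Either way the spelling changes, so the parsed output will not match the original address exactly.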
Let me know whether that works, or whether we need to wait for a newer libpostal version that accepts other character encodings.
Thanks for trying this out. The function you proposed works, but the output is not what I would like it to be, because the city name is then technically misspelled. My idea would be to check whether those characters can be replaced; if so, I would record which replacements were made and, once parsing is done, change each letter back to the original one. I cannot simply convert every string back to umlauts, because a city or street may originally be spelled exactly like the converted form.
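The record-and-restore idea could be sketched roughly as follows. Instead of storing individual replacements, this version keeps, for every character of the stripped string, the index of the original character it came from, and then slices the original text to recover the exact spelling. The names `strip_with_offsets` and `restore_original` are hypothetical, and the sketch assumes that the parser returns each component as a (possibly lowercased) substring of its input and that the first match is the right one; neither is confirmed for pypostalwin.

```python
import unicodedata

def strip_with_offsets(text):
    """Strip accents and remember which original index each output char came from."""
    out_chars, offsets = [], []
    for idx, ch in enumerate(text):
        for base in unicodedata.normalize("NFKD", ch):
            if unicodedata.combining(base):
                continue  # drop the accent itself
            out_chars.append(base)
            offsets.append(idx)
    return "".join(out_chars), offsets

def restore_original(value, original, stripped, offsets):
    """Map a parsed value back to the matching slice of the original text."""
    haystack, needle = stripped.lower(), value.lower()
    # Give up if lowercasing changed the length or nothing can be matched.
    if not needle or len(haystack) != len(stripped):
        return value
    start = haystack.find(needle)
    if start == -1:
        return value
    end = start + len(needle) - 1
    return original[offsets[start]:offsets[end] + 1]

# Assumes the input is in NFC form (precomposed characters such as "ü").
original = unicodedata.normalize("NFC", "Weissgerber Str. 10, 84453 Mühldorf am Inn")
stripped, offsets = strip_with_offsets(original)
print(stripped)  # Weissgerber Str. 10, 84453 Muhldorf am Inn
print(restore_original("muhldorf am inn", original, stripped, offsets))
# Mühldorf am Inn
```

Because the result is sliced from the original string, a city that was genuinely written without an umlaut stays untouched.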
Also, when I ran into the issue, the script got stuck in an infinite loop without showing any error, so I could not skip the affected record with exception handling. Is there a way to at least get an error when the parser cannot return a result?
import pypostalwin
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

parser = pypostalwin.AddressParser()
try:
    # Strip accents before parsing so only base letters reach the parser
    parsedAddress = parser.runParser(remove_accents("Weissgerber Str. 10, 84453 Mühldorf am Inn"))
    print(parsedAddress)
except Exception:
    print('Error')
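A call that hangs never raises, so the try/except above cannot catch it. One possible workaround, sketched here under assumptions, is to run the call in a daemon thread and give up after a timeout. The helper name `call_with_timeout` is hypothetical, and `time.sleep` stands in for the parser call; in real use one would pass something like `parser.runParser` and the address instead. Note that the worker thread (and any external process it started) is abandoned rather than stopped.

```python
import threading
import time

def call_with_timeout(func, args=(), timeout=10.0):
    """Run func(*args) in a daemon thread; raise TimeoutError if it takes too long."""
    result = {}

    def worker():
        try:
            result["value"] = func(*args)
        except Exception as exc:  # hand the exception back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The thread keeps running in the background; we only stop waiting for it.
        raise TimeoutError(f"call did not finish within {timeout} seconds")
    if "error" in result:
        raise result["error"]
    return result["value"]

# Stand-in for a hanging parser call: sleeps longer than the timeout allows.
try:
    call_with_timeout(time.sleep, (2,), timeout=0.2)
except TimeoutError as exc:
    print("skipped record:", exc)
```

With this in place, a record that makes the parser hang can be logged and skipped like any other failing record.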
Thanks for the reply @smartini87. I just tried the same code in my Python environment and pasted the result below.
It does not give me an infinite loop. Please make sure you are using the latest version:
pip install pypostalwin==0.0.3
pypostalwin may get stuck if the input contains non-ASCII characters such as ®, ± or Æ, although I have added handling that removes as many of these characters as possible. The best practice is to convert your address to a UTF-8 encoded string before sending it to the parser.
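Characters such as ®, ± and Æ have no accent to strip, so accent removal leaves them in place. A small pre-flight check, sketched below with the hypothetical helper `non_ascii_chars`, can flag such records before they reach the parser. It only inspects the string and makes no assumption about how pypostalwin treats these characters internally.

```python
def non_ascii_chars(text):
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})

for address in ["Weissgerber Str. 10, 84453 Muhldorf am Inn", "Acme® Plaza ±5, Ærøskøbing"]:
    leftovers = non_ascii_chars(address)
    if leftovers:
        print("needs cleaning:", address, leftovers)
    else:
        print("plain ASCII:", address)
```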
Script aborts when the address contains letters with German umlauts (ä, ö, ü) or the sharp s (ß).