jasonrig / address-net

A package to structure Australian addresses
MIT License
195 stars 86 forks

Retrain model #9

Open 1653100 opened 4 years ago

1653100 commented 4 years ago

Hello, I have used your package and it is very useful. But my data is in Unicode (Vietnamese), and the model does not work well on it. Can I use your code to retrain a new model on my own Vietnamese data? If yes, could you please help me? Thank you a lot. As a Unicode example: "Số nhà 25, ngõ 294 Kim Mã, Phường Kim Mã, Quận Ba Đình, Thành phố Hà Nội". Here "street" is now "ngõ", "state" is now "Quận", ... Sorry for my bad English. Looking forward to hearing from you soon.

jasonrig commented 4 years ago

Your English is completely fine, don't worry!

This model is trained only on Australian address data, so it will not work at all for Vietnamese addresses, and it will probably have a lot of problems with addresses from any other country.

The model itself is quite simple, so you can retrain it. You can see from my answer in issue #10 that the model produces one class per character. Since Vietnamese text uses Unicode characters, there are many more possible characters than in the standard English alphabet (e.g. ă, â, đ, ê, ô, ơ). So, you have a choice:

  1. expand the set of possible characters (the "vocabulary") to include them
  2. find a method to reduce the characters with accent marks back to their base character, e.g. ă, â -> a
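
Both options can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not code from address-net: the function names `build_vocabulary` and `strip_diacritics` are made up for the example, and reserving index 0 for padding is an assumption. Note that đ/Đ have no Unicode decomposition, so they need an explicit mapping.

```python
import unicodedata

# đ/Đ have no canonical decomposition, so NFD alone will not reduce them.
_NO_DECOMPOSITION = str.maketrans({"đ": "d", "Đ": "D"})


def build_vocabulary(addresses):
    """Option 1: collect every character seen in the data into a vocabulary.

    Index 0 is left free for padding (an assumption for this sketch).
    """
    chars = sorted({ch for a in addresses for ch in unicodedata.normalize("NFC", a)})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def strip_diacritics(text):
    """Option 2: reduce accented characters to their base character."""
    # NFD splits e.g. "ă" into "a" plus a combining mark (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.translate(_NO_DECOMPOSITION)


if __name__ == "__main__":
    print(strip_diacritics("Quận Ba Đình, Thành phố Hà Nội"))
    print(build_vocabulary(["Kim Mã", "Hà Nội"]))
```

Option 2 keeps the vocabulary small but discards information that distinguishes some Vietnamese words, so which option works better is something to check on your own data.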

Once you have decided how you will approach the problem, you need to find a structured database of addresses. You can use this to automatically generate labelled training data.
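
As a minimal sketch of what "automatically generate labelled training data" could look like with one class per character: given a structured record, join the fields into a single string and give every character the class of the field it came from. The field names, the `LABELS` list, and the `label_address` helper are hypothetical and chosen for illustration; they are not address-net's actual classes or code.

```python
from typing import Dict, List, Tuple

# Hypothetical label set; index 0 is used for separator characters.
LABELS = ["separator", "house_number", "street", "ward", "district", "city"]


def label_address(record: Dict[str, str], separator: str = ", ") -> Tuple[str, List[int]]:
    """Build an address string and one class index per character."""
    parts: List[str] = []
    labels: List[int] = []
    for field, value in record.items():
        if not value:
            continue  # skip fields missing from this record
        if parts:
            # Characters between fields get the separator class.
            parts.append(separator)
            labels.extend([0] * len(separator))
        parts.append(value)
        labels.extend([LABELS.index(field)] * len(value))
    return "".join(parts), labels


if __name__ == "__main__":
    text, labels = label_address({"house_number": "25", "street": "Kim Ma"})
    for ch, lab in zip(text, labels):
        print(repr(ch), LABELS[lab])
```

Because the labels come directly from the database fields, no manual annotation is needed. Varying the field order or separators when generating examples may help the model cope with differently written addresses.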

1653100 commented 4 years ago

Thank you so much. Your answer helped me a lot. I have rebuilt the model using Keras, and it runs well. It doesn't work as well as yours, though: the predictions still mislabel some letters. By the way, could I have an outline of your model, such as the order of the layers, the number of layers, ...? Once again, thank you a lot. ^^ Have a nice day.