jasonrig / address-net

A package to structure Australian addresses
MIT License

It's messing up a few of the characters and putting them in other categories. #17

Closed MohsinTariq10 closed 2 years ago

MohsinTariq10 commented 2 years ago

For addresses like Example 1: 'rathbone mirrison bakery kenmore road', it gives this: {"building_name": "RATHBONE MIRRISONB", "street_name": " AKERY KENMORE", "street_type": "ROAD"}

Example 2: 'pontefract general infirmary southgate pontefract' gives this: {"street_name": "PONTEFRAT", "building_name": "CGENERALINFIRMARYSOUTHATE", "locality_name": "GPONTEFR", "state": "AUSTRALIAN CAPITAL TERRITORY", "street_type": "COURT"}

Can you tell me how this can be fixed? It seems like a problem in the script rather than the model.

Aretle commented 2 years ago

I've seen this before and just chalked it up to random prediction errors to do with the model rather than the script. The examples here don't seem too similar to the addresses used for training the model, which leads me to suspect that that is the cause of this issue. The model used won't be 100% accurate all the time.

You could retrain the model to better fit the format of addresses that you will be predicting; that might help solve your problem. If you wanted to confirm whether the problem comes from the script rather than the model, you would have to debug the addressnet code line by line, from where the input address string is inferred by TensorFlow to where you get the formatted address output from addressnet.

Please correct me if I'm wrong on anything.

Good luck :)

jasonrig commented 2 years ago

@MohsinTariq10 I'm pretty sure what you're seeing is as @Aretle said, where the class assigned to the individual character is wrong, and it gets bundled up with the wrong part of the address.

FYI, the characters get grouped according to their class here: https://github.com/jasonrig/address-net/blob/master/addressnet/predict.py#L147
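To make the failure mode concrete, here is a simplified illustration (not the actual addressnet code in predict.py) of how per-character class predictions get joined into address fields. The text and labels below are hypothetical: a single mislabelled character (the "b" of "bakery" tagged as building_name) is enough to drag it into the wrong field and shift the rest, reproducing the kind of split reported in this issue.

```python
from itertools import groupby

def group_by_class(text, labels):
    """Join consecutive characters that share a predicted class.

    A toy stand-in for the grouping step in addressnet/predict.py:
    each character carries one predicted label, and runs of the same
    label are concatenated into that field's value.
    """
    fields = {}
    for label, chunk in groupby(zip(text, labels), key=lambda p: p[1]):
        fields[label] = fields.get(label, "") + "".join(c for c, _ in chunk)
    return fields

# Hypothetical per-character labels with one mistake: the model tags
# the leading "b" of "bakery" as building_name instead of street_name.
text = "rathbone bakery kenmore road"
labels = (["building_name"] * 10   # "rathbone " plus the stray "b"
          + ["street_name"] * 14   # "akery kenmore "
          + ["street_type"] * 4)   # "road"

print(group_by_class(text, labels))
# One wrong character label, and "bakery" is split across two fields.
```

The point is that the grouping itself is mechanical; the misplaced characters come entirely from the labels the model assigns upstream.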

My guess is that this really comes down to having very incomplete addresses (and non-Australian addresses, it would seem), whereas the training data was always (or at least very close to) a complete address. Even ignoring that they appear to be UK addresses, those examples you gave are particularly difficult. Consider the second one:

pontefract general infirmary southgate pontefract

Without prior knowledge about location and hospital, it's pretty difficult to discern which is the suburb, street name, building name, etc. Personally I've never heard of Pontefract, so, for example, I wouldn't know if it was the street name and you just left off "Road", or whether the suburb is actually two words "Southgate Pontefract". My point is that even for a human, you have to spend a little bit of time working it out. 😉

The general approach of this model is transferable to most (perhaps all) addressing schemes, unless you are dealing with more complex writing systems like Japanese or Chinese. So long as you have some source data, like a government database, and some knowledge of how addresses could be structured when entered into your system, you should be able to use this model as inspiration. You will see that I made many assumptions in this regard. Together, these give the model some flexibility with spelling variation, as well as tolerance to some permutations and omissions, that regex matching can't handle. But since it is probabilistic in nature, if the model makes a mistake, you're kind of stuck with it unless you improve your source data and retrain (or alternatively add some corrective step after the model produces its output).
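A corrective step after the model could look something like the sketch below. Everything here is hypothetical and not part of addressnet: it just sanity-checks the predicted fields against known vocabularies (the street-type and state lists are deliberately incomplete) and flags values that look garbled, so bad parses like the second example in this issue can be caught for review instead of trusted blindly.

```python
# Hypothetical post-processing check; none of these names exist in
# addressnet. The vocabularies are small illustrative subsets.
KNOWN_STREET_TYPES = {"ROAD", "STREET", "AVENUE", "COURT", "LANE", "DRIVE"}
KNOWN_STATES = {"ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", "WA",
                "AUSTRALIAN CAPITAL TERRITORY", "NEW SOUTH WALES"}

def sanity_check(parsed: dict) -> list:
    """Return warnings for fields in the model output that look unreliable."""
    warnings = []
    street_type = parsed.get("street_type")
    if street_type and street_type not in KNOWN_STREET_TYPES:
        warnings.append(f"unrecognised street_type: {street_type!r}")
    state = parsed.get("state")
    if state and state not in KNOWN_STATES:
        warnings.append(f"unrecognised state: {state!r}")
    # Crude garbling heuristic: genuine field values rarely contain
    # very long unbroken tokens, so anything over 20 characters is
    # probably several words fused by mislabelled characters.
    for field, value in parsed.items():
        if any(len(token) > 20 for token in value.split()):
            warnings.append(f"{field} looks garbled: {value!r}")
    return warnings

# The second example from this issue trips the garbling heuristic:
parsed = {"street_name": "PONTEFRAT",
          "building_name": "CGENERALINFIRMARYSOUTHATE",
          "state": "AUSTRALIAN CAPITAL TERRITORY",
          "street_type": "COURT"}
print(sanity_check(parsed))
```

A real corrective step would check against the actual reference data (e.g. the G-NAF street-type and locality lists) rather than hard-coded sets, but the shape of the approach is the same.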

MohsinTariq10 commented 2 years ago

Thank you so much for the explanation. What I wanted to know was whether it was related to the model or to the post-processing step.

jasonrig commented 2 years ago

@MohsinTariq10 Yeah, I understand. I think the short answer is that it's the model; the long answer is that they're both interdependent (the post-processing makes some assumptions about the model, which in turn makes assumptions about the data).