Closed NickCEChuah closed 5 years ago
This one fails mostly because it's not actually a street address, and the training data was exclusively composed of street addresses. Specifically, the model assigns each letter of the address a probability for each component of a street address (there are 22 possible categories that a letter could be assigned to, which are described in the README file). None of these are presently suitable for a GPO Box or any other non-street address, unfortunately. So to directly answer your question, it would require substantial modifications and additional data. The data for G/PO Boxes would probably have to be synthetic, as I'm not aware of any freely available datasets for this.
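To make the per-letter idea concrete, here is a minimal sketch of how per-character class probabilities could be collapsed into fields. It is not the project's actual code: the function name, the `separator` class, and the toy probabilities are all assumptions for illustration. It also hints at why digits from two different parts of an address can end up joined in a single field when they all receive the same class.

```python
from collections import defaultdict

def fields_from_char_probs(text, char_probs, skip=("separator",)):
    """Pick the most probable class per character, then join characters by class.

    `char_probs` holds one dict per character, mapping class name -> probability.
    The class names used here are illustrative, not the model's real label set.
    """
    fields = defaultdict(str)
    for ch, probs in zip(text, char_probs):
        label = max(probs, key=probs.get)  # argmax over the class probabilities
        if label not in skip:
            fields[label] += ch
    return dict(fields)

# Toy input with made-up probabilities: every digit is labelled "postcode",
# so digits on either side of the space end up in the same field.
text = "12 3"
char_probs = [
    {"postcode": 0.9, "street_name": 0.1, "separator": 0.0},
    {"postcode": 0.8, "street_name": 0.2, "separator": 0.0},
    {"postcode": 0.0, "street_name": 0.1, "separator": 0.9},
    {"postcode": 0.7, "street_name": 0.3, "separator": 0.0},
]
print(fields_from_char_probs(text, char_probs))
```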
The reason you got something strange for the street type ("BROW") is that, of the collection of characters that the model assigned a class of "street type", the closest match to a known street type was BROW (based on the Jaro-Winkler string similarity measure, https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance). I'm going to take a stab and say that "Box" was thought to be the street type, which was then matched to BROW. This is the first time I've heard of this street type myself, but apparently it's a thing (https://meteor.aihw.gov.au/content/index.phtml/itemId/429840).
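For anyone curious how "Box" could plausibly land on BROW, here is a small self-contained sketch of Jaro-Winkler matching. It is a textbook implementation rather than the code the project uses, and the list of known street types is a short hypothetical sample, not the real lookup table.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)

def closest_street_type(candidate, known_types):
    """Return the known street type most similar to `candidate`."""
    return max(known_types, key=lambda t: jaro_winkler(candidate.upper(), t))

# Hypothetical sample of street types; the real list is much longer.
KNOWN_TYPES = ["ROAD", "STREET", "AVENUE", "BROW"]
print(closest_street_type("Box", KNOWN_TYPES))
```

With this sample list, "BOX" shares the leading "B" and the "O" with BROW, so BROW scores highest. A full list of street types could produce a different winner.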
I'm going to close this issue as I consider it out of scope, but don't let this discourage you from working on any solutions if you feel so inclined. The reason I consider it out of scope for this project is that the overall objective was to leverage some freely available address data to train a classification model that's easier to handle than some unwieldy regex in view of solving problems like geolocation.
One suggestion for anyone (yourself, myself, or others who read this) going forward might be to introduce a new class of "other" so that you can at least confidently extract known parts like postcode, suburb, state, without falsely extracting street names, etc., when they aren't present. This might produce an output that looks something like the following for your example:
{
"other": "GPO Box 500606",
"locality_name": "Canberra",
"state": "ACT",
"postcode": "2004"
}
You could then use some downstream processing (perhaps even a regex) to test for things like G/PO Boxes, etc.
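As a rough sketch of that downstream step, the snippet below checks the suggested "other" field for a G/PO Box pattern. The "other" key is the hypothetical class proposed above and does not exist in the current parser output; the function name and the exact regex are likewise assumptions and would need tuning against real data.

```python
import re

# Matches "GPO Box 123", "G.P.O. Box 123", "PO Box 123", "P.O. Box 123", etc.
PO_BOX = re.compile(
    r"^\s*(?P<kind>G\.?\s*P\.?\s*O\.?|P\.?\s*O\.?)\s*Box\s+(?P<number>\d+)\s*$",
    re.IGNORECASE,
)

def extract_po_box(parsed):
    """Return box details from the hypothetical 'other' field, or None."""
    match = PO_BOX.match(parsed.get("other", ""))
    if match is None:
        return None
    # Normalise "G.P.O." / "gpo" / "G P O" to "GPO" (and likewise for "PO").
    kind = re.sub(r"[.\s]", "", match.group("kind")).upper()
    return {"box_type": kind + " BOX", "box_number": match.group("number")}

parsed = {
    "other": "GPO Box 500606",
    "locality_name": "Canberra",
    "state": "ACT",
    "postcode": "2004",
}
print(extract_po_box(parsed))
```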
Thanks. I'm not expecting you to change your program either; I'm just trying to highlight potential gotchas. The other type of non-street address is the RMB address (roadside mail box, used for rural properties), which will break your code too.
https://en.m.wikipedia.org/wiki/Roadside_Mail_Box
Cheers Nick
Hi Jason, Great stuff and thanks for posting this up.
In the real world, a lot of people use a postal address, like this: GPO Box 500606 Canberra Act 2004. When it goes through your parser, it returns:
{'street_name': 'GPO', 'street_type': 'BROW', 'postcode': '5006062004', 'locality_name': 'CANBERRA', 'state': 'AUSTRALIAN CAPITAL TERRITORY'}
Not sure why it generated a street type BROW. Is there training data that could address this? Or would that require some re-coding?
Cheers, Nick