jasonrig / address-net

A package to structure Australian addresses
MIT License

Another common abbreviation for Level #6

Open poorlymac opened 5 years ago

poorlymac commented 5 years ago

Hi,

A (unfortunately) common abbreviation for level that I have come across is a plain L, for example: UNIT 900, L 9, 50 THINGO ST, HOOHAAVILLE, VIC 3000. I even tried adding L to lookups.py and deleting the cache, but to no avail. This is the kind of result I get:

"flat_number": "9009", "flat_number_prefix": "L", "flat_type": "UNIT", "locality_name": "HOOHAAVILLE", "number_first": "50", "original": "UNIT 900, L 9, 50 THINGO ST, HOOHAAVILLE, VIC 3000", "postcode": "3000", "state": "VICTORIA", "street_name": "THINGO", "street_type": "STREET"

Writing it as L9 with no space instead drags the 9 into the 50.

Is there a way to get L recognised as a level abbreviation?

jasonrig commented 5 years ago

As with issue #5, I think this will come down to tuning the address generation code used during training. Adding L to lookups.py didn't work because the model assigns a high probability to L being a unit number prefix, and a high probability to both 900 and 9 being the unit number itself. The code then concatenates them, since it groups characters of the same class.
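To make the concatenation concrete, here is a minimal sketch of that grouping step. It is not the actual address-net code: the `group_by_class` helper and the hand-written per-character labels are assumptions for illustration, standing in for the model's most likely class at each character.

```python
from collections import defaultdict


def group_by_class(text, labels):
    """Concatenate characters that share a class, in order of appearance.

    Hypothetical helper: `labels` holds one class name per character,
    with None marking separators such as spaces and commas.
    """
    fields = defaultdict(str)
    for char, label in zip(text, labels):
        if label is not None:
            fields[label] += char
    return dict(fields)


text = "UNIT 900, L 9"
# Hand-written labels imitating the prediction described in this issue:
# "L" is seen as a unit number prefix, and both "900" and "9" as the unit number.
labels = (
    ["flat_type"] * 4 + [None]
    + ["flat_number"] * 3 + [None, None]
    + ["flat_number_prefix"] + [None]
    + ["flat_number"]
)

result = group_by_class(text, labels)
print(result)
```

Because both runs of digits carry the same class, they end up in a single field, which would explain the "9009" in the reported output.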

Perhaps issues #5 and #6 can be worked on together, since both would involve tuning the address synthesis code.

If you wanted to have a play around and see if anything works for you, I would try to work through the synthesise_address function. It's a little ugly, but you can see that for a given clean record from GNAF, it mutates the string in various ways while keeping track of what each character's class (unit type, street name, etc.) is. The goal would be to include more examples of the cases you're seeing that fail when performing these random mutations.
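As a rough idea of what such a mutation could look like, here is a sketch that writes a level using a randomly chosen abbreviation and keeps a class label for every character. It does not use the real synthesise_address internals: the `synthesise_level` function, the abbreviation list and the class names are all assumptions made up for this example.

```python
import random

# Hypothetical abbreviation list; "L" is the case reported in this issue.
LEVEL_ABBREVIATIONS = ["LEVEL", "LVL", "L"]


def label_chars(text, label):
    """Return one copy of `label` per character of `text`."""
    return [label] * len(text)


def synthesise_level(level_number, rng=random):
    """Render a level such as "L 9" or "LEVEL9" with per-character labels.

    The class names ("level_type", "separator", "level_number") are
    placeholders, not necessarily the ones address-net trains on.
    """
    abbreviation = rng.choice(LEVEL_ABBREVIATIONS)
    separator = rng.choice(["", " "])  # cover both "L9" and "L 9"
    number = str(level_number)

    text = abbreviation + separator + number
    labels = (
        label_chars(abbreviation, "level_type")
        + label_chars(separator, "separator")
        + label_chars(number, "level_number")
    )
    return text, labels


text, labels = synthesise_level(9, rng=random.Random(0))
print(text, labels)
```

The point is only that every mutated string stays aligned with its labels, so the model sees "L" followed by a number labelled as a level rather than as a unit number.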