jasonrig / address-net

A package to structure Australian addresses
MIT License

Lot Number over 3 characters issue #5

Open poorlymac opened 5 years ago

poorlymac commented 5 years ago

Hi,

With the pretrained model, if I have an address like this:

Lot 442, 123 AAA RD, BBB, WA 6000

it will get parsed nicely like this:

"flat_number": "442", "flat_type": "LOT", "locality_name": "BBB", "number_first": "123", "postcode": "6000", "state": "WESTERN AUSTRALIA", "street_name": "AAA", "street_type": "ROAD"

Nice !

However, if the Lot number increases to 4 characters like this:

Lot 4424, 123 AAA RD, BBB, WA 6000

then I get odd results like this:

"building_name": "O", "flat_number": "4424", "flat_type": "LOT", "locality_name": "BBB", "number_first": "123", "postcode": "6000", "state": "WESTERN AUSTRALIA", "street_name": "AAA", "street_type": "ROAD"

Is there a way to fix this ?

P.S. Really great program by the way !

jasonrig commented 5 years ago

Hi @poorlymac, thanks for the kind words and for reporting the issue.

Given that this is a probabilistic model, the answer unfortunately would come down to creating a more realistic training set (or rather, including more examples of rare addresses). I'm guessing that the origin of this error is just that the training data doesn't contain many "Lot" property types with four-digit numbers.

If I'm right about the cause, then the solution is simple insofar as it only requires retraining with a better dataset. This shouldn't be so difficult, since the pretrained model would be a reasonable set of starting parameters.

The approach I would take is to synthesise more addresses with larger street/lot/unit numbers during training. That said, the shortcoming of this code is that there is no real way to measure real-world performance since nobody, to my knowledge, has an annotated set of human-entered addresses. This means it would be difficult to assess whether the performance overall is increasing after such a change, or whether it improves these uncommon cases at the expense of performance on the more common address number ranges.
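To make the idea concrete, here is a minimal sketch of what synthesising such addresses could look like. It is not the data pipeline used by this package: the function names, the tiny street and locality lists, and the choice to sample the digit count uniformly are all assumptions made for illustration. The label keys are simply the field names that appear in the parsed output quoted earlier in this thread. The second helper shows one way to watch for the trade-off mentioned above, by scoring any parser (passed in as a plain callable) separately for each lot-number length.

```python
import random
from collections import defaultdict

# Hypothetical toy vocabularies; a real run would draw these from an address dataset.
STREETS = [("AAA", "RD", "ROAD"), ("CCC", "ST", "STREET")]
LOCALITIES = [("BBB", "WA", "WESTERN AUSTRALIA", "6000")]


def random_number(rng, min_digits=1, max_digits=5):
    """Pick the digit count uniformly first, so long numbers are as common as short ones."""
    digits = rng.randint(min_digits, max_digits)
    return str(rng.randint(10 ** (digits - 1), 10 ** digits - 1))


def synth_lot_address(rng):
    """Return (address_text, labels) for one synthetic 'Lot' address."""
    lot = random_number(rng)
    number = random_number(rng, 1, 4)
    street_name, street_abbrev, street_type = rng.choice(STREETS)
    locality, state_abbrev, state, postcode = rng.choice(LOCALITIES)
    text = f"Lot {lot}, {number} {street_name} {street_abbrev}, {locality}, {state_abbrev} {postcode}"
    labels = {
        "flat_type": "LOT",
        "flat_number": lot,
        "number_first": number,
        "street_name": street_name,
        "street_type": street_type,
        "locality_name": locality,
        "state": state,
        "postcode": postcode,
    }
    return text, labels


def accuracy_by_lot_digits(examples, parse):
    """Fraction of exact matches, grouped by the digit count of the lot number.

    `parse` is any callable mapping an address string to a dict of fields.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for text, labels in examples:
        bucket = len(labels["flat_number"])
        totals[bucket] += 1
        hits[bucket] += int(parse(text) == labels)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}
```

Running `accuracy_by_lot_digits` before and after retraining, on the same synthetic examples, would at least show whether the four-digit bucket improves while the shorter buckets hold steady. It says nothing about human-entered addresses, so the caveat about real-world performance still applies.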

I'm happy to leave this issue open, but I'm unsure when/whether I'll get around to this. If you do make any useful changes to the code, you're more than welcome to send through a PR!

poorlymac commented 5 years ago

Hi @jasonrig thanks for your feedback. I am willing to have a go at doing some retraining with a lot of address data but I have no idea how to do it (I am technical, but have never done this sort of stuff). Can you point me at a dummies guide on how I would go about doing it ?