Open Jeeva-Ganesan opened 7 years ago
Hey @Jeeva-Ganesan, I usually prefer to just make the global model better for everyone rather than have people train custom models. Usually if there are parsing problems at this point, it's either a "bug" in our training data which can be fixed or a pattern found in real-world data that's not in the training data, which can often be generated or otherwise remedied. Just tried a handful of NZ addresses and they seemed fine to me. Can you be more specific as to what didn't work well?
AFAICT that data set is part of OpenAddresses, so we already train on the entire country of New Zealand, although the version used by OpenAddresses appears to be deprecated and needs to be updated. Also, we use OSM boundaries/cities/suburbs instead of those provided in the original data set because the "locality_suburb" field contains a mix of cities and suburbs, which are separate fields in libpostal. As such, it's more convenient to use OSM's classifications throughout so the parser doesn't get confused by different labels for the same object when mixing datasets. So it's possible that the parser gets a few city names wrong if they're not in OSM. Since New Zealand is made up of islands, it might also be a place that suffers from the "orphan coastal cities" issue (#203 and https://github.com/openvenues/libpostal/issues/27#issuecomment-292740044), which is being fixed for v1.1.
Thanks for your reply. Here are some examples:
>>> from postal.expand import expand_address
>>> from postal.parser import parse_address
>>> expand_address('2A Carlton Street Auckland')
[u'2a carlton street auckland']
>>> expand_address('53 cook street Auckland')
[u'53 cook street auckland']
>>> parse_address('53 cook street, New Zeland')
[(u'53', u'house_number'), (u'cook street new zeland', u'road')]
>>> parse_address('1104,53 cook street, New Zealand')
[(u'1104,53', u'house_number'), (u'cook street', u'road'), (u'new zealand', u'country')]
>>> expand_address('Grd floor 53 cook street auckland')
[u'garden floor 53 cook street auckland']
1. In the first example, neither address was expanded at all. It's returning exactly what I typed. I was expecting at least the country name to be added.
2. In parse_address, if the country name is misspelt, it's not parsed as a country.
3. In the third one, the flat/apartment number is returned as part of the house number.
4. In the fourth example, "Grd floor" (ground floor) is expanded to "garden floor".
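As a stopgap for the third and fourth points, the input could be normalized before it reaches libpostal. The sketch below is only an illustration: `preprocess_nz_address` and both regular expressions are hypothetical and not part of libpostal, and the assumption that a leading `1104,53` means "unit 1104, house number 53" comes only from the example above. Whether the parser then labels the unit correctly has not been checked here.

```python
import re

# Hypothetical helper, not part of libpostal. It rewrites two patterns
# from the examples above before the string is passed to the library.

# "Grd floor" / "Grd flr" / "Grd fl" -> "ground floor"
_GROUND_FLOOR = re.compile(r'\bgrd\s+(?:floor|flr|fl)\b', re.IGNORECASE)

# Leading "<unit>,<house> " such as "1104,53 " (assumed to mean unit, house)
_UNIT_HOUSE = re.compile(r'^\s*(\d+[a-z]?)\s*,\s*(\d+[a-z]?)\s+', re.IGNORECASE)


def preprocess_nz_address(address):
    """Return `address` with 'Grd floor' spelled out and a leading
    'unit,house' pair rewritten as 'unit <unit> <house> '."""
    address = _GROUND_FLOOR.sub('ground floor', address)
    address = _UNIT_HOUSE.sub(r'unit \1 \2 ', address)
    return address
```

The result would then be passed to `expand_address` or `parse_address` as usual. Addresses that match neither pattern are returned unchanged.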
I'm adding the new street addresses layer to OpenAddresses: https://github.com/openaddresses/openaddresses/pull/3010, so that should make it into 1.1. The suburb/city delineations are a little cleaner, which makes the source names usable with some simple logic. That would add more city/suburb names to supplement what's in OSM. It usually helps overall to get a more exhaustive list from a government source.
Hi, first of all, thanks for such a great contribution. This is an amazing project. We tried to use it for some addresses in New Zealand, but unfortunately it didn't perform as we expected. There is this dataset, https://data.linz.govt.nz/layer/3353-nz-street-address/, which provides all the NZ addresses. We are thinking of training on it. You also mentioned in some of the previous issues that you are working on automating the preprocessing and training pipeline. Is that available now? Or can you train on this dataset?