Closed (boshkins closed this issue 9 years ago)
Hey Anatoly,
.address_components controls the types of dictionaries used in expand. In some use cases we may already know which component we're sending e.g. during indexing/ingestion if the data source's fields are already separated or if the original string has been run through the address parser prior to expansion.
Re: the address parser, it is not currently on GitHub (pieces of it are) and has a different API that does include parsed components. The parser is training now and may need some tuning to work well; I will update the Makefile when it's ready and uploaded to S3.
As far as output not changing, most of the very common abbreviations affect the ADDRESS_STREET component, so you might have to dig a little to get different output. For instance, if you set .address_components=ADDRESS_HOUSE_NUMBER | ADDRESS_UNIT the main program will produce different output:
# Expected: apt is expanded to apartment but st is not expanded to street/saint
./libpostal "123 main st apt 2" en
123 main st apartment 2
There's a file called gazetteer_data.c which contains the mapping of dictionaries to address components.
Note that .address_components only controls expansion from the repo-resident dictionaries. Expansions of place names from GeoNames are coming soon but still need to be compressed to a reasonable size since they will need to be downloaded from S3 with the other data files periodically.
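To make the idea of a component mask gating dictionaries concrete, here is a minimal toy sketch in C. It is not libpostal's code and does not read gazetteer_data.c; the TOY_* flags, toy_entry_t and toy_expand_token are hypothetical names invented for illustration, and the two-entry dictionary is an assumption. It only shows why "apt" expands and "st" does not when the mask contains the unit flag but not the street flag.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical component flags for this sketch; libpostal defines its own. */
enum {
    TOY_ADDRESS_HOUSE_NUMBER = 1 << 0,
    TOY_ADDRESS_STREET       = 1 << 1,
    TOY_ADDRESS_UNIT         = 1 << 2
};

/* One toy dictionary entry: an abbreviation, its expansion, and the
 * components for which the entry is allowed to apply. */
typedef struct {
    const char *abbreviation;
    const char *expansion;
    uint16_t components;
} toy_entry_t;

static const toy_entry_t TOY_ENTRIES[] = {
    {"st",  "street",    TOY_ADDRESS_STREET},
    {"apt", "apartment", TOY_ADDRESS_UNIT},
};

/* Return the expansion when an entry matches the token and shares at least
 * one component bit with the requested mask; otherwise return the token
 * unchanged. */
static const char *toy_expand_token(const char *token, uint16_t address_components) {
    size_t n = sizeof(TOY_ENTRIES) / sizeof(TOY_ENTRIES[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(token, TOY_ENTRIES[i].abbreviation) == 0 &&
            (TOY_ENTRIES[i].components & address_components) != 0) {
            return TOY_ENTRIES[i].expansion;
        }
    }
    return token;
}
```

With the mask TOY_ADDRESS_HOUSE_NUMBER | TOY_ADDRESS_UNIT, toy_expand_token("apt", mask) returns "apartment" while toy_expand_token("st", mask) returns "st" unchanged, which mirrors the command-line example above.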
./al
I know this has been closed for a while, but is .address_components still the primary way to change the address parser? And is it possible to do this from the ./address_parser CLI as a test? Attempting to change it like .language and .country was not an option.
I'm trying to return address line 1, address line 2 (e.g. suite, unit, PO box), city, state, country, and postal code. The default is currently set to
.address_components = ADDRESS_NAME | ADDRESS_HOUSE_NUMBER | ADDRESS_STREET | ADDRESS_UNIT,
and I want to configure it for
.address_components = ADDRESS_NAME | ADDRESS_HOUSE_NUMBER | ADDRESS_STREET | ADDRESS_UNIT | ADDRESS_LOCALITY | ADDRESS_COUNTRY | ADDRESS_POSTAL_CODE,
I attempted to change the defaults in libpostal.c, however that seemed to break the build. Thanks for the help, let me know if there's a better issue thread to ask this question.
The options described here are for expand_address (normalizing abbreviations, etc.)
The parser only has nominal options - language/country were originally going to be used in the machine learning models (e.g. for when the country is known a priori) but in fact the global model was more accurate without those options. They're essentially deprecated.
The standard parser should give you the results you're looking for, though it will be more granular than line1/line2. libpostal doesn't currently handle unit/suite numbers (very few examples of these sorts of addresses in OSM) so under the current models they're often labeled as "road" or "house_number". The new release I'm working on includes several new tags for unit, level, etc. (see #48).
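Since the parser output is more granular than line1/line2, one option is to regroup the labeled components in calling code. The following is a rough sketch only: it does not call libpostal, the toy_* types and functions are hypothetical, and the label strings ("house_number", "road", "unit", "po_box", "city", "state", "postcode", "country") are assumed to match what the parser emits, with unit-style labels depending on the newer models mentioned above.

```c
#include <stdio.h>
#include <string.h>

/* A (label, value) pair as it might come back from a parser (hypothetical). */
typedef struct {
    const char *label;
    const char *value;
} toy_component_t;

/* Coarse, form-style address assembled from the granular labels. */
typedef struct {
    char line1[128];
    char line2[128];
    char city[64];
    char state[64];
    char postcode[32];
    char country[64];
} toy_address_t;

/* Append value to dst, separated by a space, truncating if dst is full. */
static void toy_append(char *dst, size_t cap, const char *value) {
    size_t used = strlen(dst);
    if (used > 0 && used + 1 < cap) {
        dst[used++] = ' ';
        dst[used] = '\0';
    }
    snprintf(dst + used, cap - used, "%s", value);
}

/* Route each labeled component to a coarse field; unknown labels are skipped. */
static toy_address_t toy_group(const toy_component_t *parts, size_t n) {
    toy_address_t out;
    memset(&out, 0, sizeof(out));
    for (size_t i = 0; i < n; i++) {
        const char *label = parts[i].label;
        const char *value = parts[i].value;
        if (!strcmp(label, "house_number") || !strcmp(label, "road")) {
            toy_append(out.line1, sizeof(out.line1), value);
        } else if (!strcmp(label, "unit") || !strcmp(label, "po_box")) {
            toy_append(out.line2, sizeof(out.line2), value);
        } else if (!strcmp(label, "city")) {
            toy_append(out.city, sizeof(out.city), value);
        } else if (!strcmp(label, "state")) {
            toy_append(out.state, sizeof(out.state), value);
        } else if (!strcmp(label, "postcode")) {
            toy_append(out.postcode, sizeof(out.postcode), value);
        } else if (!strcmp(label, "country")) {
            toy_append(out.country, sizeof(out.country), value);
        }
    }
    return out;
}
```

Components are appended in the order they arrive, so line1 only reads naturally when the parser returns the house number before the road; a locale-aware formatter would be needed for countries that order them differently.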
Al,
In the normalize_options structure, I see a bit mask address_components, which is by default set to ADDRESS_HOUSE_NUMBER | ADDRESS_STREET | ADDRESS_UNIT. This looks like a way of controlling some aspects of address parsing and (hopefully) providing some feedback to the calling code as to which parts of the address have been identified. I tried populating the option with different values but saw no difference in the output. A clarification or maybe some examples would be very much appreciated.
Thank you, Anatoly