pramsey / pgsql-addressing-dictionary

TSearch dictionaries for addresses
MIT License

Preserve original abbreviation as well #2

Closed darrell closed 9 years ago

darrell commented 9 years ago

Instead of using synonyms, it might make sense to use thesauruses. That way you can keep the abbreviated form in the resulting vector as well as the expanded form.

So ct returns court, courts, and ct (and potentially Connecticut, but if we want to go down that road, it opens up a bunch of other potential issues).

This is especially true for things like "E", which can be both the name of a street and "East", as in Washington, D.C.

I'm happy to do the work, but I thought I should test the waters before implementing it.

pramsey commented 9 years ago

I'm not sure we can make thesauruses do what we want. In the current implementation (http://www.postgresql.org/docs/9.4/static/textsearch-dictionaries.html) they seem to be little more than glorified synonym dictionaries that can handle phrases.

pramsey commented 9 years ago

Oh, WRT "Connecticut", no, we don't want to go there. Once the scope of the matching problem expands far enough this full-text-search technique starts to fall apart, IMO, as there's just too much jurisdictional ambiguity for a simple token-based approach to succeed.

darrell commented 9 years ago

Hmm... I think it does do what we want.

Play around a bit.

Create addressing_en.ths (in $SHAREDIR/tsearch_data) containing:

ct: court courts ct

then

CREATE TEXT SEARCH DICTIONARY public.addresses_ths_en (
    TEMPLATE = pg_catalog.thesaurus,
    DictFile = addressing_en,
    Dictionary = simple
);

CREATE TEXT SEARCH CONFIGURATION addressing_en (
    COPY = simple
);

ALTER TEXT SEARCH CONFIGURATION addressing_en
    ALTER MAPPING FOR asciiword, word
    WITH addresses_ths_en, addressing_syn_en, addressing_stop_en;

returns:

> select to_tsvector('addressing_en','123 maple ct');
                  to_tsvector                  
-----------------------------------------------
 '123':1 'court':3 'courts':4 'ct':5 'maple':2
(1 row)

Time: 5.209 ms

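
Given the vector shown above, a quick way to see the benefit is to match it against both the abbreviation and the expansion. This is a minimal sketch that assumes the addressing_en configuration and the addressing_en.ths file above are installed. The search terms are cast directly to tsquery so that they bypass the dictionaries and are compared with the stored lexemes as-is.

```sql
-- Assumes the addressing_en configuration defined above is installed.
-- A direct cast to tsquery skips dictionary normalization, so each
-- term is compared against the lexemes in the vector unchanged.
SELECT to_tsvector('addressing_en', '123 maple ct') @@ 'ct'::tsquery    AS matches_abbreviation,
       to_tsvector('addressing_en', '123 maple ct') @@ 'court'::tsquery AS matches_expansion;
```

Because the vector contains both 'ct' and 'court', both columns should come back true.
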
pramsey commented 9 years ago

Yes, you’re right.  

darrell commented 9 years ago

I'll work on the pull request. Probably not until tomorrow, though.

pramsey commented 9 years ago

I’d note that we probably don’t want to preserve all copies of things willy-nilly; it’s better to take all copies of things down to one canonical form. Where I see this being helpful is in deliberately keeping ambiguity where we have terminological overlap.

So we shouldn’t go 

ave: ave avenue

But we should go (bad example, because I don’t actually want to deal w/ states)

ct: court connecticut

darrell commented 9 years ago

Yah, as I'm thinking harder about this, I suppose it really only makes sense for the single-letter directional prefixes, where there are profoundly ambiguous decodings of the term.

w: w west
e: e east
n: n north
s: s south
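
A sketch of what these entries would buy, assuming the four lines above are added to addressing_en.ths and the thesaurus dictionary behaves as it did in the ct example. The sample address is made up for illustration.

```sql
-- Assumes the directional entries above are in addressing_en.ths and
-- that addressing_en maps words through the thesaurus dictionary first.
-- The address '600 e st' is a hypothetical example.
SELECT to_tsvector('addressing_en', '600 e st') @@ 'e'::tsquery    AS matches_street_name,
       to_tsvector('addressing_en', '600 e st') @@ 'east'::tsquery AS matches_direction;
```

If the thesaurus emits both lexemes, the expectation is that a search for the street named "E" and a search for "East" would each find the row.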

I think my brain was not fully engaged on other suggestions.

darrell commented 9 years ago

resolved in pull #3