darrell closed this 9 years ago
I'm not sure we can make thesauruses do what we want. In the current implementation (http://www.postgresql.org/docs/9.4/static/textsearch-dictionaries.html) they seem to be only glorified synonym dictionaries that can handle phrases.
Oh, WRT "Connecticut", no, we don't want to go there. Once the scope of the matching problem expands far enough this full-text-search technique starts to fall apart, IMO, as there's just too much jurisdictional ambiguity for a simple token-based approach to succeed.
Hmm... I think it does do what we want.
Play around a bit.
Create addressing_en.ths containing:
ct: court courts ct
then:
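-- Thesaurus dictionary backed by addressing_en.ths; the substituted words
-- are normalized by the "simple" subdictionary.
CREATE TEXT SEARCH DICTIONARY public.addresses_ths_en (
    TEMPLATE = pg_catalog.thesaurus,
    DictFile = addressing_en,
    Dictionary = simple
);
CREATE TEXT SEARCH CONFIGURATION addressing_en (
    COPY = simple
);
-- Consult the thesaurus first, then the synonym and stop-word dictionaries
-- (addressing_syn_en and addressing_stop_en are defined elsewhere).
ALTER TEXT SEARCH CONFIGURATION addressing_en
    ALTER MAPPING FOR asciiword, word
    WITH addresses_ths_en, addressing_syn_en, addressing_stop_en;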
returns:
> select to_tsvector('addressing_en','123 maple ct');
to_tsvector
-----------------------------------------------
'123':1 'court':3 'courts':4 'ct':5 'maple':2
(1 row)
Time: 5.209 ms
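For anyone following along without a PostgreSQL instance handy, the expansion shown above can be imitated with a toy model. The Python sketch below is not how PostgreSQL implements thesaurus dictionaries: `THESAURUS` and `to_vector` are hypothetical names, and the position numbering simply copies what the output above shows (each substituted word takes the next position). It ignores phrases, stop words, and the other two dictionaries in the mapping.

```python
# Toy model of a single-word thesaurus rule; not PostgreSQL's implementation.
# Mirrors the rule "ct: court courts ct" from addressing_en.ths.
THESAURUS = {"ct": ["court", "courts", "ct"]}


def to_vector(text):
    """Map each lexeme to its list of positions, expanding thesaurus hits."""
    vector = {}
    position = 0
    for token in text.lower().split():
        # A token with no rule passes through unchanged.
        for lexeme in THESAURUS.get(token, [token]):
            position += 1
            vector.setdefault(lexeme, []).append(position)
    return vector


print(to_vector("123 maple ct"))
```

The abbreviated form survives next to its expansions, which is the property that a plain one-to-one replacement would lose.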
Yes, you’re right.
I'll work on the pull request. Probably not until tomorrow, though.
I’d note that we probably don’t want to preserve every variant willy-nilly; it’s better to reduce all variants to one canonical form. Where I see this being helpful is in deliberately keeping ambiguity in places where we have terminological overlap.
So we shouldn’t go
ave: ave avenue
But we should go (bad example, because I don’t actually want to deal w/ states)
ct: court connecticut
Yah, as I'm thinking harder about this, I suppose it really only makes sense for the single-letter directional prefixes, where there are profoundly ambiguous decodings of the term.
w: w west
e: e east
n: n north
s: s south
I think my brain was not fully engaged on other suggestions.
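The trade-off discussed here can be sketched in a few lines. This is a toy Python model with hypothetical names (`RULES`, `normalize`, `matches`), not PostgreSQL behaviour, and it compares tokens position by position purely for illustration: unambiguous abbreviations collapse to one canonical form, while single-letter directionals keep both readings so that either one can match.

```python
# Toy model; RULES, normalize and matches are illustrative names only.
RULES = {
    "ave": ["avenue"],      # unambiguous: collapse to one canonical form
    "avenue": ["avenue"],
    "e": ["e", "east"],     # ambiguous: keep the letter and the direction
    "east": ["east"],
}


def normalize(text):
    """Return one set of candidate lexemes per token."""
    return [set(RULES.get(tok, [tok])) for tok in text.lower().split()]


def matches(query, address):
    """True if the token sequences align and every pair shares a lexeme."""
    q, a = normalize(query), normalize(address)
    return len(q) == len(a) and all(x & y for x, y in zip(q, a))


print(matches("100 east main ave", "100 E Main Avenue"))
print(matches("500 e st", "500 E St"))
print(matches("500 east st", "500 North St"))
```

With these rules, "ave" and "avenue" meet at a single lexeme, whereas a stored "E" stays findable both as the letter and as "east".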
resolved in pull #3
Instead of using synonyms, it might make sense to use thesauruses. That way you can keep the abbreviated form in the resulting vector as well as the expanded form.
So ct returns court, courts, and ct (and potentially Connecticut, but if we want to go down that road, it opens a bunch of other potential issues).
This is especially true for things like "E", which can be both the name of a street and an abbreviation for "East", such as in Washington, D.C.
I'm happy to do the work, but I thought I should test the waters before implementing it.