allow for bare #XXX unit patterns

blackmad commented 4 years ago

We saw a bunch of queries like this in our incoming data and I wanted to fix them in parser

blackmad commented 4 years ago

Hey @Joxit - any thoughts on this change?

blackmad commented 4 years ago

Good point. Let me try again with regexing in the parser on all the existing unit words and breaking them out correctly.

If I took out the PhX pattern, would you take this change for the #UU unit patterns?

related, any good idea about how to deal with #A-2 which is currently split into the parser into two different tokens?

On Sun, Jun 21, 2020 at 7:15 PM Jones Magloire notifications@github.com wrote:

@Joxit requested changes on this pull request.

Penthouse alone is classified as a unit_type https://github.com/pelias/parser/blob/4667240354a55e9b8242bfaec2f26be7b0075447/resources/libpostal/dictionaries/en/unit_types_numbered.txt#L36

IMO, this change is not a solution but only a workaround for a use case... This is a way to much specific for penthouse, the right classifier should match every unit_type and not only for ph.

Plus this is wrong, ph1 should not be a unit, but a unit_type + unit

IDK how we can achieve that, if we should use Regex or update tokenizer.... 🤷‍♂️

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pelias/parser/pull/104#pullrequestreview-434540661, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMAE5OIC2AWDDIOJKHDRX2IBDANCNFSM4N2PL35A .

-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com

blackmad commented 4 years ago

I removed the PhX part of this change, so now it's just #UUUU, let me know what you think of this

Joxit commented 4 years ago

IDK if we can achieve the Penthouse issue with regex and in this classifier... IMO this needs a tokenizer update, but this may be more complex...

#A-2 will produce 3 tokens:

#A-2                            ➜   alphanumeric  1.00  
#A                              ➜   alpha  1.00   unit  1.00  
2                               ➜   numeric  1.00   housenumber  1.00

That means #A-2 is still present and can be classified

blackmad commented 4 years ago

thanks, fixed

On Mon, Jun 22, 2020 at 12:24 PM Jones Magloire notifications@github.com wrote:

@Joxit approved this pull request.

Much better 👍

In classifier/UnitClassifier.js https://github.com/pelias/parser/pull/104#discussion_r443680051:
 // If the previous token in a unit word, like apt or suite

 // and this token is something like A2, 3b, 120, A, label it as a unit (number)
if (hasPrevUnitToken && combinedUnitRegexp.test(span.body)) {

if (hasPrevUnitTypeToken && maybeUnitToken) {

span.classify(new UnitClassification(1))

}

// A unit token that starts with '#' is always a unit if it's not the first token

A token that starts with '#"*... ?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pelias/parser/pull/104#pullrequestreview-435084817, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMBURKMR6P35EW3HVXDRX6AUBANCNFSM4N2PL35A .

-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com

Joxit commented 4 years ago

LGTM

pelias / parser

allow for bare #XXX unit patterns #104

Much better 👍