Closed blackmad closed 4 years ago
Hey @Joxit - any thoughts on this change?
Good point. Let me try again with regexing in the parser on all the existing unit words and breaking them out correctly.
If I took out the PhX pattern, would you take this change for the #UU unit patterns?
related, any good idea about how to deal with #A-2 which is currently split into the parser into two different tokens?
On Sun, Jun 21, 2020 at 7:15 PM Jones Magloire notifications@github.com wrote:
@Joxit requested changes on this pull request.
Penthouse alone is classified as a unit_type https://github.com/pelias/parser/blob/4667240354a55e9b8242bfaec2f26be7b0075447/resources/libpostal/dictionaries/en/unit_types_numbered.txt#L36
IMO, this change is not a solution but only a workaround for a use case... This is a way to much specific for penthouse, the right classifier should match every unit_type and not only for ph.
Plus this is wrong, ph1 should not be a unit, but a unit_type + unit
IDK how we can achieve that, if we should use Regex or update tokenizer.... 🤷♂️
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pelias/parser/pull/104#pullrequestreview-434540661, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMAE5OIC2AWDDIOJKHDRX2IBDANCNFSM4N2PL35A .
-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com
I removed the PhX part of this change, so now it's just #UUUU, let me know what you think of this
IDK if we can achieve the Penthouse issue with regex and in this classifier... IMO this needs a tokenizer update, but this may be more complex...
#A-2
will produce 3 tokens:
#A-2 ➜ alphanumeric 1.00
#A ➜ alpha 1.00 unit 1.00
2 ➜ numeric 1.00 housenumber 1.00
That means #A-2
is still present and can be classified
thanks, fixed
On Mon, Jun 22, 2020 at 12:24 PM Jones Magloire notifications@github.com wrote:
@Joxit approved this pull request.
Much better 👍
In classifier/UnitClassifier.js https://github.com/pelias/parser/pull/104#discussion_r443680051:
// If the previous token in a unit word, like apt or suite // and this token is something like A2, 3b, 120, A, label it as a unit (number)
if (hasPrevUnitToken && combinedUnitRegexp.test(span.body)) {
if (hasPrevUnitTypeToken && maybeUnitToken) {
span.classify(new UnitClassification(1))
}
// A unit token that starts with '#' is always a unit if it's not the first token
A token that starts with '#"*... ?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pelias/parser/pull/104#pullrequestreview-435084817, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMBURKMR6P35EW3HVXDRX6AUBANCNFSM4N2PL35A .
-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com
LGTM
We saw a bunch of queries like this in our incoming data and I wanted to fix them in parser