pelias / parser

natural language classification engine for geocoding
https://parser.demo.geocode.earth
MIT License
55 stars 27 forks source link

performance: add circuit breakers in the ExclusiveCartesianSolver #157

Closed missinglink closed 2 years ago

missinglink commented 2 years ago

this simple PR adds two 'circuit breakers' within the ExclusiveCartesianSolver to mitigate the effects of long-running complex queries.

some (particularly long) queries can generate a large number of potential solutions, the main culprit is the ExclusiveCartesianSolver, which, as the name implies, produces a cross product of all non-overlapping classifications.

the problem is that in some degenerate cases (such as the example below) the sheer number of potential solutions can stall the CPU and therefore cascade performance issues on to any codebase which includes this module.

an obvious solution here is to add some 'circuit breakers' which detect if some of these arrays of values grow beyond a set threshold and prevents them from growing any larger.

the good news is that we can actually set these constants fairly high and still achieve our goals, the degenerate 'slow queries' are usually an order of magnitude slower, so simply providing any ceiling seems to be sufficient to prevent long running queries.

note: there is naturally a tradeoff here, we are gaining performance guarantees at the expense of discarding potentially superior solutions.

The way I've tuned this, all of the existing test cases pass (including the longer/complex examples).

So in a real-world scenario I suspect we will see no measurable difference in quality but with a significant reduction in worst-case performance, these constants may be adjusted/increased in the future if a need arises.

node bin/cli.js 'c mk v ba 1994 2000 323 c v ba 1994 2000 323 f iv bg 1987 1994 323 f mk iv bg 1987 1994 323 f mk v ba 1994 1998 323 f v ba 1994 1998 323 f vi bj 1998 2004 323 f/p mk vi bj 1998 2004 323 iii bf 1985 1991 323 1998 323 iii седан bf 1985 1991 323 iv bg 1989 1994 323 iv, седан'
# master
================================================================
SOLUTIONS (13962ms)
----------------------------------------------------------------
# this PR
================================================================
SOLUTIONS (112ms)
----------------------------------------------------------------
(0.10) ➜ [
  { street: 'v ba' },
  { housenumber: '1994' },
  { region: 'bg' },
  { country: 'mk' }
]
missinglink commented 2 years ago

added a third constant to limit the recursion depth, in testing I had to reduce this down to ~5 in order to get the tests to fail but it's nice to have regardless.

orangejulius commented 2 years ago

Cool, this looks great overall, and highly tunable.

I ran our full suite of tests against it and there are actually some changes in some tests. None in our core acceptance test suite, but in some of our other address test suites there are minor differences.

Amusingly, the changes are majority positive, but I think that's just coincidence. I figure the odds of excluding "correct" parses and "incorrect" is roughly even, or perhaps favorable to keeping better parses.

Most of the changes have to do with full address inputs, here are two examples:

Newly passing: /v1/autocomplete?text=29 East Arrowweed Way Washington UT 84780

Newly failing: /v1/autocomplete?text=90 Vista Place Mount Vernon NY 10550

Overall the impact is only about 0.1% of test results, so I don't think we have to worry about these. They generally only affect the final few characters of long autocomplete queries, meaning the difference in practice is basically zero.

Maybe someday we'll have to retune things, but not today 🤞