jo3-l / obscenity

Robust, extensible profanity filter for NodeJS
MIT License

bug: Kung Fu false positive #67

Closed: krasnoperov closed this issue 4 months ago

krasnoperov commented 4 months ago

Expected behavior

matcher.hasMatch('Kung-Fu') returns false

Actual behavior

matcher.hasMatch('Kung-Fu') returns true

Minimal reproducible example

import assert from 'node:assert'

import {
  englishDataset,
  englishRecommendedTransformers,
  RegExpMatcher,
} from 'obscenity'

const matcher = new RegExpMatcher({
  ...englishDataset.build(),
  ...englishRecommendedTransformers,
})

// These three assertions fail: hasMatch() returns true for each input
assert.equal(matcher.hasMatch('Kung-Fu'), false)
assert.equal(matcher.hasMatch('Kung Fu'), false)
assert.equal(matcher.hasMatch('Kung Fu Panda'), false)

// This assertion passes: with no separator, there is no word boundary before "Fu"
assert.equal(matcher.hasMatch('KungFu'), false)

Steps to reproduce

  1. Run the code above
  2. The first assertion fails with an AssertionError

Additional context

No response

Node.js version

v20.15.0

Obscenity version

0.2.1

jo3-l commented 4 months ago

The default dataset contains the pattern |fu|, which (correctly, but undesirably) matches on the -Fu in Kung-Fu. There are two potential ways we could fix this issue:

  1. Remove the |fu| pattern entirely, or
  2. Whitelist Kung-Fu and leave the |fu| pattern untouched.

I am leaning toward 2) at the moment: the |fu| pattern seems useful in general, and I cannot think of any egregious false positives other than the one you report. What do you think?
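To illustrate the boundary behavior described above, here is a minimal standalone sketch using a plain RegExp (not obscenity's actual matcher): a hyphen is a non-word character, so a word-boundary pattern like |fu| sees the Fu in Kung-Fu as a standalone word.

```javascript
// Minimal sketch of why a word-boundary pattern such as |fu|
// matches the "Fu" in "Kung-Fu": \b asserts a transition between
// a word character and a non-word character (or string edge).
const fuPattern = /\bfu\b/i

console.log(fuPattern.test('Kung-Fu')) // true: the hyphen creates a boundary
console.log(fuPattern.test('Kung Fu')) // true: so does a space
console.log(fuPattern.test('KungFu')) // false: no boundary inside a single word
```

This also explains why only the KungFu assertion in the repro passes: with no separator, there is no boundary before Fu.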

krasnoperov commented 4 months ago

I think that whitelisting Kung-Fu is a good option here. Also, it is possible to handle any future false positives by adding them to the whitelist as they arise.
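As a rough illustration of the whitelist approach, here is a self-contained sketch; the WHITELIST array and findFuMatches function are hypothetical helpers for this example, not obscenity's API. The idea is to discard any pattern match that falls entirely inside an occurrence of a whitelisted term:

```javascript
// Hypothetical whitelist of benign phrases containing "fu"
const WHITELIST = ['kung-fu', 'kung fu']

function findFuMatches(text) {
  const lower = text.toLowerCase()

  // Collect the character spans covered by whitelisted terms
  const spans = []
  for (const term of WHITELIST) {
    let i = lower.indexOf(term)
    while (i !== -1) {
      spans.push([i, i + term.length])
      i = lower.indexOf(term, i + 1)
    }
  }

  // Keep only pattern matches that lie outside every whitelisted span
  const matches = []
  for (const m of lower.matchAll(/\bfu\b/g)) {
    const whitelisted = spans.some(
      ([start, end]) => m.index >= start && m.index + m[0].length <= end,
    )
    if (!whitelisted) matches.push(m.index)
  }
  return matches
}

console.log(findFuMatches('Kung-Fu Panda')) // no matches: inside a whitelisted term
console.log(findFuMatches('fu bar')) // one match, at index 0
```

New false positives can then be handled by appending to the whitelist, as suggested, without touching the pattern itself.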

jo3-l commented 4 months ago

I released v0.3.1 with the fix (please ignore v0.2.2 and v0.3.0, both of which were problematic due to my botching some release automation—sorry for the noise!). Thanks again for the report.