jo3-l / obscenity

Robust, extensible profanity filter for NodeJS
MIT License

Bug: `collapseDuplicatesTransformer` does not collapse the last letter #77

Open rion18 opened 3 months ago

rion18 commented 3 months ago

Expected behavior

I am using obscenity to censor a string containing repeated characters, such as pppiiittt, with a dataset that contains the word pit.

The blacklist matcher is configured with this transformer:

collapseDuplicatesTransformer({
  defaultThreshold: 1,
}),

I would expect the whole pppiiittt word to be matched.

Actual behavior

Instead, only the first t is included in the match, so the match covers pppiiit. The final two t characters are treated as not part of the profanity, although they should be.
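To make the expected span concrete, here is a small standalone sketch in plain JavaScript. It does not use obscenity and is not how the library implements collapseDuplicatesTransformer. The helpers collapseRuns and findExpandedMatch are hypothetical names made up for this illustration, and the indices are inclusive. It collapses runs of repeated characters, finds the word in the collapsed text, and maps the match back to the original string so that the whole final run is covered.

```javascript
// Hypothetical helper (not part of obscenity): collapse runs of repeated
// characters and record which original indices each kept character covers.
function collapseRuns(input) {
  const chars = [];
  const spans = []; // spans[i] = [firstIndex, lastIndex] of the run behind chars[i]
  for (let i = 0; i < input.length; i++) {
    if (chars.length > 0 && chars[chars.length - 1] === input[i]) {
      spans[spans.length - 1][1] = i; // extend the current run
    } else {
      chars.push(input[i]);
      spans.push([i, i]);
    }
  }
  return { collapsed: chars.join(''), spans };
}

// Hypothetical helper: find `word` in the collapsed text and expand the
// match so it starts at the first run's start and ends at the last run's end.
function findExpandedMatch(input, word) {
  const { collapsed, spans } = collapseRuns(input);
  const at = collapsed.indexOf(word);
  if (at === -1) return null;
  return {
    startIndex: spans[at][0],
    endIndex: spans[at + word.length - 1][1],
  };
}

const expected = findExpandedMatch('pppiiittt', 'pit');
console.log(expected); // { startIndex: 0, endIndex: 8 }
```

In these terms, the behavior I expect is a match ending at index 8, while the match I actually get (pppiiit) ends at index 6.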

Minimal reproducible example

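const {
  englishDataset,
  parseRawPattern,
  DataSet,
  RegExpMatcher,
  TextCensor,
  collapseDuplicatesTransformer,
} = require('obscenity');

// English dataset plus a custom phrase for the word "pit".
const data = new DataSet()
  .addAll(englishDataset)
  .addPhrase((phrase) =>
    phrase
      .setMetadata({ originalWord: 'pit' })
      .addPattern(parseRawPattern('pit'))
  )
  .build();

// Collapse runs of repeated characters before matching blacklisted terms.
const transformers = {
  blacklistMatcherTransformers: [
    collapseDuplicatesTransformer({
      defaultThreshold: 1,
    }),
  ],
  whitelistMatcherTransformers: [],
};

const matcher = new RegExpMatcher({
  ...data, // blacklistedTerms and whitelistedTerms from the dataset built above
  ...transformers,
});

const textCensor = new TextCensor();

// Censor every match in the input, or return the input unchanged if there is none.
function censor(input) {
  if (matcher.hasMatch(input)) {
    const matches = matcher.getAllMatches(input, true);
    return textCensor.applyTo(input, matches);
  }
  return input;
}

const stringPit = 'pppiiittt';
console.log(censor(stringPit));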

Steps to reproduce

No response

Additional context

No response

Node.js version

18.17.1

Obscenity version

0.4.0
