umputun / remark42

comment engine
https://remark42.com
MIT License
4.92k stars 386 forks source link

Support restricted words #237

Closed umputun closed 5 years ago

umputun commented 5 years ago

The goal is to allow a list of filtered words to be defined and passed in. Attempt to post a comment with those words will be rejected.

amelnikov-mylivn commented 5 years ago

Some questions regarding this issue:

  1. Should a list of words be passed as configuration or should we keep it in the database and provide full CRUD API for them?
  2. Should we compare each word in a comment for an exact match or should we implement something more sophisticated? Should we support stopwords in formats like abc*, *bcd*, *xyz, a?cd?
  3. Are there any future plans to use some advanced NLP techniques? Like stemming, POS tagging/filtering, language detection? Or we are keeping it simple?

The simplest solution looks like:

umputun commented 5 years ago

simplest sounds good, just a few details:

umputun commented 5 years ago

@amelnikov-mylivn is this something you planning to implement?

amelnikov-mylivn commented 5 years ago

@umputun Yes. I planned it for this weekend. Last weekend was surprisingly busy.

amelnikov-mylivn commented 5 years ago

I just want to check with you about interfaces and component layout before I go deeper. I created interfaces and naive implementations for some classes (I plan to add other implementations f.e. for Matcher). Can you please check this branch https://github.com/umputun/remark/compare/master...amelnikov-mylivn:feature-support-restricted-words ?

I have 4 interfaces right now:

  1. Tokenizer - extracts words from raw text
  2. Matcher - matches words from comments against restricted words
  3. Lister - returns raw list of words for specified siteID
  4. Checker - uses all above interfaces inside to check comment text against restricted words for specified siteID. Note matcherFactory (got better name? this is so javish). It exists because: a) creation of Matcher usually involves constructing some underlying datastructure and we do not want to create it each time we want to check a comment; b) there can be more than one implementation.

This is how we can construct the checker:

NewRestrictedWordsChecker(WhitespaceTokenizer{}, SimpleLister{restrictedWords}, func(words []string) Matcher {
    return NewSimpleMatcher(words)
})

If you have any suggestions or critics about code style I would also like to hear them.

umputun commented 5 years ago

A couple of things:

  1. Reviewing a branch from another repo is kind of hard to do. I mean I see it visually, but VSCode can't work with this as nicely as with PR. So, next time just make PR pls, will be much easier to work with.
  2. 4 interfaces for something that simple (plus a factory) feels like overkill to me. I had in mind much easier implementation (matcher struct, and a single injected Lister). I would suggest reducing your abstraction to this minimal level and get rid of factory/constructor. I.e. matcher will be created as simple as service.Matcher{Lister: store.StaticLister{Words: opts.Restricted.Static.Words}

If someday we need to customize checker or tokenizer we could do it by introducing interface or functional argument, but for now, we have no such need

amelnikov-mylivn commented 5 years ago
  1. got it
  2. Basically you want to inline logic from current Tokenizer, Matcher and Checker into a single struct Matcher that will have just a single public method Match(siteID string, text string) bool?
umputun commented 5 years ago

yes

amelnikov-mylivn commented 5 years ago

This pull request https://github.com/umputun/remark/pull/257 is just a followup of our discussion - not a final version

amelnikov-mylivn commented 5 years ago

I think I finished simple implementation of this feature. Just to be on the same page - I wanted to try to add matching that allows to specify restricted words with masks like abc*, *bcd*, *xyz, a?cd. Is it something useful for the project or you think simple equals matching is good enough?

umputun commented 5 years ago

Thx again. I have reordered Match a little bit to eliminate potentially unnecessary tokenize call if the list empty or can't be loaded.