Support restricted words

umputun commented 5 years ago

The goal is to allow a list of filtered words to be defined and passed in. Attempt to post a comment with those words will be rejected.

amelnikov-mylivn commented 5 years ago

Some questions regarding this issue:

Should a list of words be passed as configuration or should we keep it in the database and provide full CRUD API for them?
Should we compare each word in a comment for an exact match or should we implement something more sophisticated? Should we support stopwords in formats like abc*, *bcd*, *xyz, a?cd?
Are there any future plans to use some advanced NLP techniques? Like stemming, POS tagging/filtering, language detection? Or we are keeping it simple?

The simplest solution looks like:

Pass stopwords as configuration.
When the server starts create an in-memory data structure for fast matching (hash set for the exact match, trie for matching with masks).
Pass comment text through white space tokenizer.
Check if any of tokens have a match in our prepared data structure.

umputun commented 5 years ago

simplest sounds good, just a few details:

this can be a part of store.service package because it will be used by create and edit func. Another possible place is rest_private, but service feels like a better choice for me.
I'd split implementation into two parts - Lister interface with some List(siteID string) (restricted []string, err error) and Matcher struct accepting comment and returns a bool. Obviously Lister will be injected into Matcher
it can be wired with Type in cmd.ServerCommand, similar to other option groups we already have
probably we can use even simpler setup without configuration file for now - just a list of params/env. We do it for admin ids

umputun commented 5 years ago

@amelnikov-mylivn is this something you planning to implement?

amelnikov-mylivn commented 5 years ago

@umputun Yes. I planned it for this weekend. Last weekend was surprisingly busy.

amelnikov-mylivn commented 5 years ago

I just want to check with you about interfaces and component layout before I go deeper. I created interfaces and naive implementations for some classes (I plan to add other implementations f.e. for Matcher). Can you please check this branch https://github.com/umputun/remark/compare/master...amelnikov-mylivn:feature-support-restricted-words ?

I have 4 interfaces right now:

Tokenizer - extracts words from raw text
Matcher - matches words from comments against restricted words
Lister - returns raw list of words for specified siteID
Checker - uses all above interfaces inside to check comment text against restricted words for specified siteID. Note matcherFactory (got better name? this is so javish). It exists because: a) creation of Matcher usually involves constructing some underlying datastructure and we do not want to create it each time we want to check a comment; b) there can be more than one implementation.

This is how we can construct the checker:

NewRestrictedWordsChecker(WhitespaceTokenizer{}, SimpleLister{restrictedWords}, func(words []string) Matcher {
    return NewSimpleMatcher(words)
})

If you have any suggestions or critics about code style I would also like to hear them.

umputun commented 5 years ago

A couple of things:

Reviewing a branch from another repo is kind of hard to do. I mean I see it visually, but VSCode can't work with this as nicely as with PR. So, next time just make PR pls, will be much easier to work with.
4 interfaces for something that simple (plus a factory) feels like overkill to me. I had in mind much easier implementation (matcher struct, and a single injected Lister). I would suggest reducing your abstraction to this minimal level and get rid of factory/constructor. I.e. matcher will be created as simple as service.Matcher{Lister: store.StaticLister{Words: opts.Restricted.Static.Words}

If someday we need to customize checker or tokenizer we could do it by introducing interface or functional argument, but for now, we have no such need

amelnikov-mylivn commented 5 years ago

got it
Basically you want to inline logic from current Tokenizer, Matcher and Checker into a single struct Matcher that will have just a single public method Match(siteID string, text string) bool?

umputun commented 5 years ago

yes

amelnikov-mylivn commented 5 years ago

This pull request https://github.com/umputun/remark/pull/257 is just a followup of our discussion - not a final version

amelnikov-mylivn commented 5 years ago

I think I finished simple implementation of this feature. Just to be on the same page - I wanted to try to add matching that allows to specify restricted words with masks like abc*, *bcd*, *xyz, a?cd. Is it something useful for the project or you think simple equals matching is good enough?

umputun commented 5 years ago

Thx again. I have reordered Match a little bit to eliminate potentially unnecessary tokenize call if the list empty or can't be loaded.

umputun / remark42

Support restricted words #237