rspamd / rspamd.com

rspamd.com website.
https://rspamd.com
Creative Commons Attribution Share Alike 4.0 International

Clarify Redis statistics configuration since 1.7 #319

hardware closed this issue 6 years ago

hardware commented 6 years ago

Hi,

Can you clarify the Redis statistics configuration and the changes needed since the 1.7 release? Based on this commit, is the following configuration valid?

# local.d/statistic.conf

classifier "bayes" {

  tokenizer {
    name = "osb";
  }

  backend = "redis";
  min_tokens = 11;
  min_learns = 10;
  autolearn = true;

  # Use new schema (1.7+)
  new_schema = true;
  # Enable per user statistics
  per_user = true;
  # Expire bayes tokens
  expire = 100d;
  # Store not only probabilities, but full tokens, false by default
  #store_tokens = true;
  # Store bayes signatures
  #signatures = true;

  statfile {
    symbol = "BAYES_HAM";
    spam = false;
  }
  statfile {
    symbol = "BAYES_SPAM";
    spam = true;
  }

  learn_condition =<<EOD
return function(task, is_spam, is_unlearn)
  local prob = task:get_mempool():get_variable('bayes_prob', 'double')

  if prob then
    local in_class = false
    local cl
    if is_spam then
      cl = 'spam'
      in_class = prob >= 0.95
    else
      cl = 'ham'
      in_class = prob <= 0.05
    end

    if in_class then
      return false,string.format('already in class %s; probability %.2f%%',
        cl, math.abs((prob - 0.5) * 200.0))
    end
  end

  return true
end
EOD
}

My configuration prior to rspamd 1.7 is available here.

moisseev commented 6 years ago

Yes, it looks valid, though min_learns = 10; seems too low. The only configuration changes you need to enable the new schema are:

new_schema = true;
expire = 100d;

but you need to convert the database as well. The rspamadm configwizard statistic command will do it for you.

Also you can shorten the configuration dramatically if you use local.d/classifier-bayes.conf instead of local.d/statistic.conf.

hardware commented 6 years ago

> Also you can shorten the configuration dramatically if you use local.d/classifier-bayes.conf instead of local.d/statistic.conf.

Yes, that's what I wanted to do yesterday. This should work:

# local.d/classifier-bayes.conf

cache {
  backend = "redis";
}

backend = "redis";
min_learns = 50;
autolearn = true;
new_schema = true;
per_user = true;
expire = 100d;

statfile {
  symbol = "BAYES_HAM";
  spam = false;
}
statfile {
  symbol = "BAYES_SPAM";
  spam = true;
}

> But you need to convert the database as well. The rspamadm configwizard statistic command will do it for you.

It does not work properly on my two mail servers; I always get this error:

rspamadm configwizard statistic

You have configured new schema for BAYES_SPAM/BAYES_HAM but your DB has old data
Do you wish to convert data to the new schema?[Y/n]: 
Expire time for new tokens  [default: 100d]: 
converted OK elements from symbol BAYES_SPAM
converted 42386 elements from symbol BAYES_HAM
error converting metadata for symbol BAYES_SPAM
Conversion failed
No changes found, the wizard is finished now

I will open another issue on the main rspamd repository if I cannot fix this problem. In the meantime, I am asking whether other people have the same problem in this issue: https://github.com/hardware/mailserver/issues/228

moisseev commented 6 years ago

I think this should be enough: local.d/classifier-bayes.conf

backend = "redis";
min_learns = 50;
autolearn = true;
new_schema = true;
per_user = true;
expire = 100d;

Why don't you want to use the default min_learns (200)? It seems quite sane.

It's suspicious: converted OK elements

hardware commented 6 years ago

Thank you :)

I think it might be useful to add this example in doc/configuration/statistic.md to clarify Redis statistics configuration with Rspamd 1.7.

> Why don't you want to use the default min_learns (200)? It seems quite sane.

Isn't the default min_learns too high with per_user enabled on small or medium-sized mail servers?

> It's suspicious: converted OK elements

Yeah, very strange. How can I debug this?

moisseev commented 6 years ago

> Isn't the default min_learns too high with per_user enabled on small or medium-sized mail servers?

IMO it's better to keep the classifier disabled while it is underlearned. If you need it working immediately, you can train it on spam corpora. Regarding per-user statistics, I'd consider using two classifiers: one per-user and one common.
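For illustration, the two-classifier idea could be sketched roughly as below. This is only a sketch under stated assumptions, not a tested configuration: it assumes that several classifier blocks may coexist in local.d/statistic.conf and that each can be told apart by a name option, and the classifier names and the *_USER symbols are hypothetical. Any new symbols would also need scores defined before they affect results. Only the Redis-related options already shown in this thread are reused.

```
# local.d/statistic.conf
# Sketch only: classifier names and *_USER symbols are hypothetical.

# Common classifier shared by all users
classifier "bayes" {
  name = "common_bayes";   # assumed option for distinguishing classifiers
  tokenizer {
    name = "osb";
  }
  backend = "redis";
  min_tokens = 11;
  min_learns = 200;        # default value mentioned in this thread
  new_schema = true;
  expire = 100d;

  statfile {
    symbol = "BAYES_HAM";
    spam = false;
  }
  statfile {
    symbol = "BAYES_SPAM";
    spam = true;
  }
}

# Per-user classifier with its own symbols
classifier "bayes" {
  name = "user_bayes";     # hypothetical name
  tokenizer {
    name = "osb";
  }
  backend = "redis";
  min_tokens = 11;
  min_learns = 50;         # lower threshold, as proposed for small servers
  new_schema = true;
  per_user = true;
  expire = 100d;

  statfile {
    symbol = "BAYES_HAM_USER";    # hypothetical symbol
    spam = false;
  }
  statfile {
    symbol = "BAYES_SPAM_USER";   # hypothetical symbol
    spam = true;
  }
}
```

With a layout like this, the common classifier can stay at the default min_learns while the per-user one starts contributing earlier for users who train it.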

> Yeah, very strange. How can I debug this?

Maybe you'll find something interesting in the Redis log? Or in the database itself? Ham elements have been converted, but spam elements have not. There should be some difference between them.

Do you have enough RAM, btw? For the conversion process you need about 3x more RAM than Redis uses for the old statistics.

vstakhov commented 6 years ago

DB should be equal to a string, not a number.