mailwatch / MailWatch

MailWatch for MailScanner is a web-based front-end to MailScanner
http://mailwatch.org/
GNU General Public License v2.0

The "Update SpamAssassin Rule Descriptions" doesn't properly show the sorted list of rules/descriptions after updating the ruleset. #1195

Closed pdwalker closed 3 years ago

pdwalker commented 4 years ago

Issue summary

The rules list in "Update SpamAssassin Rule Descriptions" is not sorted correctly. https://github.com/mailwatch/MailWatch/blob/1.2/mailscanner/sa_rules_update.php

Steps to reproduce

  1. go to "tools and links"
  2. select "Update SpamAssassin Rule Descriptions"
  3. click on "run now"

Expected result

The list of rules should be sorted in alphabetical order

Actual result

The list of rules is sorted incorrectly.

The reason

In ./mailscanner/sa_rules_update.php there is a shell command that collects all the rule descriptions via grep and then sorts them. However, the files containing the rule descriptions have variable formats: a line may or may not start with leading spaces/tabs.

Also, the whitespace between the keyword "describe" and the rule name can be a variable number of spaces/tabs.

As it is a simple string sort over the whole line, that whitespace takes part in the comparison, so the rules don't end up sorted by rule name.

For example, here is a sample of the sorted output before it is converted to an html table:

        describe     T_ACH_CANCELLED_EXE   "ACH cancelled" probable malware
        describe    T_MIME_MALF        Malformed MIME: headers in body
        describe        __KAM_BODY_LENGTH_LT_1024       The length of the body of the email is less than 1024 bytes.
        describe        __KAM_BODY_LENGTH_LT_128        The length of the body of the email is less than 128 bytes.
        describe        __KAM_BODY_LENGTH_LT_256        The length of the body of the email is less than 256 bytes.
        describe        __KAM_BODY_LENGTH_LT_512        The length of the body of the email is less than 512 bytes.
      describe FREEMAIL_FORGED_FROMDOMAIN 2nd level domains in From and EnvelopeFrom freemail headers are different
      describe HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail domains are different
    describe        MIXED_ES Too many es are not es
    describe      T_GB_HASHBL_BTC Message contains BTC address found on BTCBL
    describe HASHBL_EMAIL       Message contains email address found on the EBL
    describe HASHBL_EMAIL       Message contains email address found on the EBL
    describe PP_MIME_FAKE_ASCII_TEXT  MIME text/plain claims to be ASCII but isn't
    describe PP_TOO_MUCH_UNICODE02      Is text/plain but has many unicode escapes
    describe PP_TOO_MUCH_UNICODE05      Is text/plain but has many unicode escapes
    describe T_GB_FREEM_FROM_NOT_REPLY    From: and Reply-To: have different freemail domains
    describe USER_IN_WHITELIST            DEPRECATED: See USER_IN_WELCOMELIST
    describe USER_IN_WHITELIST_TO         DEPRECATED: See USER_IN_WELCOMELIST_TO
   describe    T_LARGE_PCT_AFTER_MANY   Many large percentages after...
  describe        URIBL_CSS_A      Contains URL's A record listed in the Spamhaus CSS blocklist
  describe        URIBL_SBL_A      Contains URL's A record listed in the Spamhaus SBL blocklist
  describe       TO_EQ_FM_DOM_SPF_FAIL    To domain == From domain and external SPF failed
  describe       TO_EQ_FM_SPF_FAIL    To == From and external SPF failed
  describe       T_FUZZY_OPTOUT             Obfuscated opt-out text
  describe      FUZZY_ANDROID       Obfuscated "android"
  describe      FUZZY_BITCOIN       Obfuscated "Bitcoin"
  describe      FUZZY_BROWSER       Obfuscated "browser"
  describe      FUZZY_BTC_WALLET    Heavily obfuscated "bitcoin wallet"
  describe      FUZZY_CLICK_HERE    Obfuscated "click here"
  describe      FUZZY_DR_OZ         Obfuscated Doctor Oz
  describe      FUZZY_IMPORTANT     Obfuscated "important"
  describe      FUZZY_PRIVACY       Obfuscated "privacy"
  describe      FUZZY_PROMOTION     Obfuscated "promotion"
  describe      FUZZY_SAVINGS       Obfuscated "savings"
  describe      FUZZY_SECURITY      Obfuscated "security"
  describe      FUZZY_UNSUBSCRIBE   Obfuscated "unsubscribe"
  describe      FUZZY_WALLET        Obfuscated "Wallet"
  describe      T_DOS_ZIP_HARDCORE        hardcore.zip file attached; quite certainly a virus
  describe     CTYPE_NULL          Malformed Content-Type header
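The effect can be reproduced outside MailWatch with a small sketch. The three rule lines and the `sample_rules` helper below are made up for illustration, and `LC_ALL=C` is set only to make the comparison a plain byte-wise one; the locale on a real installation may differ.

```shell
# Three hypothetical rule lines with different amounts of leading whitespace.
sample_rules() {
    printf '%s\n' \
        '  describe   ZZZ_RULE  last alphabetically' \
        'describe AAA_RULE  first alphabetically' \
        '    describe MMM_RULE  middle alphabetically'
}

# With a byte-wise sort, a leading space (0x20) sorts ahead of the letter "d",
# so the amount of indentation decides the order, not the rule name.
sample_rules | LC_ALL=C sort
```

The line with four leading spaces (MMM_RULE) comes out first, then the one with two (ZZZ_RULE), and the unindented AAA_RULE line comes last, which mirrors the grouping by indentation seen in the sample above.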

If we make the output consistent, we can get the sort order correct. This is how I fixed it for my installation https://github.com/mailwatch/MailWatch/blob/1.2/mailscanner/sa_rules_update.php:61

From:

        $fh = popen(
            "grep -hr '^\s*describe' " . SA_RULES_DIR . ' /usr/share/spamassassin /usr/local/share/spamassassin ' . SA_PREFS . ' /etc/MailScanner/spam.assassin.prefs.conf /opt/MailScanner/etc/spam.assassin.prefs.conf /usr/local/etc/mail/spamassassin /etc/mail/spamassassin /var/lib/spamassassin 2>/dev/null | sort | uniq',
            'r'
        );

To:

        $fh = popen(
            "grep -hr '^\s*describe' " . SA_RULES_DIR . ' /usr/share/spamassassin /usr/local/share/spamassassin ' . SA_PREFS . ' /etc/MailScanner/spam.assassin.prefs.conf /opt/MailScanner/etc/spam.assassin.prefs.conf /usr/local/etc/mail/spamassassin /etc/mail/spamassassin /var/lib/spamassassin 2>/dev/null | sed -e \'s/^[ \t]*describe[ \t]*/describe\t/i\' | sort | uniq',
            'r'
        );

i.e. I added

| sed -e \'s/^[ \t]*describe[ \t]*/describe\t/i\'

before the

| sort

By using sed to strip the leading whitespace before the "describe" keyword and to replace the variable whitespace after it with a single tab, we can now sort the data consistently:

describe        ACCESSDB Message would have been caught by accessdb
describe        ACT_NOW_CAPS            Talks about 'acting now' with capitals
describe        AC_BR_BONANZA  Too many newlines in a row... spammy template
describe        AC_DIV_BONANZA Too many divs in a row... spammy template
describe        AC_FROM_MANY_DOTS           Multiple periods in From user name
describe        AC_HTML_NONSENSE_TAGS   Many consecutive multi-letter HTML tags, likely nonsense/spam
describe        AC_POST_EXTRAS              Suspicious URL
describe        AC_SPAMMY_URI_PATTERNS1 link combos match highly spammy template
describe        AC_SPAMMY_URI_PATTERNS10 link combos match highly spammy template
describe        AC_SPAMMY_URI_PATTERNS11 link combos match highly spammy template
describe        AC_SPAMMY_URI_PATTERNS12 link combos match highly spammy template
describe        AC_SPAMMY_URI_PATTERNS2 link combos match highly spammy template
describe        AC_SPAMMY_URI_PATTERNS3 link combos match highly spammy template
describe        AC_SPAMMY_URI_PATTERNS4 link combos match highly spammy template
describe        AC_SPAMMY_URI_PATTERNS8 link combos match highly spammy template
describe        AC_SPAMMY_URI_PATTERNS9 link combos match highly spammy template
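As a standalone illustration of the same normalisation, here is a minimal sketch that can be run in a shell. The `normalize_describe` helper name and the three input lines are hypothetical. It uses `[[:space:]]` and a tab taken from `printf` instead of `\t`, on the assumption that this behaves the same on sed implementations that don't understand `\t`; it also omits the `i` flag, since the grep pattern in front of it is already case-sensitive.

```shell
# Hypothetical helper: strip whitespace before "describe" and replace the
# whitespace after it with exactly one tab.
normalize_describe() {
    tab=$(printf '\t')
    sed -e "s/^[[:space:]]*describe[[:space:]]*/describe${tab}/"
}

# Made-up input lines with inconsistent whitespace.
printf '%s\n' \
    '  describe   ZZZ_RULE  last alphabetically' \
    'describe AAA_RULE  first alphabetically' \
    '    describe MMM_RULE  middle alphabetically' \
    | normalize_describe | LC_ALL=C sort | uniq
```

Once every line starts with the same `describe<TAB>` prefix, the first differing character is in the rule name, so the lines come out in the order AAA_RULE, MMM_RULE, ZZZ_RULE.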

Alternatively, you could strip out everything before the rule name, but you'd have to then alter the table generation code for the rule descriptions as the column numbers of the results would be reduced by 1.
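A rough sketch of that alternative, for comparison. The `strip_describe` helper name and the input lines are hypothetical, and this is not what I run on my installation; the table generation code would still need the column adjustment described above.

```shell
# Hypothetical variant: remove the keyword and its surrounding whitespace,
# so every line starts directly with the rule name.
strip_describe() {
    sed -e 's/^[[:space:]]*describe[[:space:]]*//'
}

# Made-up input lines with inconsistent whitespace.
printf '%s\n' \
    '  describe   ZZZ_RULE  last alphabetically' \
    'describe AAA_RULE  first alphabetically' \
    '    describe MMM_RULE  middle alphabetically' \
    | strip_describe | LC_ALL=C sort | uniq
```

Here the rule name is the first field of each line instead of the second, which is why the downstream parsing would have to change.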

pdwalker commented 4 years ago

minor issue, cosmetic.