The list of rules should be sorted in alphabetical order
Actual result
The list of rules is sorted incorrectly.
The reason
In ./mailscanner/sa_rules_update.php there is a shell command that collects all the rule descriptions via grep and then sorts them. However, the files containing the rule descriptions have variable formats: a line may or may not have leading spaces/tabs.
Also, the delimiter between the keyword "describe" and the rule name can be a variable number of spaces/tabs.
Because it is a simple string sort over the whole line, these whitespace differences affect the ordering, so the rules don't end up sorted by rule name.
For example, here is a sample of the sorted output before it is converted to an HTML table:
describe T_ACH_CANCELLED_EXE "ACH cancelled" probable malware
describe T_MIME_MALF Malformed MIME: headers in body
describe __KAM_BODY_LENGTH_LT_1024 The length of the body of the email is less than 1024 bytes.
describe __KAM_BODY_LENGTH_LT_128 The length of the body of the email is less than 128 bytes.
describe __KAM_BODY_LENGTH_LT_256 The length of the body of the email is less than 256 bytes.
describe __KAM_BODY_LENGTH_LT_512 The length of the body of the email is less than 512 bytes.
describe FREEMAIL_FORGED_FROMDOMAIN 2nd level domains in From and EnvelopeFrom freemail headers are different
describe HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail domains are different
describe MIXED_ES Too many es are not es
describe T_GB_HASHBL_BTC Message contains BTC address found on BTCBL
describe HASHBL_EMAIL Message contains email address found on the EBL
describe HASHBL_EMAIL Message contains email address found on the EBL
describe PP_MIME_FAKE_ASCII_TEXT MIME text/plain claims to be ASCII but isn't
describe PP_TOO_MUCH_UNICODE02 Is text/plain but has many unicode escapes
describe PP_TOO_MUCH_UNICODE05 Is text/plain but has many unicode escapes
describe T_GB_FREEM_FROM_NOT_REPLY From: and Reply-To: have different freemail domains
describe USER_IN_WHITELIST DEPRECATED: See USER_IN_WELCOMELIST
describe USER_IN_WHITELIST_TO DEPRECATED: See USER_IN_WELCOMELIST_TO
describe T_LARGE_PCT_AFTER_MANY Many large percentages after...
describe URIBL_CSS_A Contains URL's A record listed in the Spamhaus CSS blocklist
describe URIBL_SBL_A Contains URL's A record listed in the Spamhaus SBL blocklist
describe TO_EQ_FM_DOM_SPF_FAIL To domain == From domain and external SPF failed
describe TO_EQ_FM_SPF_FAIL To == From and external SPF failed
describe T_FUZZY_OPTOUT Obfuscated opt-out text
describe FUZZY_ANDROID Obfuscated "android"
describe FUZZY_BITCOIN Obfuscated "Bitcoin"
describe FUZZY_BROWSER Obfuscated "browser"
describe FUZZY_BTC_WALLET Heavily obfuscated "bitcoin wallet"
describe FUZZY_CLICK_HERE Obfuscated "click here"
describe FUZZY_DR_OZ Obfuscated Doctor Oz
describe FUZZY_IMPORTANT Obfuscated "important"
describe FUZZY_PRIVACY Obfuscated "privacy"
describe FUZZY_PROMOTION Obfuscated "promotion"
describe FUZZY_SAVINGS Obfuscated "savings"
describe FUZZY_SECURITY Obfuscated "security"
describe FUZZY_UNSUBSCRIBE Obfuscated "unsubscribe"
describe FUZZY_WALLET Obfuscated "Wallet"
describe T_DOS_ZIP_HARDCORE hardcore.zip file attached; quite certainly a virus
describe CTYPE_NULL Malformed Content-Type header
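The effect in the sample above can be reproduced with a small sketch. The three input lines are made up for illustration, `sample_lines` and `naive_sort` are hypothetical helper names, and pinning `LC_ALL=C` is my assumption (the locale the real script runs under may differ, and other locales can weigh whitespace differently).

```shell
#!/bin/sh
# Sketch: why a plain string sort misorders "describe" lines.
# All names and data here are illustrative, not taken from MailWatch.

sample_lines() {
    # 1: single space separator, 2: tab separator, 3: leading spaces
    printf 'describe ALPHA_RULE first\n'
    printf 'describe\tZEBRA_RULE last\n'
    printf '  describe MIDDLE_RULE middle\n'
}

naive_sort() {
    # Byte-wise sort of the whole line, whitespace included.
    LC_ALL=C sort
}

# Print only the rule names, in the order the naive sort produces them.
sample_lines | naive_sort | awk '{print $2}'
```

With a byte-wise sort, the line with leading spaces comes first, and a tab separator sorts before a space separator, so the rule names come out as MIDDLE_RULE, ZEBRA_RULE, ALPHA_RULE rather than alphabetically.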
By using sed to strip the leading whitespace before the "describe" keyword, and then replacing the variable whitespace after the keyword with a single tab, the data can be sorted consistently:
describe ACCESSDB Message would have been caught by accessdb
describe ACT_NOW_CAPS Talks about 'acting now' with capitals
describe AC_BR_BONANZA Too many newlines in a row... spammy template
describe AC_DIV_BONANZA Too many divs in a row... spammy template
describe AC_FROM_MANY_DOTS Multiple periods in From user name
describe AC_HTML_NONSENSE_TAGS Many consecutive multi-letter HTML tags, likely nonsense/spam
describe AC_POST_EXTRAS Suspicious URL
describe AC_SPAMMY_URI_PATTERNS1 link combos match highly spammy template
describe AC_SPAMMY_URI_PATTERNS10 link combos match highly spammy template
describe AC_SPAMMY_URI_PATTERNS11 link combos match highly spammy template
describe AC_SPAMMY_URI_PATTERNS12 link combos match highly spammy template
describe AC_SPAMMY_URI_PATTERNS2 link combos match highly spammy template
describe AC_SPAMMY_URI_PATTERNS3 link combos match highly spammy template
describe AC_SPAMMY_URI_PATTERNS4 link combos match highly spammy template
describe AC_SPAMMY_URI_PATTERNS8 link combos match highly spammy template
describe AC_SPAMMY_URI_PATTERNS9 link combos match highly spammy template
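A minimal sketch of that sed normalisation, for anyone who wants to try it outside of MailWatch. This is not the exact command from my patch: `normalize_describe` is a hypothetical helper name, the input lines are made up, and `LC_ALL=C` is an assumption on my part (it happens to match the ordering shown above, where ACT_NOW_CAPS sorts before AC_BR_BONANZA).

```shell
#!/bin/sh
# Sketch: normalise "describe" lines so a plain sort orders them by rule name.

normalize_describe() {
    # Build a literal tab so this also works with sed versions lacking \t.
    tab=$(printf '\t')
    # 1st expression: drop leading spaces/tabs.
    # 2nd expression: collapse the whitespace after "describe" to one tab.
    sed -e 's/^[[:space:]]*//' \
        -e "s/^describe[[:space:]][[:space:]]*/describe${tab}/"
}

# Made-up input with inconsistent leading and separating whitespace.
printf '   describe   FOO_B second rule\ndescribe FOO_A first rule\n\tdescribe\t\tBAR third rule\n' |
    normalize_describe | LC_ALL=C sort
```

After normalisation every line starts with `describe`, one tab, then the rule name, so the sort is effectively decided by the rule name: BAR, FOO_A, FOO_B.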
Alternatively, you could strip out everything before the rule name, but you'd have to then alter the table generation code for the rule descriptions as the column numbers of the results would be reduced by 1.
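A rough sketch of that alternative, under the same assumptions as before (made-up input, `LC_ALL=C`, and a hypothetical helper name `strip_describe_prefix`). I have not wired this into the table generation code.

```shell
#!/bin/sh
# Sketch: drop everything before the rule name, then sort.

strip_describe_prefix() {
    # Remove optional leading whitespace, the "describe" keyword,
    # and the whitespace that follows it.
    sed -e 's/^[[:space:]]*describe[[:space:]][[:space:]]*//'
}

printf '  describe   FOO_B second rule\ndescribe\tFOO_A first rule\n' |
    strip_describe_prefix | LC_ALL=C sort
```

The rule name is now the first field on each line instead of the second, which is why anything that reads the columns afterwards would need adjusting.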
Issue summary
The rules list in Update SpamAssassin Rule Descriptions is not sorted correctly. https://github.com/mailwatch/MailWatch/blob/1.2/mailscanner/sa_rules_update.php
Steps to reproduce
Expected result
If we make the output consistent, we can get the sort order correct. This is how I fixed it for my installation https://github.com/mailwatch/MailWatch/blob/1.2/mailscanner/sa_rules_update.php:61
From:
To:
e.g. added
before the