lefcha / imapfilter

IMAP mail filtering utility
MIT License

Match subject base64 #127

Open amerlyq opened 8 years ago

amerlyq commented 8 years ago

At my work, the mail server encodes the whole subject in base64 if it contains at least one non-ASCII character. I am not the only one with this problem: if you search for "decode mail subject" you can find many other servers that do the same. So far I haven't found any way to make imapfilter decode the subject, which completely eliminates the usefulness of imapfilter for me. I can't simply replace match_subject with contain_subject, because the structure of the spoken language means I need regexes to match many word variations. Moreover, in almost all cases the subject is the only way to distinguish work spam from useful work messages, and urgent from pending, as I can't make that decision based on the To/From/etc. fields.

Would it be too much to ask for an appropriate piece of code to be added to imapfilter? If you are really tight on time to write and test it (as everyone is), please point me to the places in the code where I could start implementing it myself.

lefcha commented 8 years ago

IIUC, neither contain_subject() nor match_subject() works if the mail Subject is encoded this way? Are you trying to search using a word that is not encoded in base64?

I think it should not be that hard to add support for encoding/decoding strings, as the OpenSSL library that is required by imapfilter already has a C API for doing that. But first let's clarify what you want to do, and what works/doesn't work...

amerlyq commented 8 years ago

Let's clarify: contain_subject() always works, whether the subject is base64-encoded or not. It's match_subject() that doesn't work. Consider the following two Subject formats in my mailbox, which I can't match:

First:

=?UTF-8?B?0JrQvtC80LjRgdGB0LjRjyDQv9GA0Lgg0LzQtdC20LTRg9C90LDRgNC+?=
 =?UTF-8?B?0LTQvdGL0YUg0L/QtdGA0LXRh9C40LvQtdC90LjRj9GFINCh0J/QlA==?=
 =?utf-8?B?0J3Rg9C20L3QviDQv9C10YDQtdC00LDRgtGMINC/0L7RgdGL0LvQvtGH0Lo=?=
 =?utf-8?B?0YMg0LjQtyDQmtC40LXQstCwINCyINCc0L7RgdC60LLRgy4g0L/QvtC/0Ys=?=
 =?utf-8?B?0YLQutCwIOKEljI=?=

Second:

=?utf-8?Q?=D0=97_=D0=94=D0=BD=D0=B5=D0=BC_=D0=9D=D0=B0=D1=80=D0=BE=D0=B4=D0=B6=D0=B5=D0=BD=D0=BD=D1=8F=21?=
21 =?utf-8?Q?=D1=80=D1=96=D1=87=D0=BD=D0=B8=D1=86=D1=8F_?=Java!
 =?utf-8?Q?=D0=9E=D1=82=D1=87=D0=B5=D1=82_=D0=BF=D1=80=D0=BE_=D0=B8=D0=B3=D1=80=D1=83_?=19
 =?utf-8?Q?=D1=82=D1=83=D1=80=D0=B0_=D0=92=D1=82=D0=BE=D1=80=D0=BE=D0=B9_=D0=9B=D0=B8=D0=B3=D0=B8_=D0=9A=D0=90=D0=A4_
 =D0=9E=D1=82=D1=87=D0=B5=D1=82_=D0=BF=D1=80=D0=BE_=D0=B8=D0=B3=D1=80=D1=83_?=19
 =?utf-8?Q?=D1=82=D1=83=D1=80=D0=B0_=D0=92=D1=82=D0=BE=D1=80=D0=BE=D0=B9_=D0=9B=D0=B8=D0=B3=D0=B8_=D0=9A=D0=90=D0=A4_

One block = one subject. Some of them are split across multiple lines in the raw mail, although each is really a single line. It seems the B? and Q? markers denote different formats, without and with = symbols.
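The two formats above are RFC 2047 encoded-words: `B` marks base64 and `Q` marks a quoted-printable-like encoding in which `_` stands for a space. As a rough illustration of what decoding them involves, and not something imapfilter does itself, here is a minimal Python sketch using the standard library's `email.header` module. The helper name `decode_mime_header` is made up for this example, and the sample strings are short fragments taken from, or written in the style of, the subjects above.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a header value into plain text."""
    # decode_header() splits the value into (bytes_or_str, charset) chunks;
    # make_header() reassembles them, and str() yields the decoded text.
    return str(make_header(decode_header(raw)))


# B form (base64): the last line of the first example above.
b_form = "=?utf-8?B?0YLQutCwIOKEljI=?="

# Q form (quoted-printable style): shortened from the second example.
q_form = "=?utf-8?Q?=D0=97_=D0=94=D0=BD=D0=B5=D0=BC?="

# A folded header: two encoded-words split over two physical lines.
folded = "=?utf-8?B?0YLQutCw?=\n =?utf-8?B?IOKEljI=?="

for value in (b_form, q_form, folded):
    print(decode_mime_header(value))
```

Once a value has been decoded like this, an ordinary regular expression can be applied to the readable text, which is the step the thread reports match_subject() not performing on the encoded form.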

lefcha commented 8 years ago

I see, I'll have to look into this when I have some time, as it looks useful to be able to match such Subject header fields...

onoraba commented 8 years ago

A workaround using maildrop (http://www.courier-mta.org/maildrop/), which works with base64-encoded headers and message bodies.

maildrop configuration:

    $ cat ~/.mailfilter
    if ( /^Subject:.*(путевка|тунис|романтика)/ )
    {
        EXITCODE=5
        exit
    }
    else
    {
        EXITCODE=0
        exit
    }

Configuration test:

    $ cat ~/spam/test | maildrop ; echo $?
    5

Example imapfilter part:

    all = account1['mailbox']:match_to('(?i)all@')
    spam = Set {}

    -- Pipe each candidate message to maildrop; exit status 5 means "spam".
    for _, mesg in ipairs(all) do
        mbox, uid = table.unpack(mesg)
        text = mbox[uid]:fetch_message()
        mail_status = pipe_to('maildrop', text)
        if (mail_status == 5) then
            table.insert(spam, mesg)
        end
    end

    -- Keep only the messages that maildrop did not flag.
    all = all - spam

    spam:copy_messages(account1['spam'])
    spam:mark_deleted()
    spam = nil

    all:copy_messages(account1['mailbox2'])
    all:mark_deleted()
    all = nil

amerlyq commented 7 years ago

Also, it seems these formats conform to RFC 2047. So, even though using them in mail is supposedly prohibited now, they still show up often in the wild, for example in mail received from a misconfigured Outlook.

Cybolic commented 3 years ago

For what it's worth, I got around this by creating a match_utf8_field function that I call instead of match_field or match_subject.

I put it up here: https://paste.sr.ht/~cybolic/902986c795599f558165c63bcb65a3d4ae15881e

newhinton commented 1 year ago

This also affects the match_from method. It seems spam relies heavily on UTF-8 encoding to bypass "simple" filters, and imapfilter does not catch those either.

How would I decode the header before it is passed to match_from?
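One possible approach, in the spirit of the maildrop workaround earlier in the thread, is to decode the header outside imapfilter and report the result through an exit status that `pipe_to()` can check. The following is only a sketch under assumptions: the function names `header_matches` and `exit_code_for`, the example pattern, and the choice of exit status 5 (borrowed from the maildrop example) are all illustrative, and nothing here is part of imapfilter.

```python
import email
import re
from email.header import decode_header, make_header


def header_matches(raw_message: bytes, field: str, pattern: str) -> bool:
    """Return True if the decoded header `field` matches regex `pattern`."""
    msg = email.message_from_bytes(raw_message)
    raw_value = msg.get(field)
    if raw_value is None:
        return False
    # Decode RFC 2047 encoded-words (B and Q forms) into plain text.
    decoded = str(make_header(decode_header(str(raw_value))))
    return re.search(pattern, decoded, re.IGNORECASE) is not None


def exit_code_for(raw_message: bytes, pattern: str = r"лотерея") -> int:
    """Map a message to an exit status: 5 for a match, 0 otherwise.

    The default pattern is a hypothetical rule that flags messages whose
    decoded From header contains the word "лотерея".
    """
    return 5 if header_matches(raw_message, "From", pattern) else 0


# To use this as a filter script, finish the file with:
#     import sys
#     sys.exit(exit_code_for(sys.stdin.buffer.read()))
```

On the imapfilter side, such a script would be called the same way the maildrop example calls `maildrop`: pass the fetched message text to `pipe_to()` and compare the returned status with 5. Fetching every message can be slow on large mailboxes, so it is probably worth narrowing the candidate set first with a cheaper search.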