opnsense / plugins

OPNsense plugin collection
https://opnsense.org/
BSD 2-Clause "Simplified" License
842 stars 638 forks source link

OPNsense 19.7.7. / Rspamd: IPv6 IO write error - duplicated mails #1591

Closed shadowwaking closed 4 years ago

shadowwaking commented 4 years ago

After the upgrade from OPNsense 19.7.6 to 19.7.7, Rspamd, on my side, stoped working correctly. Mails from amazon and ebay are delivered twice or more often.

I think the problem is a timeout in postfix, because of a long delay in rspamd (you can see a 1 minute delay between 19:17:10 and 19:18:25 in rspamd.log)

I have removed and reinstalled rspamd -> same result. I also setup a clean OPNsense 19.7.7. -> same result. Go back to OPNsense 19.7.6. -> no IO errors in rspamd log, no errors in rspamd WebUI and no duplicated mails

In OPNsense changelog, no rspamd changes are listed. So i dont know where exactly the problem is. It looks like a IPV6 communication error. But other mails are proceeded just fine.

Rspamd.log:

2019-11-23 19:17:10 #21813(rspamd_proxy) <684cce>; proxy; proxy_accept_socket: accepted milter connection from 127.0.0.1 port 31382 2019-11-23 19:17:10 #21813(rspamd_proxy) <684cce>; milter; rspamd_milter_process_command: got connection from xx.xx.xx.xx:44620 2019-11-23 19:18:25 #21813(rspamd_proxy) <684cce>; proxy; proxy_backend_master_error_handler: abnormally closing connection from backend: ::1, error: IO write error: Broken pipe, retries left: 5 2019-11-23 19:18:25 #21813(rspamd_proxy) <684cce>; proxy; proxy_backend_master_error_handler: retry connection to: 127.0.0.1 retries left: 4 2019-11-23 19:18:25 #31636(normal) <406b83>; task; accept_socket: accepted connection from 127.0.0.1 port 55623, task ptr: 00000504B837F028

In Rspamd WebUI there are the following errors after restart:

hs_helper - error syncing log file: Bad file descriptor normal - error syncing log file: Bad file descriptor controller - error syncing log file: Bad file descriptor rspamd_proxy - error syncing log file: Bad file descriptor

I hope, anyone else can confirm the issue.

Thank you.

fabianfrz commented 4 years ago

@fichtner is there a newer version of the port in the tree?

mimugmail commented 4 years ago

It was updated to 2.1 from 1.9.4, there are small breaking .. I have to check tomorrow

mimugmail commented 4 years ago

Hm, I can't reproduce here. Is it only amazon/ebay? Do you know a better way to reproduce? Maybe you see more logs from postfix?

shadowwaking commented 4 years ago

:( No, the error is not just on mails from amazon or ebay. I can see the IO error on many mails. Just amazon and ebay are the one that have a problem with the timeout and are sending the mails often.

The error comes very, very often but not on every mail. This could be the thing, with round robin behaviour. Sometimes it uses the IPv4 localhost first (that works) and sometimes the IPv6 (that fails).

The error set on failure is always the same:

Bad case:

2019-11-23 21:03:44 #21813(rspamd_proxy) <377a74>; proxy; proxy_accept_socket: accepted milter connection from 127.0.0.1 port 18820 2019-11-23 21:03:45 #21813(rspamd_proxy) <377a74>; milter; rspamd_milter_process_command: got connection from xx.xx.xx.xx:27271 2019-11-23 21:05:00 #21813(rspamd_proxy) <377a74>; proxy; proxy_backend_master_error_handler: abnormally closing connection from backend: ::1, error: IO write error: Broken pipe, retries left: 5 2019-11-23 21:05:00 #21813(rspamd_proxy) <377a74>; proxy; proxy_backend_master_error_handler: retry connection to: 127.0.0.1 retries left: 4 2019-11-23 21:05:00 #31636(normal) ; task; accept_socket: accepted connection from 127.0.0.1 port 2545, task ptr: 00000504B837F028

Good case:

2019-11-23 16:05:03 #87961(rspamd_proxy) ; proxy; proxy_accept_socket: accepted milter connection from 127.0.0.1 port 54921 2019-11-23 16:05:03 #87961(rspamd_proxy) ; milter; rspamd_milter_process_command: got connection from xx.xx.xx.xx:45569 2019-11-23 16:05:03 #198(normal) <2d6d1f>; task; accept_socket: accepted connection from 127.0.0.1 port 49429, task ptr: 0000044DDD5D1028

with OPNsense 17.7.6

it always looks so:

2019-11-20 10:33:50 #70227(rspamd_proxy) <33a1e8>; proxy; proxy_milter_finish_handler: finished milter connection 2019-11-20 10:34:09 #70227(rspamd_proxy) <098576>; proxy; proxy_accept_socket: accepted milter connection from 127.0.0.1 port 25061 2019-11-20 10:34:09 #70227(rspamd_proxy) <098576>; milter; rspamd_milter_process_command: got connection from xx.xx.xx.xx:25614 2019-11-20 10:34:10 #7700(normal) <030806>; task; accept_socket: accepted connection from 127.0.0.1 port 9764, task ptr: 000001FEC4DBCF00

I already deleted the 17.7.7 snapshot, so i dont get the postfix logs, anymore. But i looked in and cant see anything interesting there.

mimugmail commented 4 years ago

Can you open an issue at rspamd repo with the Error linking here and the information that you updated from 1.9.4 to 2.1

shadowwaking commented 4 years ago

Hi there,

it looks like that opnsense 19.7.7 sometimes resolves "localhost" to ipv4 and sometimes to ipv6. I have set the option "Prefer to use IPv4 even if IPv6 is available" but is it possible that this is broken or changed in 19.7.7 ?

Thank you.

mimugmail commented 4 years ago

The best way to check if this change was introduced by opnsense or rspamd is to upgrade to 19.7.7 and only downgrade rspamd to the state of 19.7.6. If the error persits, it's some change in opnsense, if it's gone, then something within rspamd. Vial CLI after update:

opnsense-revert -r 19.7.6 rspamd

shadowwaking commented 4 years ago

Ah, perfect. Thank you for this information. I will try this.

shadowwaking commented 4 years ago

Hi there,

I upgraded OPNsense to 19.7.7 (rspamd 2.1) and collected logs. In the first 1 hour after the upgrade everything runs fine. No IO write errors. But after 1 hour, i get the known errors. (abnormally closing connection from backend: ::1, error: IO write error: Broken pipe) and the delays. So i downgraded rspamd to 1.9.4. That was yesterday evening. And since there, no more errors. So it is very likely that rspamd is the problematic component. But the guys in the rpamd tree think, that is a ipv6 problem of opnsense. What indeed sounds also logical to me -> sometimes localhost resolves to ipv4 and sometimes to ipv6. But then, why cant rspamd handle both protocols correctly ?

Whatever, i think in this case, no one will find a solution, at the moment. I will stay at 19.7.7 with rspamd 1.9.4 and wait for the next updates. Eventually will it be fixed in a future update.

Thank you very much for your help.

mimugmail commented 4 years ago

Can you post the output of:

sockstat -6 | grep rspamd

please ...

shadowwaking commented 4 years ago

sure...

19.7.7 with rspamd 1.9 .# sockstat -6 | grep rspamd rspamd rspamd-1.9 14942 58 tcp6 ::1:4397 ::1:6379 rspamd rspamd-1.9 14942 59 tcp6 ::1:4129 ::1:6379 rspamd rspamd-1.9 11803 56 tcp6 ::1:47447 ::1:6379 rspamd rspamd-1.9 11803 57 tcp6 ::1:42254 ::1:6379 rspamd rspamd-1.9 99182 41 tcp6 ::1:17602 ::1:6379 rspamd rspamd-1.9 99182 42 tcp6 ::1:13396 ::1:6379

19.7.7 with rspamd 2.1 .# sockstat -6 | grep rspamd rspamd rspamd-2.1 33182 50 tcp6 ::1:8882 ::1:6379 rspamd rspamd-2.1 36635 48 tcp6 ::1:61024 ::1:6379 rspamd rspamd-2.1 36635 49 tcp6 ::1:57916 ::1:6379 rspamd rspamd-2.1 36635 50 tcp6 ::1:23628 ::1:6379 rspamd rspamd-2.1 7899 44 tcp6 ::1:12203 ::1:6379

mimugmail commented 4 years ago

Strange ... essential services are not running on v6 compared to mine:

root@OPNsense:~ # sockstat -6 | grep rspamd
rspamd   rspamd-2.1 27592 11 tcp6   ::1:11332             *:*
rspamd   rspamd-2.1 27592 15 tcp6   ::1:11334             *:*
rspamd   rspamd-2.1 27592 19 tcp6   ::1:11333             *:*
rspamd   rspamd-2.1 99697 4  tcp6   ::1:10174             ::1:6379
rspamd   rspamd-2.1 99697 11 tcp6   ::1:11332             *:*
rspamd   rspamd-2.1 99697 15 tcp6   ::1:11334             *:*
rspamd   rspamd-2.1 99697 19 tcp6   ::1:11333             *:*
rspamd   rspamd-2.1 57219 4  tcp6   ::1:26987             ::1:6379
rspamd   rspamd-2.1 57219 11 tcp6   ::1:11332             *:*
rspamd   rspamd-2.1 57219 15 tcp6   ::1:11334             *:*
rspamd   rspamd-2.1 57219 19 tcp6   ::1:11333             *:*
rspamd   rspamd-2.1 54599 11 tcp6   ::1:11332             *:*
rspamd   rspamd-2.1 54599 15 tcp6   ::1:11334             *:*
rspamd   rspamd-2.1 54599 34 tcp6   ::1:10106             ::1:6379
rspamd   rspamd-2.1 54599 35 tcp6   ::1:21795             ::1:6379
rspamd   rspamd-2.1 54599 36 tcp6   ::1:38055             ::1:6379
rspamd   rspamd-2.1 54599 37 tcp6   ::1:55401             ::1:6379
rspamd   rspamd-2.1 11222 11 tcp6   ::1:11332             *:*
rspamd   rspamd-2.1 4107  11 tcp6   ::1:11332             *:*
rspamd   rspamd-2.1 4107  15 tcp6   ::1:11334             *:*
rspamd   rspamd-2.1 4107  19 tcp6   ::1:11333             *:*
shadowwaking commented 4 years ago

ok, that would explain something. But why is 1.9 working ?

For testing proposes i: deactivated System - Settings - General - " Prefer to use IPv4 even if IPv6 is available " and activated Firewall - Settings - Advanced - " Allow IPv6" restarted rspamd But nothing changes.

shadowwaking commented 4 years ago

That looks also not so good to me:

~ # sockstat -6

? ? ? ? tcp6 ::1:15755 ::1:6379 ? ? ? ? tcp6 ::1:62847 ::1:6379 ? ? ? ? tcp6 ::1:50730 ::1:6379 ? ? ? ? tcp6 ::1:62984 ::1:6379 ? ? ? ? tcp6 ::1:21690 ::1:6379 ? ? ? ? tcp6 ::1:47535 ::1:6379 ? ? ? ? tcp6 ::1:33395 ::1:6379 ? ? ? ? tcp6 ::1:49346 ::1:6379 ? ? ? ? tcp6 ::1:18833 ::1:6379

mimugmail commented 4 years ago

6379 is just the redis db. I have to check why your instance doesnt bind to v6

shadowwaking commented 4 years ago

Ok, Thank you. Please let me know, if i can do anything...

mimugmail commented 4 years ago

rspamd bind to localhost and postfix connects to localhost with a force of version 6. I'll add a dropdown to postfix antispam tab to choose the version (default to 6) so you can change it.

shadowwaking commented 4 years ago

Perfect. Thank you very much.

fhloston commented 4 years ago

I am still seeing this:

2020-08-13 09:16:34 #24765(rspamd_proxy) <3ef7fc>; proxy; proxy_backend_master_error_handler: retry connection to: ::1 retries left: 2 2020-08-13 09:17:49 #24765(rspamd_proxy) <3ef7fc>; proxy; proxy_backend_master_error_handler: abnormally closing connection from backend: [::1]:11333, error: IO write error: Broken pipe, retries left: 1 2020-08-13 09:17:49 #24765(rspamd_proxy) <3ef7fc>; proxy; proxy_backend_master_error_handler: retry connection to: 127.0.0.1 retries left: 1

This does happen only after some runtime. This only seems to happen to mime encoded mails. Plain seems to be unaffected.

I tried switching to milter protocol v4 but then rspamd complains about obsolete protocol 4....

mimugmail commented 4 years ago

Complains or does not work?

fhloston commented 4 years ago

Mails have several minutes delay, sometimes do not go through.

Qualifies as does not work for me, even more so as these mails contain OTPs for people to log on interactively.

derjohn commented 4 years ago

+1 here . We need rspamd for DKIM (because there is to opendkim in OPNsense).

mimugmail commented 4 years ago

You +1 and second part is misleading? Doesnt it work in general or so you wish dkim?

derjohn commented 4 years ago

Oh, I meant we just use rspamd for DKIM signature outbound (we don't use it inbound). I +1 fhloston ... Maybe the bug itself is related to dkim crypto (i.e. rspamd works for those who don'T use dkim?)

shadowwaking commented 4 years ago

Hi guys,

after days of troubleshooting i found out the the issue is related to DNS Resolution. (so DKIM is a good point) Ive fixed it by putting in my DNS Server in "/usr/local/etc/rspamd/options.inc". Sure, this is not upgrade persistent but i can live with that.

Bye

serial-bug

derjohn commented 4 years ago

@serial-bug We will try that too, using an external DNS like 1.1.1.1 ?

shadowwaking commented 4 years ago

If you use a local DNS Resolver like unbound or dnscrypt on your OPNsense, you can point directly to this. (eg. 127.0.0.1:53 or 127.0.0.1:5353). But a external DNS server schould work too.

mimugmail commented 4 years ago

Can you post your complete edit to the file? Then I can add it to GUI

shadowwaking commented 4 years ago

Sure. In the "Default Settings" Section ive added my local DNS (nameserver = ["xxxxx:xx"];).

# Default settings
dns_max_requests = 64;
max_lua_urls = 1024;
max_urls = 10240;

dns {
    nameserver = ["127.0.0.1:5353"];
    timeout = 1s;
    sockets = 16;
    retransmits = 5;
}
fhloston commented 4 years ago

I added the local unbound to the rspamd config: nameserver = ["127.0.0.1:53"];

I then validated in rspamadm configdump that it is active.

Sadly, this makes no difference, still 1:15 delay for the OTP mail:

2020-08-15 12:20:02 #9772(rspamd_proxy) <c28c18>; proxy; proxy_accept_socket: accepted milter connection from 127.0.0.1 port 60086 2020-08-15 12:20:02 #9772(rspamd_proxy) <c28c18>; milter; rspamd_milter_process_command: got connection from x.x.x.x:50383 2020-08-15 12:21:17 #9772(rspamd_proxy) <c28c18>; proxy; proxy_backend_master_error_handler: abnormally closing connection from backend: [::1]:11333, error: IO write error: Broken pipe, retries left: 4 2020-08-15 12:21:17 #9772(rspamd_proxy) <c28c18>; proxy; proxy_backend_master_error_handler: retry connection to: 127.0.0.1 retries left: 4 `

derjohn commented 4 years ago

Ehm, there would be a difference between pointing to 127.0.0.1:x and "use default" (which would probably end up in /etc/resolv) ?

derjohn commented 4 years ago

@fhloston Is :53 a process listening oder a NAT rule pointing to the real dns?

fhloston commented 4 years ago

Also using 1.1.1.1 makes no difference.

Since this mail relay is only used to relay these OTP messages I have whitelisted the IP of the sending machine in all other checks. Rspamd is only used to add the DKIM signature to the header.

2020-08-15 12:30:20 #9772(rspamd_proxy) <4a5eb1>; proxy; proxy_accept_socket: accepted milter connection from 127.0.0.1 port 30306 2020-08-15 12:30:20 #9772(rspamd_proxy) <4a5eb1>; milter; rspamd_milter_process_command: got connection from x.x.x.x:50387 2020-08-15 12:31:35 #9772(rspamd_proxy) <4a5eb1>; proxy; proxy_backend_master_error_handler: abnormally closing connection from backend: [::1]:11333, error: IO write error: Broken pipe, retries left: 4 2020-08-15 12:31:35 #9772(rspamd_proxy) <4a5eb1>; proxy; proxy_backend_master_error_handler: retry connection to: 127.0.0.1 retries left: 4 2020-08-15 12:31:35 #40530(normal) <c6db7e>; task; rspamd_worker_body_handler: accepted connection from 127.0.0.1 port 29175, task ptr: 000005D79E430CA0 2020-08-15 12:31:35 #40530(normal) <c6db7e>; task; rspamd_message_parse: loaded message; id: <undef>; queue-id: <2E4E959E91>; size: 331; checksum: <34beba9b8e38095cff23f14e783d044a> 2020-08-15 12:31:35 #40530(normal) <c6db7e>; lua; greylist.lua:203: skip greylisting for local networks and/or authorized users 2020-08-15 12:31:35 #40530(normal) <c6db7e>; lua; spf.lua:185: skip SPF checks for local networks and authorized users 2020-08-15 12:31:35 #40530(normal) <c6db7e>; task; dkim_symbol_callback: skip DKIM checks for local networks and authorized users 2020-08-15 12:31:35 #40530(normal) <c6db7e>; lua; dmarc.lua:596: skip DMARC checks as either SPF or DKIM were not checked 2020-08-15 12:31:35 #40530(normal) <c6db7e>; lua; once_received.lua:98: Skipping once_received for authenticated user or local network 2020-08-15 12:31:35 #40530(normal) <c6db7e>; task; rspamd_redis_connected: skip obtaining bayes tokens for BAYES_SPAM of classifier bayes: not enough learns 0; 200 required 2020-08-15 12:31:35 #40530(normal) <c6db7e>; task; rspamd_redis_connected: skip obtaining bayes tokens for BAYES_HAM of classifier bayes: not enough learns 0; 200 required 2020-08-15 12:31:35 #40530(normal) <c6db7e>; task; rspamd_stat_classifiers_process: skip statistics as SPAM class is missing 2020-08-15 12:31:35 #40530(normal) <c6db7e>; lua; greylist.lua:317: Score too low - skip greylisting 2020-08-15 12:31:35 #40530(normal) <c6db7e>; lua; neural.lua:478: cannot learn ham ANN default:default; redis_key: rn_default_default_1htmxf7z_0: sampled out with probability 1 (1 vectors stored) 2020-08-15 12:31:35 #40530(normal) <c6db7e>; task; rspamd_task_write_log: id: <undef>, qid: <2E4E959E91>, ip: x.x.x.x, user: smtp_auth@domain, from: <noreply@domain>, (default: F (no action): [-47.60/15.00] [WHITELIST_SENDER_DOMAIN(-50.00){domain;},MISSING_MID(2.50){},MIME_GOOD(-0.10){text/plain;},ARC_NA(0.00){},ASN(0.00){asn:12345, ipnet:1.2.3.4, country:DE;},DKIM_SIGNED(0.00){},FROM_EQ_ENVFROM(0.00){},FROM_HAS_DN(0.00){},MIME_TRACE(0.00){0:+;},RCPT_COUNT_ONE(0.00){1;},RCVD_COUNT_ZERO(0.00){0;},TO_DN_NONE(0.00){},TO_MATCH_ENVRCPT_ALL(0.00){}]), len: 331, time: 10.650ms, dns req: 1, digest: <34beba9b8e38095cff23f14e783d044a>, rcpts: <user@otherdomain>, mime_rcpts: <user@otherdomain> 2020-08-15 12:31:35 #40530(normal) <c6db7e>; task; rspamd_protocol_http_reply: regexp statistics: 0 pcre regexps scanned, 0 regexps matched, 184 regexps total, 35 regexps cached, 0B scanned using pcre, 459B scanned total

fhloston commented 4 years ago

@fhloston Is :53 a process listening oder a NAT rule pointing to the real dns?

It is the local unbound. But this happens also with 1.1.1.1 as DNS.

root@fw01:/usr/local/etc/rspamd # sockstat -l | grep unbound root syslog-ng 82907 21 dgram /var/unbound/var/run/log unbound unbound 13456 3 udp6 *:53 *:* unbound unbound 13456 4 tcp6 *:53 *:* unbound unbound 13456 5 udp4 *:53 *:* unbound unbound 13456 6 tcp4 *:53 *:* unbound unbound 13456 7 tcp4 127.0.0.1:953 *:* unbound unbound 13456 8 dgram (not connected)

fhloston commented 4 years ago

And lastly, switching the milter protocol to IPv4 results in the mail passing through within 1s but containing no DKIM sig.

2020-08-15 12:41:55 #9772(rspamd_proxy) <8242d6>; proxy; proxy_accept_socket: accepted milter connection from 127.0.0.1 port 16992 2020-08-15 12:41:55 #9772(rspamd_proxy) <8242d6>; milter; rspamd_milter_process_command: MTA specifies too old protocol: 4, aborting connection 2020-08-15 12:41:55 #9772(rspamd_proxy) <8242d6>; proxy; proxy_milter_error_handler: abnormally closing milter connection from: 127.0.0.1:16992, error: invalid protocol version: 4

fhloston commented 4 years ago

Hang on - what does the output of rspamadm configdump represent? a) Parsed configfiles from disk without reflection of rspamd process state b) actual config of running rspamd process

I was assuming b) but thinking about it makes me feel it is a)...

fhloston commented 4 years ago

Yes, it is a).

So rerunning tests with complete stop/start of rspamd, as previous tests all had the same rspamd_proxy pid 9772.

fhloston commented 4 years ago

Good news, so far the issue has not resurfaced.

I have added nameserver = ["1.1.1.1"]; under dns {} in /usr/local/etc/rspamd/options.inc

fhloston commented 4 years ago

@mimugmail do i understand correctly, that the options.inc file is currently not rendered by the template system?

fhloston commented 4 years ago

Interestingly, as soon as i change it to nameserver = ["127.0.0.1"] the issue reappears after roughly 1 hour runtime!

So some kind of timeout error in combination with the local DNS resolver?

fhloston commented 4 years ago

Can you post your complete edit to the file? Then I can add it to GUI

are you still considering this @mimugmail ?

mimugmail commented 4 years ago

I'm now working on a PR ..

shadowwaking commented 4 years ago

Hi mimugmail,

it seems you already rolled out the change. Sadly, you missed to allow entering a port after the ip. People who dont use the standard port 53 for DNS (example if you use Unbound in front of DNSCrypt) have no chance

to enter a custom port.

It would be great if you could fix this.

Thank you. Screen

spec1re commented 3 years ago

For people still struggling on this issue, like me, I found a workaround for it.

Create a file in:

/usr/local/etc/rspamd/rspamd.conf.local.override

with this edited line:

...
worker "normal" {
    bind_socket = "*:11333";
    .include "$CONFDIR/worker-normal.inc"
    .include(try=true; priority=1,duplicate=merge) "$LOCAL_CONFDIR/local.d/worker-normal.inc"
    .include(try=true; priority=10) "$LOCAL_CONFDIR/override.d/worker-normal.inc"
}
...

This will make the Rspamd worker listen on *:11333 for ipv4 and ipv6, therefore the "IPv6 IO write error" will never happen.

I don't know what exactly the problem is, but I guess its a combination of Unbound is serving localhost with ::1 and how Rspamd binding the worker on localhost.

Regards

mimugmail commented 3 years ago

There is an open PR fixing this. Will come in one of the next releases