roehling / postsrsd

Postfix Sender Rewriting Scheme daemon
srs_reverse sometimes fails with "Hash invalid in SRS address" #68

Closed mundschenk-at closed 3 years ago

mundschenk-at commented 7 years ago

I've noticed that sometimes (but not always), postsrsd fails to do srs_reverse for bounces with the message "Hash invalid in SRS address". At first I assumed it had something to do with case folding, but upon further investigation, that's probably not the reason.

srs_forward: <foo@some.domain> rewritten as <SRS0=cl8D=Z5=some.domain=foo@my.domain>
...
srs_reverse: <srs0=cl8d=z5=some.domain=foo@my.domain> not rewritten: Hash invalid in SRS address.
...
srs_forward: <bar@some.other.domain> rewritten as <SRS0=7mIz=Z5=some.other.domain=bar@my.domain>
...
srs_reverse: <srs0=7miz=z5=some.other.domain=bar@my.domain> rewritten as <bar@some.other.domain>

Why does the srs_reverse fail in the first instance and not in the second one?
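
For comparing the forward and reverse log lines field by field, a small Python sketch can split a rewritten address into the parts visible in the log (tag, hash, timestamp, original domain, original local part, forwarder domain). The names `Srs0Address` and `parse_srs0` are hypothetical helpers for illustration and are not part of postsrsd; the sketch only handles the `SRS0=` form shown in the log.

```python
from typing import NamedTuple


class Srs0Address(NamedTuple):
    digest: str      # short keyed hash, e.g. "cl8D"
    timestamp: str   # coarse timestamp, e.g. "Z5"
    domain: str      # original sender domain
    local: str       # original sender local part
    forwarder: str   # domain of the rewriting host


def parse_srs0(address: str) -> Srs0Address:
    """Split an SRS0 address into its fields (illustrative helper)."""
    local_part, _, forwarder = address.rpartition("@")
    # Expect the tag "SRS0" (in any case) followed by a "=" separator.
    if local_part[:5].upper() != "SRS0=":
        raise ValueError("not an SRS0 address")
    # maxsplit=3 keeps any "=" inside the original local part intact.
    fields = local_part[5:].split("=", 3)
    if len(fields) != 4:
        raise ValueError("malformed SRS0 address")
    return Srs0Address(*fields, forwarder)


forward = parse_srs0("SRS0=cl8D=Z5=some.domain=foo@my.domain")
bounce = parse_srs0("srs0=cl8d=z5=some.domain=foo@my.domain")
# The bounce differs from the forward address only by case folding.
print(forward.digest, bounce.digest)
```

In both log pairs the bounce address arrives lower-cased, and the second pair is still reversed successfully, which is consistent with case folding not being the cause.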

mundschenk-at commented 7 years ago

@roehling Have you got any idea why postsrsd would fail to decrypt rewritten addresses created by itself just moments before?

roehling commented 7 years ago

I haven't seen this problem before. Which version are you using?

mundschenk-at commented 7 years ago

mail/postsrsd from FreeBSD ports, which is still at 1.3 apparently.

mundschenk-at commented 7 years ago

@roehling I've experimented a bit more and been able to reproduce the issue with arbitrary test strings. When the domain part of the "address" is 16 characters long, the first call in a row to get the SRS version returns an invalid hash.

telnet 127.0.0.1 10001
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get x@123456789abcdefg
200 SRS0=2i4M=2G=123456789abcdefg=x@my-domain
get x@123456789abcdefg
200 SRS0=SU44=2G=123456789abcdefg=x@my-domain
get x@123456789abcdefg
200 SRS0=SU44=2G=123456789abcdefg=x@my-domain
telnet 127.0.0.1 10002
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get SRS0=2i4M=2G=123456789abcdefg=x@my-domain
500 Hash invalid in SRS address.
get SRS0=SU44=2G=123456789abcdefg=x@my-domain
200 x@123456789abcdefg
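
The telnet session can be scripted. The sketch below assumes the daemon speaks the Postfix TCP lookup table protocol as seen in the transcript (a `get <key>` request line, answered by a status code and a text), and that keys and replies may be %-encoded. The function names `parse_reply` and `lookup` are made up for this example; the ports 10001 and 10002 are the ones used above.

```python
import socket
from typing import Tuple
from urllib.parse import quote, unquote


def parse_reply(line: str) -> Tuple[int, str]:
    """Split a reply such as '200 user@example.org' into (code, text)."""
    code, _, text = line.rstrip("\r\n").partition(" ")
    return int(code), unquote(text)


def lookup(key: str, port: int, host: str = "127.0.0.1",
           timeout: float = 5.0) -> Tuple[int, str]:
    """Send one 'get' request and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Leave the characters that appear in SRS addresses unescaped.
        request = "get " + quote(key, safe="@=+") + "\n"
        sock.sendall(request.encode("ascii"))
        with sock.makefile("r", encoding="ascii", newline="\n") as reader:
            return parse_reply(reader.readline())


def round_trip(address: str, forward_port: int = 10001,
               reverse_port: int = 10002) -> bool:
    """Rewrite an address, reverse it, and report whether it survived."""
    code, rewritten = lookup(address, forward_port)
    if code != 200:
        return False
    code, original = lookup(rewritten, reverse_port)
    return code == 200 and original.lower() == address.lower()
```

Note that `round_trip` opens a fresh connection for every request, so it exercises the "first call on a connection" case each time, which is the one that returned the odd hash above.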

roehling commented 7 years ago

Interesting, I will investigate this. Thanks for your help!

mundschenk-at commented 7 years ago

FYI: The FreeBSD port has been updated to 1.4, but the issue is still reproducible (not surprising since a patched-for-FreeBSD 1.3 was not that different from vanilla 1.4 from what I saw in the commit history).

mundschenk-at commented 7 years ago

Have you been able to find anything? Is there something I can do to help?

mundschenk-at commented 7 years ago

@roehling Any news on this?

roehling commented 7 years ago

I have tried quite a few things, but I've been unable to reproduce this:

$ telnet localhost 10001
Trying 127.0.0.1...
Connected to localhost.localdomain.
Escape character is '^]'.
get x@123456789abcdefg
200 SRS0=JOZt=4A=123456789abcdefg=x@example.com
get x@123456789abcdefg
200 SRS0=JOZt=4A=123456789abcdefg=x@example.com
$ telnet localhost 10002
Trying 127.0.0.1...
Connected to localhost.localdomain.
Escape character is '^]'.
get SRS0=JOZt=4A=123456789abcdefg=x@example.com
200 x@123456789abcdefg

mundschenk-at commented 7 years ago

It might be some weird interaction with compiler optimizations.

When we switched to a new server recently (same OS, but newer Intel platform), I enabled CPU-specific optimizations. I noticed that the error still occurred, but not with the same trigger that was reproducible on the old server. I then disabled CPU-specific optimizations and the error can be reproduced again with x@123456789abcdefg.

It's been a while since I've coded C, so I doubt I'll find anything. Are there any constructs that could be susceptible to unwanted "optimization" by clang? (This is FreeBSD 11 with clang 3.8, BTW.)

roehling commented 7 years ago

I have added a test framework to simplify the bug hunt. On my computer, the test succeeds both with GCC and clang 3.8. It would be interesting to see if using -O2 instead of -O3 (or even -O0 for that matter) fixes the problem.

mundschenk-at commented 7 years ago

I tried different optimization levels (set in make.conf):

CFLAGS=-O2 (system default): bug occurs
CFLAGS=-O0: no bug
CFLAGS=-O3: no bug

Even more interesting: The generated SRS hashes are identical (including getting different hashes on the first and subsequent requests with the same connection). However, with the explicit CFLAGS, both can be decoded!

System default:

$ telnet localhost 10001
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get x@123456789abcdefg
200 SRS0=6jzY=4A=123456789abcdefg=x@example.org
get x@123456789abcdefg
200 SRS0=fn8W=4A=123456789abcdefg=x@example.org
get x@123456789abcdefg
200 SRS0=fn8W=4A=123456789abcdefg=x@example.org
$ telnet localhost 10002
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get SRS0=6jzY=4A=123456789abcdefg=x@polis.or.at
500 Hash invalid in SRS address.
get SRS0=fn8W=4A=123456789abcdefg=x@polis.or.at
200 x@123456789abcdefg

With CFLAGS=-O3:

$ telnet localhost 10001
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get x@123456789abcdefg
200 SRS0=6jzY=4A=123456789abcdefg=x@example.org
get x@123456789abcdefg
200 SRS0=fn8W=4A=123456789abcdefg=x@example.org
get x@123456789abcdefg
200 SRS0=fn8W=4A=123456789abcdefg=x@example.org
$ telnet localhost 10002
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get SRS0=6jzY=4A=123456789abcdefg=x@example.org
200 x@123456789abcdefg
get SRS0=fn8W=4A=123456789abcdefg=x@example.org
200 x@123456789abcdefg

Update: It's the -O2; the other flags are added by the system anyway and don't change the result.

mundschenk-at commented 7 years ago

@roehling So we'd have to look at the decoding function, it would seem? I'll try the new test framework tonight.

mundschenk-at commented 7 years ago

@roehling I just ran the test harness with 3 different settings for CFLAGS. The output is identical for all of them. What should the output look like if correct?

roehling commented 7 years ago

The test generates random email addresses with different lengths and tests whether the SRS transformation works, i.e. valid rewritten addresses are transformed back and modified ones are rejected.
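
The idea of such a round-trip test can be sketched in Python. This is not the postsrsd test suite: `toy_forward` and `toy_reverse` are simplified stand-ins (an HMAC-SHA1 truncated to four base64 characters, with a fixed timestamp), and the secret and forwarder domain are placeholders.

```python
import base64
import hashlib
import hmac
import random
import string

SECRET = b"example-secret"   # placeholder secret for the sketch
ALIAS = "example.org"        # placeholder forwarder domain
TIMESTAMP = "4A"             # fixed here; a real daemon derives it from the date


def _digest(timestamp: str, domain: str, local: str) -> str:
    message = f"{timestamp}{domain}{local}".lower().encode()
    mac = hmac.new(SECRET, message, hashlib.sha1).digest()
    return base64.b64encode(mac).decode()[:4]


def toy_forward(address: str) -> str:
    local, _, domain = address.rpartition("@")
    digest = _digest(TIMESTAMP, domain, local)
    return f"SRS0={digest}={TIMESTAMP}={domain}={local}@{ALIAS}"


def toy_reverse(srs_address: str) -> str:
    local_part = srs_address.rpartition("@")[0]
    digest, timestamp, domain, local = local_part[5:].split("=", 3)
    # Compare case-insensitively so case-folded bounces still verify.
    if digest.lower() != _digest(timestamp, domain, local).lower():
        raise ValueError("Hash invalid in SRS address.")
    return f"{local}@{domain}"


def tamper(srs_address: str) -> str:
    """Change the first hash character to one that differs even after case folding."""
    replacement = "a" if srs_address[5].lower() != "a" else "b"
    return srs_address[:5] + replacement + srs_address[6:]


def run_round_trips(seed: int = 0, max_len: int = 40) -> int:
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits
    checked = 0
    for length in range(1, max_len + 1):   # one domain per length
        local = "".join(rng.choice(alphabet) for _ in range(rng.randint(1, 8)))
        domain = "".join(rng.choice(alphabet) for _ in range(length))
        address = f"{local}@{domain}"
        rewritten = toy_forward(address)
        assert toy_reverse(rewritten) == address       # valid address reverses
        try:
            toy_reverse(tamper(rewritten))             # modified address is rejected
        except ValueError:
            checked += 1
        else:
            raise AssertionError("tampered address was accepted")
    return checked
```

Because the sketch keeps no state between calls, it cannot reproduce a bug that depends on the daemon's secret handling; it only illustrates the accept/reject property being tested.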

roehling commented 7 years ago

make test should output whether the tests passed or failed.

mundschenk-at commented 7 years ago

Ah! I just ran the executable. Well, the tests all pass, regardless of the CFLAGS.

mundschenk-at commented 7 years ago

What still looks a bit weird to me: When you connect to the forward server, you get the same hash for multiple GET calls. When I do that, the first call always results in a different hash than subsequent calls. Why is that?

roehling commented 7 years ago

I have no idea, and I suspect it is the root cause of the bug on your system. I'm a little bit stretched for time right now, but I'm going to try and reproduce the weird behavior with -O2 as soon as I can.

mundschenk-at commented 7 years ago

At first I had assumed it was a timestamp issue, but it obviously is not. Please take note that this happens regardless of the CFLAGS.

I just temporarily installed the compiled master branch and got even weirder behavior:

telnet localhost 10001
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get x@123456789abcdefg
200 SRS0=Rs9s=4B=123456789abcdefg=x@example.org
get x@123456789abcdefg
200 SRS0=2YsH=4B=123456789abcdefg=x@example.org
get x@123456789abcdefg
200 SRS0=Rs9s=4B=123456789abcdefg=x@example.org

Another thing just came to my mind, and this may be at least a proximate cause: I'm using LibreSSL on my machine. Just to check, I tried compiling with plain OpenSSL in a VM now and it appears that the strange behavior does not happen in that case. As far as I can see, there is only a single include from OpenSSL (line 26 of srs2.c)?

Mhm... but you are #undefing that, so it should not matter? Strange.

mundschenk-at commented 7 years ago

OOOkay. I think I finally found out what the real issue is/was. I had a look at my secrets file. The production one is pretty ancient and looked ... weird. Repeated, alternating lines, etc. When I generated a new secret string that was less than 80 characters long, I would not get the strange differing hashes. When I switched back to the old secrets file, boom, the behavior was back.

The old file had 9400 bytes. I'll see whether I can identify a single secret line that triggers the odd behavior or whether it's just the large file that creates problems.
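
A quick way to audit a secrets file for the symptoms described here (very long lines, repeated lines) is sketched below. The 80-byte limit is only taken from the observation above that a fresh secret under 80 characters behaved; it is an assumption, not a documented postsrsd limit, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def long_secret_lines(text: str, limit: int = 80) -> List[Tuple[int, int]]:
    """Return (line_number, byte_length) for lines of at least `limit` bytes."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        size = len(line.encode("utf-8"))
        if size >= limit:
            findings.append((number, size))
    return findings


def repeated_secret_lines(text: str) -> List[str]:
    """Return the non-empty lines that occur more than once, in first-seen order."""
    counts = Counter(line for line in text.splitlines() if line)
    return [line for line, count in counts.items() if count > 1]
```

To use it, read the secrets file into a string (the location depends on the installation) and pass the contents to both functions; bisecting the reported lines one by one then narrows down which secret triggers the behavior.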

mundschenk-at commented 7 years ago

@roehling I've found that a secret line of 114 bytes triggers the odd behavior (presumably a buffer overflow somewhere). Can you check whether it is reproducible for you with this string?

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqr stuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY

remkolodder commented 7 years ago

I seem to be hitting something similar with a few users at the moment.

OS: FreeBSD
SW: Postfix 3.2.2
SW: Postsrsd 1.4

I am using one secret in postsrsd.secret which is 74 characters "long". I updated the hash since I read above that that might do the job. It's now 65 characters long.
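
Both lengths mentioned here (74 and 65 characters) are above 64 bytes, the SHA-1 block size. Assuming the SRS hash is an HMAC-SHA1 (an assumption about the implementation, not something established in this thread), keys longer than the block size take a separate path by definition: HMAC first replaces such a key with its SHA-1 digest. The Python sketch below only demonstrates that property of HMAC; whether it has anything to do with this bug is not confirmed here.

```python
import hashlib
import hmac

SHA1_BLOCK_SIZE = hashlib.sha1().block_size  # 64 bytes


def hmac_sha1(key: bytes, message: bytes) -> str:
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def takes_long_key_path(secret: bytes) -> bool:
    """True if HMAC-SHA1 would hash this key before using it."""
    return len(secret) > SHA1_BLOCK_SIZE


message = b"4a123456789abcdefgx"   # arbitrary example input
long_key = b"k" * 65               # one byte over the block size

# For an over-long key, HMAC(key, m) equals HMAC(SHA1(key), m).
assert hmac_sha1(long_key, message) == hmac_sha1(
    hashlib.sha1(long_key).digest(), message
)
print(takes_long_key_path(b"k" * 64), takes_long_key_path(long_key))
```

A secret of 64 bytes or fewer avoids that extra step, so shortening the secret below that size is a cheap experiment when chasing this kind of error.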

troublestarter commented 6 years ago

Same problem here.

Nov 14 16:38:56 relay1 postsrsd[31474]: srs_reverse: SRS0=GpBe=NZ=gmail.com=sav.garidech1@my-XXXX.XYZ not rewritten: Hash invalid in SRS address.

The mail was already stuck before I installed opensrsd.

How can I send the mail again without this error message?

troublestarter commented 6 years ago

@roehling I'm seeing lots of "not rewritten: Hash invalid in SRS address." messages. Do you have any idea what causes this?

Latest CentOS, latest Postfix, latest OpenDKIM, and the latest master branch of postsrsd.

Important note: the SRS address might have been generated by a third-party server (a customer that uses our mail server as a relay for outgoing mail). Why can't it be decoded?

[root@relay1 ~]# telnet 127.0.0.1 10002
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
get SRS0=QS1C=N2=equipjardin.com=e.breton@XXXX.XYZ
500 Hash invalid in SRS address.

roehling commented 3 years ago

Does this problem still occur with the latest release?

mundschenk-at commented 3 years ago

I'm still on 1.6 (FreeBSD ports version) and haven't had the issue on my live system after cleaning up the secrets file (i.e. making sure that the secret is less than 114 bytes). Have you been able to reproduce the issue with the secret in https://github.com/roehling/postsrsd/issues/68#issuecomment-298186084?

roehling commented 3 years ago

Yes, I have reproduced it, and I think I found the underlying cause as well.