mrisher / smtp-sts

SMTP Strict Transport Security
Apache License 2.0

[UTA feedback] - Section 3.5 Updates #85

Closed jjones425 closed 7 years ago

jjones425 commented 8 years ago

From Stephen Farrell

jjones425 commented 8 years ago

From Viktor

3.5. Policy Application

When sending to an MX at a domain for which the sender has a valid non-expired SMTP MTA-STS policy, a sending MTA honoring SMTP MTA-STS MAY apply the result of a policy validation one of two ways:

  • report: In this mode, sending MTAs merely send a report to the designated report address indicating policy application failures. This can be done "offline", i.e. based on the MTA logs, and is thus a suitable low-risk option for MTAs who wish to enhance transparency of TLS tampering without making complicated changes to production mail-handling infrastructure.

This is misleading. Client MTAs that are unaware of STS policy would not be in any position to log policy violations, and so could not generate meaningful reports. Reports require a non-trivial population of suitably enabled clients.

Still, the claim that "report only" mode avoids production changes is largely unfounded. What report-only mode does is let the operators of receiving MTAs test their policy before making it mandatory (assuming they have working reporting and a sufficiently representative pool of STS-enabled reporting clients).

  • enforce: In this mode, sending MTAs SHOULD treat STS policy failures, in which the policy action is "reject", as a mail delivery error, and SHOULD terminate the SMTP connection, not delivering any more mail to the recipient MTA.

This fails to describe the behaviour when the nexthop domain has multiple MX hosts. It also fails to describe how to handle MX RRsets in which some MX hosts match the "mx" policy component, and some do not.

My suggestion would be that non-matching MX hosts and any MX hosts with a worse (higher) MX preference be removed from the MX RRset, leaving only matching hosts at an equal or better (lower) preference. Mail delivery can proceed only if that set is non-empty. If a first matching MX host fails authentication, then a second matching MX host is tried, ... until one passes, all are tried, or some sender limit on the number of MX hosts to try is reached. Mail is deferred if none of the attempted MX hosts pass authentication.

Naturally, if the sending MTA finds itself in the destination RRset, then it MUST remove all MX hosts with a preference equal or greater (worse) than its own preference, EVEN IF its own name does not match the "mx" field in the destination policy. Loop elimination trumps all other considerations.
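The MX handling suggested above can be sketched roughly as follows. This is a minimal illustration, not text from the draft: the function names, the `matches_policy` callback, and the `max_attempts` limit are all hypothetical. It also assumes one particular reading of the suggestion, namely that the best-preference non-matching host sets a cutoff and that an empty candidate set results in deferral.

```python
from typing import Callable, Iterable, List, Set, Tuple

MX = Tuple[int, str]  # (preference, hostname); a lower preference is better


def candidate_mx_hosts(
    rrset: Iterable[MX],
    matches_policy: Callable[[str], bool],  # hypothetical "mx" pattern check
    own_names: Set[str],
) -> List[MX]:
    """Return the MX hosts that may be tried, best preference first."""
    hosts = sorted((pref, name.rstrip(".").lower()) for pref, name in rrset)
    own = {n.rstrip(".").lower() for n in own_names}

    # Loop elimination comes first and ignores the policy entirely:
    # drop every host whose preference is equal to or worse than our own.
    own_prefs = [pref for pref, name in hosts if name in own]
    if own_prefs:
        limit = min(own_prefs)
        hosts = [(pref, name) for pref, name in hosts if pref < limit]

    # Assumed reading: the best-preference non-matching host sets a cutoff,
    # and hosts with a worse preference than that cutoff are dropped.
    bad_prefs = [pref for pref, name in hosts if not matches_policy(name)]
    if bad_prefs:
        cutoff = min(bad_prefs)
        hosts = [(pref, name) for pref, name in hosts if pref <= cutoff]

    # Only policy-matching hosts remain eligible.
    return [(pref, name) for pref, name in hosts if matches_policy(name)]


def deliver(
    candidates: List[MX],
    try_host: Callable[[str], bool],  # True if the host passes authentication
    max_attempts: int = 5,            # assumed sender-side limit
) -> str:
    """Try hosts in order; return "delivered" or "deferred"."""
    for _pref, name in candidates[:max_attempts]:
        if try_host(name):
            return "delivered"
    # Nothing passed (or the candidate set was empty): defer, do not bounce.
    return "deferred"
```

Running loop elimination before the policy filter means the sender's own name never needs to match the "mx" field for the loop check to take effect.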

Also "not delivering any more mail to the recipient MTA" is rather an overstatement. All one can reasonably say is that the message in question is not sent via the problem MX host. Later messages may well be sent via the problem MX host, provided it meets the policy requirements for those messages.

In enforce mode, however, sending MTAs MUST first check for a new authenticated policy before actually treating a message failure as fatal.

It is rather unclear how this is supposed to work in the presence of multiple MX hosts. When a first MX host fails, MUST the policy be refreshed there and then, or do we skip to the next MX host and refresh the policy only when all fail?

Secondly, this requirement makes implementation rather more complex. Refreshing the policy and retrying the same message synchronously is much harder than simply deferring the mail and waiting for a signal from an updated "id" in DNS. Receiving systems should use short TTLs on the TXT RRs that carry the "id" value. A sending MTA might, however, trigger a background policy refresh if the current policy was not cached "recently". A background refresh would limit the duration of any "outage" caused by holding a stale policy (itself the result of negligent receiving-system operator practice).
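As a rough sketch of the deferral-plus-background-refresh alternative: the message is always deferred, and a refresh is started only when the cached policy is older than some "recent" window. The `CachedPolicy` type, the function name, and the one-hour window are assumptions made for illustration; the comment above does not specify how recent "recently" is.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass
class CachedPolicy:
    policy_id: str
    fetched_at: datetime


# Assumed threshold; the thread does not say how recent "recently" is.
RECENT = timedelta(hours=1)


def on_all_mx_failed(
    cached: CachedPolicy,
    now: datetime,
    recent: timedelta = RECENT,
) -> Tuple[str, bool]:
    """Return (disposition, start_background_refresh) for a failed delivery."""
    # The message is deferred either way; the refresh happens out of band,
    # and only if the cached copy is older than the "recent" window.
    start_refresh = (now - cached.fetched_at) > recent
    return ("defer", start_refresh)
```

The delivery path never blocks on an HTTPS fetch here, which is the simplification being argued for.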

Thus the control flow for a sending MTA that does online policy application consists of the following steps:

  1. Check for cached non-expired policy. If none exists, fetch the latest, authenticate and cache it.

Only if a DNS TXT record signals that a policy is expected.

  2. Validate recipient MTA against policy. If valid, deliver mail.

This is ill-defined, since multi-MX behaviour is not described.

  3. If not valid and the policy specifies reporting, generate report.

Reporting is no longer specified in STS policies. Rather, they just specify "soft" vs. "hard" failure, with mail delivered anyway in the former case, and any reports requested (separate draft) sent in either case.

  4. If not valid and policy specifies rejection, perform the following steps:
    • Check for a new (non-cached) authenticated policy.

Possibly in the background, with the current message deferred. Thus either a synchronous retry, or an implicit "stale id" signal, that triggers an asynchronous policy refresh.

    • If one exists and the new policy is different, update the current policy and go to step 2.

That's the synchronous behaviour. Also what happens when retrieval fails (connection timeout, failure to authenticate the HTTPS server, HTTPS error other than 404, ...)?

    • If one exists and the new policy is same as the cached policy, treat the delivery as a failure.

Again, synchronous behaviour. I would treat the delivery failure as transient (4XX) and queue the mail, and say so in the spec.

    • If none exists and cached policy is not expired, treat the delivery as a failure.

This does not seem right. What does "none exists" mean? If the HTTPS server returns an authenticated "404", then presumably the domain no longer implements STS, and the cached policy should be deleted! Hanging on to a stale no-longer published policy feels rather wrong.

Which I think suggests that "404" should be clearly specified as a mechanism to revoke all STS policy.
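A minimal sketch of the cache handling proposed here, in which an authenticated 404 revokes the cached policy. The `FetchResult` enum, the dict-based cache, and the function name are hypothetical. Keeping the cached policy on other retrieval errors is an assumption, since the thread leaves that case open.

```python
from enum import Enum
from typing import Dict, Optional


class FetchResult(Enum):
    OK = "ok"                # policy retrieved and authenticated
    NOT_FOUND = "not_found"  # authenticated HTTPS 404
    ERROR = "error"          # timeout, authentication failure, other HTTPS error


def apply_fetch_result(
    cache: Dict[str, dict],
    domain: str,
    result: FetchResult,
    policy: Optional[dict] = None,
) -> str:
    """Update the policy cache after a refresh attempt and report what happened."""
    if result is FetchResult.OK and policy is not None:
        cache[domain] = policy
        return "updated"
    if result is FetchResult.NOT_FOUND:
        # Proposed semantics: the domain no longer implements STS,
        # so the cached policy is deleted rather than kept until expiry.
        cache.pop(domain, None)
        return "revoked"
    # Assumption: any other failure leaves the cached policy in force.
    return "kept"
```

Only an authenticated 404 counts as revocation in this sketch. Treating an unauthenticated failure the same way would let an attacker strip the policy.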

Remember that each policy has an expiration time (which SHOULD be long, on the order of days or months) and a validation method.

There is no longer a "validation method" (i.e. just PKIX, no DANE). The "mode" (enforce or report-only), if that is what is meant here, should be consistently called by some other name (failure mode?).

With these two mechanisms and the procedure specified in step 4,

What "mechanisms"?

recipients who publish a policy have, in effect, a means of updating a cached policy at arbitrary intervals, without the risks (of a man-in-the-middle attack) they would incur if they were to shorten the policy expiration time.

What makes timely refresh possible is primarily the combination of the "id" fields in the policy and the DNS TXT record, but that's not what's described above. Refresh on failure is only a last resort in case of operator incompetence. A competent operator will ensure that the MX hosts always pass the currently published policy and any recently published policies whose max-age has not yet expired since the last time at which they were published. Such an operator will also use the TXT record "id" field to signal policy changes in a timely manner.
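The "id"-based refresh signal can be illustrated with a short sketch that compares the "id" in the TXT record with the "id" of the cached policy. The record layout shown (`v=STSv1; id=...;`) and both function names are assumptions for illustration; the exact syntax is whatever the draft specifies.

```python
from typing import Dict, Optional


def parse_txt(record: str) -> Dict[str, str]:
    """Split a 'key=value; key=value;' TXT record into a dict (assumed layout)."""
    fields: Dict[str, str] = {}
    for part in record.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip empty or malformed fragments
            fields[key.strip()] = value.strip()
    return fields


def policy_is_stale(txt_record: str, cached_id: Optional[str]) -> bool:
    """True when the TXT record advertises an id that differs from the cached one."""
    txt_id = parse_txt(txt_record).get("id")
    if txt_id is None:
        # Assumption: with no id published there is nothing to compare against.
        return False
    return txt_id != cached_id
```

With short TTLs on the TXT record, a check like this lets a sender notice a policy change without waiting for a delivery failure.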

danmarg commented 7 years ago

Mostly addressed in my big refactor. I have, however, added a requirement to treat max_age=0 as explicit revocation in a257249.