yoavweiss commented 1 year ago

#136 and the related paper point out a real risk of temporary origin vulnerabilities becoming permanent, using the NEL cache to exfiltrate reports about user behavior even after the vulnerability was fixed.

I believe we can solve this issue by making the NEL cache time-bound (e.g. to 24H), while maintaining Stale-While-Revalidate semantics after that timeout has expired (enabling one last report in case of failures before reconfirming NEL headers from the origin).

clelland commented 1 year ago

This seems reasonable -- I might argue for 48h rather than 24, just to avoid any weird timing issues for a site that is generally visited once per day (where a 23h gap between visits might produce significantly different patterns than a 25h gap, for instance) but otherwise I think we could make this work.

Essentially, a policy would be considered "stale" at a certain pre-determined time after it was created (every visit that includes a NEL header replaces the existing policy, so the creation time is usually kept fresh). If a network failure occurs while a policy is stale and causes a report to be generated, the policy will be removed from the policy cache immediately after the report is generated.
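
As a rough sketch of that behaviour (not the spec's algorithm; `PolicyCache`, `Policy`, `on_nel_header`, `on_network_failure` and the 48h constant are all hypothetical names and values chosen for illustration, and real NEL details such as `max_age` and sampling are left out):

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed threshold; the thread discusses 24h vs 48h and settles on neither.
STALE_AFTER_SECONDS = 48 * 60 * 60


@dataclass
class Policy:
    report_to: str     # hypothetical: name of the reporting endpoint group
    created_at: float  # seconds, on whatever clock the caller uses


class PolicyCache:
    def __init__(self, stale_after: float = STALE_AFTER_SECONDS) -> None:
        self._policies: Dict[str, Policy] = {}
        self._stale_after = stale_after

    def on_nel_header(self, origin: str, report_to: str, now: float) -> None:
        # Every response carrying a NEL header replaces the stored policy,
        # which also refreshes its creation time.
        self._policies[origin] = Policy(report_to, now)

    def has_policy(self, origin: str) -> bool:
        return origin in self._policies

    def on_network_failure(self, origin: str, now: float) -> Optional[str]:
        """Return the endpoint a report would be sent to, or None."""
        policy = self._policies.get(origin)
        if policy is None:
            return None
        if now - policy.created_at > self._stale_after:
            # Stale: allow this one last report, then drop the policy.
            del self._policies[origin]
        return policy.report_to
```

With this sketch, a failure one hour after the header was seen reports and keeps the policy; a failure 49 hours after reports once and removes it, so later failures produce nothing until the origin sends a NEL header again.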

For stale subdomain policies, this means that a DNS error on a subdomain of the origin that set the policy will trigger a report, and remove the policy (at the higher origin level). Any subsequent successful visit to the higher origin can reinstate the policy. A non-DNS error on a subdomain does not generate a report, and so would not remove the policy.
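
The subdomain case can be sketched as a small decision function. This is only an illustration of the paragraph above: `Outcome` and `handle_subdomain_failure` are made-up names, and the phase strings are assumed to follow the `dns` / `connection` / `application` naming used for NEL report phases.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    report_generated: bool
    policy_removed: bool  # removal happens at the parent origin's policy


def handle_subdomain_failure(phase: str, policy_is_stale: bool) -> Outcome:
    # For a policy inherited from a parent origin, only DNS-phase failures
    # on the subdomain produce a report.
    report = phase == "dns"
    # A stale policy is removed only when it actually generated a report.
    return Outcome(report_generated=report,
                   policy_removed=report and policy_is_stale)
```

So a DNS error under a stale policy yields one report and clears the parent's policy, while a connection error on the same subdomain leaves the stale policy in place.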

I'll see if I can get some stats on how often this would actually affect report generation. It would also be good to get some feedback from folks who use the feature, whether this change makes sense.

polcak commented 1 year ago

The original vulnerability is about a user that visits an adversary network that can deploy a MitM attack. A user might expect that something is wrong in the network and act with caution. Once the user moves to a different network, the user does not expect that the artifacts from the original network might continue to track the user.

The proposed change limits the time window during which the attack works but does not prevent the attack completely.

yoavweiss commented 1 year ago

Given that the "process policy headers" algorithm aborts if the origin is not potentially trustworthy, an adversary network is not part of the threat model. If such a network can deploy a MitM attack, there are deeper concerns for that user (e.g. Service Workers that persist for 24H, long-term resources in caches with no time limit, etc.).

clelland commented 1 year ago

Depending on the adversary's capabilities, there are different mitigations:

- If the transport is not secure (SSL), then NEL does not apply at all; no configuration will be stored, and no reports will be sent. This should cover the cases where the attacker controls some part of the network, but can't forge certificates.
- If the transport is secure, but the attacker can make the requests fail (DNS MITM, router misconfiguration, backhoe-to-the-fibre attacks), then #147 will ensure that failure reports are only delivered for a limited time, after which the configuration will be expired. This limits the time for an attack and, importantly, also correctly signals to the origin server that there is a problem (the attacker can't replace the policy, so only the original policy should apply). This is how NEL is supposed to work to signal connectivity problems.
- If the attacker can somehow both MITM the DNS and set up a server at an IP address under their control, and can also forge certificates, then they could replace the policy with a new one. (This is already getting into "the user has bigger problems to worry about" territory.) However, as soon as the user gets onto a clean network and contacts the origin server, the adversary's policy will be overwritten. If the targeted origin doesn't deploy NEL, then the fact that the server's IP has changed means that any requests will be downgraded to DNS errors, only failure reports will be sent (no success reports), and #147 will ensure that the window where reports can be sent is brief.
- If the attacker can somehow MITM the network, forge certificates, and spoof the origin server on its correct IP address, then they very likely have complete control over the victim's entire view of the network, and as Yoav says, this is outside the scope of any mitigation possible here or in many other features of the web. Even here, though, we have some of the protections in the previous bullet: if the targeted origin deploys NEL, then the attacker's configuration will be replaced as soon as the user visits them on a clean network, and if they do not deploy NEL, then #147 at least keeps the time window for any other requests to be sent short.

w3c / network-error-logging

NEL cache should be time-bound #139