Closed by yoavweiss 1 year ago
This seems reasonable -- I might argue for 48h rather than 24h, just to avoid any weird timing issues for a site that is generally visited once per day (where a 23h gap between visits might produce significantly different patterns than a 25h gap, for instance), but otherwise I think we could make this work.
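To make the timing concern concrete, here is a small sketch of how the two candidate thresholds treat a site visited roughly once per day. The `is_stale` helper and the gap values are illustrative assumptions, not anything defined by the spec.

```python
from datetime import timedelta


def is_stale(gap_since_last_visit: timedelta, threshold: timedelta) -> bool:
    """Hypothetical check: a policy is stale once its age exceeds the threshold."""
    return gap_since_last_visit > threshold


# Two visit gaps that a "once per day" user could plausibly produce.
gaps = [timedelta(hours=23), timedelta(hours=25)]

for threshold_hours in (24, 48):
    threshold = timedelta(hours=threshold_hours)
    results = [is_stale(gap, threshold) for gap in gaps]
    print(f"{threshold_hours}h threshold: {results}")

# 24h threshold: [False, True]
# 48h threshold: [False, False]
```

With a 24h threshold the two gaps land on opposite sides of the cutoff, whereas a 48h threshold treats both the same way.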
Essentially, a policy would be considered "stale" at a certain pre-determined time after it was created (every visit that includes a NEL header replaces the existing policy, so the creation time is usually kept fresh). If a network failure occurs while a policy is stale and causes a report to be generated, the policy will be removed from the policy cache immediately after the report is generated.
For stale subdomain policies, this means that a DNS error on a subdomain of the origin that set the policy will trigger a report and remove the policy (at the higher origin level). Any subsequent successful visit to the higher origin can reinstate the policy. A non-DNS error on a subdomain does not generate a report, and so would not remove the policy.
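The behaviour described above can be sketched roughly as follows. This is a simplified model, not the spec's algorithm: `PolicyCache`, `Policy`, `STALE_AFTER_SECONDS`, and the method names are all hypothetical, hostnames stand in for origins, and `max_age` expiry and sampling fractions are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed staleness threshold; the thread discusses 24h versus 48h.
STALE_AFTER_SECONDS = 48 * 60 * 60


@dataclass
class Policy:
    created_at: float              # refreshed on every response carrying a NEL header
    include_subdomains: bool = False


class PolicyCache:
    """Hypothetical model of a NEL policy cache with 'one last report' eviction."""

    def __init__(self) -> None:
        self._policies: Dict[str, Policy] = {}

    def has_policy(self, host: str) -> bool:
        return host in self._policies

    def on_nel_header(self, host: str, now: float,
                      include_subdomains: bool = False) -> None:
        # Every visit with a NEL header replaces the policy, resetting its age.
        self._policies[host] = Policy(now, include_subdomains)

    def _find_policy_host(self, host: str, is_dns_error: bool) -> Optional[str]:
        if host in self._policies:
            return host
        # A parent's subdomain policy only applies to DNS errors.
        if not is_dns_error:
            return None
        labels = host.split(".")
        for i in range(1, len(labels)):
            parent = ".".join(labels[i:])
            policy = self._policies.get(parent)
            if policy is not None and policy.include_subdomains:
                return parent
        return None

    def on_network_failure(self, host: str, now: float,
                           is_dns_error: bool) -> bool:
        """Return True if a report is generated for this failure."""
        policy_host = self._find_policy_host(host, is_dns_error)
        if policy_host is None:
            return False
        policy = self._policies[policy_host]
        if now - policy.created_at > STALE_AFTER_SECONDS:
            # Stale policy: allow this one report, then evict the policy.
            del self._policies[policy_host]
        return True


cache = PolicyCache()
cache.on_nel_header("example.com", now=0.0, include_subdomains=True)
later = STALE_AFTER_SECONDS + 1.0

# Non-DNS error on a subdomain: no report, policy stays.
print(cache.on_network_failure("a.example.com", later, is_dns_error=False))  # False
# DNS error on a subdomain while stale: one report, then the policy is removed.
print(cache.on_network_failure("a.example.com", later, is_dns_error=True))   # True
print(cache.has_policy("example.com"))                                       # False
```

A later successful visit to `example.com` would call `on_nel_header` again and reinstate the policy.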
I'll see if I can get some stats on how often this would actually affect report generation. It would also be good to get some feedback from folks who use the feature on whether this change makes sense.
The original vulnerability is about a user who visits an adversarial network that can mount a MitM attack. The user might suspect that something is wrong with that network and act with caution. Once the user moves to a different network, they do not expect artifacts from the original network to continue tracking them.
The proposed change limits the time window during which the attack works but does not prevent the attack completely.
Given that the "process policy headers" algorithm aborts if the origin is not potentially trustworthy, an adversarial network is not part of the threat model. If such a network can deploy a MitM attack, there are deeper concerns for that user (e.g. Service Workers that persist for 24H, long-term resources in caches with no time limit, etc.).
Depending on the adversary's capabilities, there are different mitigations --
#136 and the related paper point out a real risk of temporary origin vulnerabilities becoming permanent, with the NEL cache being used to exfiltrate reports about user behavior even after the vulnerability has been fixed.
I believe we can solve this issue by making the NEL cache time-bound (e.g. 24H), while maintaining Stale-While-Revalidate semantics after that timeout has expired (enabling one last report in case of failures before reconfirming NEL headers from the origin).