The problem
There is in principle no robust way to map programmatic HTTP requests or responses (XMLHttpRequest and fetch() calls) to modified HTML page elements without performing full-scale taint analysis, which is much too heavy (and maybe impossible). For example:
1. Client: JavaScript code requests GET /first/url via XMLHttpRequest
2. Server: Sends the /first/url response
3. Client: JavaScript code requests GET /second/url via XMLHttpRequest
4. Server: Sends the /second/url response
5. Client: Waits 5 seconds
6. Client: Updates a <div> element with data from the /second/url response
7. Client: Waits another 5 seconds
8. Client: Updates a different <div> element with data from the /first/url response
A browser extension (or in general, any JavaScript code that the existing site's JavaScript does not have a dependency on) has no way to determine that the update in step 6 came from /second/url, while the update in step 8 came from /first/url.
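To make the ambiguity concrete, here is a minimal sketch that replays the sequence above as a plain event log, with no real network or DOM involved. The element IDs (div-a, div-b) and the "blame the latest response" heuristic are assumptions made up for illustration; they are not taken from the existing code.

```javascript
// Hypothetical simulation of the example sequence; nothing here touches a real page.
function simulatePage() {
  const events = []; // what an outside observer could record, in order
  const truth = {};  // ground truth, known only to the page's own JavaScript

  // Steps 1-4: both responses arrive before any element is touched.
  for (const url of ["/first/url", "/second/url"]) {
    events.push({ type: "response", url });
  }

  // Steps 5-8: the page applies the responses later, in reverse order.
  const applyLater = [
    { elementId: "div-a", url: "/second/url" },
    { elementId: "div-b", url: "/first/url" },
  ];
  for (const { elementId, url } of applyLater) {
    truth[elementId] = url; // the observer never sees this link
    events.push({ type: "mutation", elementId });
  }
  return { events, truth };
}

// One heuristic an observer might try: blame the most recently completed response.
function attributeToLatestResponse(events) {
  const guess = {};
  let latest;
  for (const e of events) {
    if (e.type === "response") latest = e.url;
    else guess[e.elementId] = latest;
  }
  return guess;
}

const { events, truth } = simulatePage();
const guess = attributeToLatestResponse(events);
console.log({ guess, truth });
```

In this sketch the heuristic gets div-a right only by coincidence and attributes div-b to /second/url, although the page actually filled it from /first/url. Any other rule based purely on event order fails on some similar sequence.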
Existing code attempts to solve this by either:
- Relying on new XMLHttpRequest and getElementById(): As acknowledged in the README, this makes many strong assumptions (e.g., the JavaScript is inline; the modification follows a call to getElementById() rather than, e.g., querySelector(); the modified element appears as a string literal in that call) that can lead to both false positives and false negatives.
- Relying on MutationObserver: This attaches provenance header data to the HTML element identified by originalMutation at the time the server request completes, which is necessarily before any HTML mutation that depends on that response. The accesses to originalMutation in the linked code will therefore refer to some (irrelevant) previous mutation. In particular, the first time the linked code runs, originalMutation will be undefined. Additionally, the code currently only tracks the most recently modified HTML element.
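The ordering problem with originalMutation can be reproduced in a few lines. This is a simplified sketch, not the linked code: onMutation and onResponseComplete are hypothetical stand-ins for the observer callback and the request-completion handler, and the element IDs are invented.

```javascript
// Simplified model of the timing issue; the handler names are hypothetical.
let originalMutation;  // most recently mutated element, as tracked by the observer
const attached = [];   // provenance attachments made at response-completion time

function onMutation(elementId) {
  originalMutation = elementId; // only the latest mutation is remembered
}

function onResponseComplete(url) {
  // Runs before the page has had any chance to use this response.
  attached.push({ url, elementId: originalMutation });
}

onResponseComplete("/first/url");  // no mutation has happened yet
onMutation("div-b");               // the page now applies the /first/url data
onResponseComplete("/second/url"); // picks up div-b, which belongs to /first/url
onMutation("div-a");               // the page now applies the /second/url data

console.log(attached);
```

Under these assumptions the first attachment targets undefined and the second targets an element that was filled from a different response, which matches the two failure modes described above.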
Solutions
Given that there is no ideal way to solve the problem, there are 2 possible ways forward:
1. Drop the requirement of showing which HTML elements were updated by a particular HTTP request, and just show, e.g., a small button that can be clicked to show a list of all HTTP responses resulting from the current page and having attached provenance data.
2. Attempt to map provenance-enriched HTTP responses to modified HTML elements in a more defensible, but still heuristic, way.
For now, to get something going, I'll go with option 1 -- something like this is needed in any case for handling non-JavaScript-initiated requests (e.g., full page loads). #5 (EDIT: originally given below) describes a straightforward approach that could be used to implement option 2 later.
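As a rough idea of what option 1 needs, the list behind the button could be built by filtering the responses seen for the current page down to those that carry provenance data. The sketch below assumes a hypothetical header name (x-provenance) and a hypothetical record shape ({ url, headers }); neither comes from the existing code.

```javascript
// Hypothetical header name; the real provenance header is not specified here.
const PROVENANCE_HEADER = "x-provenance";

// Keep only the responses that carry provenance data, in arrival order.
function collectProvenance(responses) {
  return responses
    .filter((r) => PROVENANCE_HEADER in r.headers)
    .map((r) => ({ url: r.url, provenance: r.headers[PROVENANCE_HEADER] }));
}

// Example input: one full page load and two script-initiated requests.
const seen = [
  { url: "/page", headers: { "x-provenance": "source-a" } },
  { url: "/first/url", headers: {} },
  { url: "/second/url", headers: { "x-provenance": "source-b" } },
];

const listed = collectProvenance(seen);
console.log(listed);
```

Because this never tries to link a response to an element, it treats full page loads and script-initiated requests the same way.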