openwpm / OpenWPM

A web privacy measurement framework
https://openwpm.readthedocs.io

Cap the amount of calls logged for each script, frame and tab #345

Open motin opened 5 years ago

motin commented 5 years ago

A script-level cap protects the data from exploding in size when a script makes frequent calls to many different APIs (for example Bokeh, which, when used in notebooks, can easily yield around 10k JS log calls per script).

A frame-level cap protects the data from exploding in size when a frame loads an unusually large number of scripts that each make frequent calls to instrumented APIs.

A tab-level cap protects the data from exploding in size when a tab includes a large number of frames, each loading an unusually large number of scripts that make frequent calls to instrumented APIs.

Ideally, an event should be emitted whenever a cap is reached, in order to inform analysis.
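A rough sketch of how the three cap levels and the "cap reached" event could fit together. This is not OpenWPM's implementation: the `CallCapper` class, the default cap values and the event shape are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CallCapper:
    """Counts logged calls per script, frame and tab and enforces a cap at each level."""

    # Hypothetical default caps, chosen only for illustration.
    caps: dict = field(
        default_factory=lambda: {"script": 1000, "frame": 5000, "tab": 20000}
    )
    counts: Counter = field(default_factory=Counter)
    events: list = field(default_factory=list)

    def should_log(self, tab_id: int, frame_id: int, script_url: str) -> bool:
        keys = {
            "tab": ("tab", tab_id),
            "frame": ("frame", tab_id, frame_id),
            "script": ("script", tab_id, frame_id, script_url),
        }
        # Drop the call if any level has already used up its budget.
        for level, key in keys.items():
            if self.counts[key] >= self.caps[level]:
                return False
        # Otherwise count the call at every level.
        for level, key in keys.items():
            self.counts[key] += 1
            if self.counts[key] == self.caps[level]:
                # Emit a one-time marker so analysis knows data was truncated.
                self.events.append(
                    {"type": "cap_reached", "level": level, "key": key}
                )
        return True
```

With `CallCapper(caps={"script": 2, "frame": 3, "tab": 100})`, the third call from the same script is dropped and a single `cap_reached` event with level `"script"` is recorded, so downstream analysis can tell truncated data from a quiet script.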

Example of a notebook that would benefit from script-level capping: https://metrics.mozilla.com/~sbird/overscripted-clustering/clusters/working_assumption_and_refinement/precision%20bundling%20-%20part%201.html

Example of a page that would benefit from tab/frame-level capping: https://www.independent.co.uk

englehardt commented 5 years ago

My preference here would be to include a flag that can disable these caps as they are less important for web crawling.

Do you have suggestions for these limits? The script/API combination cap is very liberal; it is mainly intended to prevent huge logs from scripts that call APIs in a never-ending loop (e.g. some scripts poll document.cookie in a setTimeout loop until a certain cookie shows up).
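To illustrate the polling case, here is a minimal sketch of a per script/API cap. The limit of 1000 is a made-up value (the actual value is not stated in this thread), and `make_script_api_filter` is a hypothetical helper, not an OpenWPM function.

```python
from collections import defaultdict

# Hypothetical limit; the real value is not given in this thread.
MAX_CALLS_PER_SCRIPT_API = 1000


def make_script_api_filter(limit: int = MAX_CALLS_PER_SCRIPT_API):
    """Return a predicate that admits at most `limit` calls per (script, API) pair."""
    seen = defaultdict(int)

    def admit(script_url: str, symbol: str) -> bool:
        key = (script_url, symbol)
        if seen[key] >= limit:
            return False  # this script has already logged enough calls to this API
        seen[key] += 1
        return True

    return admit


# Simulate a script that polls document.cookie 5000 times in a timer loop.
admit = make_script_api_filter()
logged = sum(
    admit("https://example.com/poll.js", "document.cookie") for _ in range(5000)
)
```

Only the first 1000 polls are counted in `logged`; calls by the same script to a different API are still admitted, because the counter is keyed on the pair rather than on the script alone.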

motin commented 5 years ago

> My preference here would be to include a flag that can disable these caps as they are less important for web crawling.

Sure. A cap value of -1 could indicate that the cap is disabled.
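A small sketch of how the -1 convention might be checked. The constant and function names are hypothetical; only the idea of -1 as a "disabled" value comes from this comment.

```python
CAP_DISABLED = -1  # sentinel value suggested above


def is_over_cap(count: int, cap: int) -> bool:
    """Return True when `count` has reached `cap`; a cap of -1 never triggers."""
    if cap == CAP_DISABLED:
        return False
    return count >= cap


# Hypothetical per-level settings for a crawl that only keeps the script-level cap.
crawl_caps = {"script": 1000, "frame": CAP_DISABLED, "tab": CAP_DISABLED}
```

Using a sentinel keeps the configuration to one integer per level, at the cost of making -1 a special case that every check has to handle.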

> Do you have suggestions for these limits?

Not yet, and I am all for keeping the default caps very liberal to begin with. These caps are meant to protect the integrity of the data aggregation pipeline when crawlers / users visit problematic sites.

FYI, as a workaround in JESTr, we are adding a fail-safe that batches HTTP and JS packets by web navigation. Together with a client-enforced 500kb limit on telemetry pings, this lets us avoid extreme volumes of packets being sent due to certain scripts.
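The batching idea could look roughly like the sketch below. It is not the JESTr code: the packet shape, the `navigation_id` field, the reading of "500kb" as 500 KiB, and the choice to drop (and count) packets that do not fit are all assumptions made for illustration.

```python
import json
from collections import defaultdict

MAX_PING_BYTES = 500 * 1024  # assumes "500kb" means 500 KiB


def build_pings(packets, max_bytes=MAX_PING_BYTES):
    """Group packets by navigation and cap each resulting ping at `max_bytes`."""
    by_navigation = defaultdict(list)
    for packet in packets:
        by_navigation[packet["navigation_id"]].append(packet)

    pings = []
    for navigation_id, group in by_navigation.items():
        kept, size, dropped = [], 0, 0
        for packet in group:
            # Size of the serialized packet only; the ping envelope is not counted.
            encoded = len(json.dumps(packet).encode("utf-8"))
            if size + encoded > max_bytes:
                dropped += 1  # record the loss instead of silently truncating
                continue
            kept.append(packet)
            size += encoded
        pings.append(
            {"navigation_id": navigation_id, "packets": kept, "dropped": dropped}
        )
    return pings
```

Keeping a `dropped` count in each ping serves the same purpose as the cap event requested in the issue: analysis can see that a navigation produced more data than was sent.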