openwpm / OpenWPM

A web privacy measurement framework
https://openwpm.readthedocs.io

Update printing of crawl settings to console #700

Open birdsarah opened 4 years ago

birdsarah commented 4 years ago

I find it extremely hard to read the browser configuration that is dumped to the console:

Keys:                                                                                                                                                                                                              
{                                                                                                                                                                                                                  
  "crawl_id": 0,                                                                                                                                                                                                   
  "adblock-plus": 1,                                                                                                                                                                                               
  "bot_mitigation": 2,                                                                                                                                                                                             
  "browser": 3,                                                                                                                                                                                                    
  "callstack_instrument": 4,                                                                                                                                                                                       
  "cookie_instrument": 5,                                                                                                                                                                                          
  "disconnect": 6,                                                                                                                                                                                                 
  "display_mode": 7,                                                                                                                                                                                               
  "donottrack": 8,                                                                                                                                                                                                 
  "extension_enabled": 9,                                                                                                                                                                                          
  "ghostery": 10,                                                                                                                                                                                                  
  "http_instrument": 11,                                                                                                                                                                                           
  "https-everywhere": 12,                                                                                                                                                                                          
  "js_instrument": 13,                                                                                                                                                                                             
  "js_instrument_settings": 14,                                                                                                                                                                                    
  "navigation_instrument": 15,                                                                                                                                                                                     
  "prefs": 16,                                                                                                                                                                                                     
  "random_attributes": 17,                                                                                                                                                                                         
  "save_content": 18,                                                                                                                                                                                              
  "tp_cookies": 19,                                                                                                                                                                                                
  "tracking-protection": 20,                                                                                                                                                                                       
  "ublock-origin": 21                                                                                                                                                                                              
}        

This is followed by a second listing that maps each numerical key to its value.

This is exacerbated by the new generic JS instrumentation - so the value for 14 is a large JSON blob.

Can we do better?

englehardt commented 4 years ago

I totally agree. This was the best I could come up with back when I originally introduced it, but I've hated that list of keys since then.

Do you have any suggestions? I wonder if there are some libraries that can help here.

birdsarah commented 4 years ago

My starting point improvement would just be:

  crawl_id: ABC,                                                                                                                                                                                                   
  adblock-plus: True,                                                                                                                                                                                               
  bot_mitigation: False,                                                                                                                                                                                             
  browser: True,                                                                                                                                                                                                    
  callstack_instrument: True,                                                                                                                                                                                       
  cookie_instrument: True,                                                                                                                                                                                          
  disconnect: True,                                                                                                                                                                                                 
  display_mode: 'headless',                  
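A minimal sketch of this first suggestion, assuming the browser parameters are available as a plain Python dict. The function name `format_settings` and the example values are hypothetical and not part of OpenWPM:

```python
def format_settings(settings: dict) -> str:
    """Render each setting as an indented 'key: value,' line."""
    # repr() keeps strings quoted, so 'headless' stays distinct from True/False.
    return "\n".join(f"  {key}: {value!r}," for key, value in settings.items())


# Hypothetical example values, not real OpenWPM defaults.
example = {"crawl_id": "ABC", "adblock-plus": True, "display_mode": "headless"}
print(format_settings(example))
```

Keeping each key next to its value removes the need to cross-reference a numbered key list.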

Moving beyond that, we could do an ASCII table:

Param                         Value
-------------------------------------
crawl_id                      True
...
callstack_instrument          True
display_mode                  'headless'

But I actually think that would be harder to read than the first one where the keys and values are close to each other.
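For comparison, the table variant could be sketched roughly as follows. This assumes the settings are a plain dict; `format_table` is a hypothetical helper, not an existing OpenWPM function:

```python
def format_table(settings: dict) -> str:
    """Render settings as a two-column ASCII table with left-aligned keys."""
    # Size the first column from the longest key, plus two spaces of padding.
    key_width = max([len("Param")] + [len(key) for key in settings]) + 2
    rows = [f"{'Param':<{key_width}}Value"]
    rows.append("-" * (key_width + len("Value")))
    rows.extend(f"{key:<{key_width}}{value!r}" for key, value in settings.items())
    return "\n".join(rows)


# Hypothetical example values.
print(format_table({"crawl_id": "ABC", "callstack_instrument": True}))
```

The column width grows with the longest key, which is what pushes short keys far away from their values.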

I would suggest separating the listing into some distinct groups, e.g. "short" (crawl_id, display_mode, ... things with text), "boolean" (all the true/false values), and "long" (e.g. the new js_instrument_settings, which should be rendered separately so as not to make reading all the other things impossible). ("short", "boolean" and "long" are terrible names and I'm not suggesting we use them, just suggesting some conceptual splits.)
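One way the conceptual split might look in code, assuming a plain dict of settings. The group names follow the placeholders above, and both `group_settings` and the 100-character threshold are illustrative assumptions:

```python
def group_settings(settings: dict, max_len: int = 100) -> dict:
    """Split settings into 'boolean', 'short' and 'long' groups."""
    groups = {"boolean": {}, "short": {}, "long": {}}
    for key, value in settings.items():
        if isinstance(value, bool):
            groups["boolean"][key] = value
        elif len(repr(value)) > max_len:
            # Values such as a large JSON blob are set aside for separate rendering.
            groups["long"][key] = value
        else:
            groups["short"][key] = value
    return groups


# Hypothetical example values.
grouped = group_settings(
    {"crawl_id": "ABC", "donottrack": False, "js_instrument_settings": "x" * 500}
)
for name, members in grouped.items():
    print(name, sorted(members))
```

Each group can then be printed with whatever layout suits it, with the long values last.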

As an alternative to the conceptual splits, we could alphabetize the keys and cap the printed length of each value at, say, 100 characters. Any value that was capped would then be printed in full below the initial key list:

  adblock-plus: True,                                                                                                                                                                                                                                                                                                                                          
  bot_mitigation: False,                                                                                                                                                                                             
  browser: True,                                                                                                                                                                                                    
  callstack_instrument: True,                                                                                                                                                                                       
  cookie_instrument: True,   
  crawl_id: ABC,                                                                                                                                                                                                  
  disconnect: True,                                                                                                                                                                                                 
  display_mode: 'headless',         
  ....
  js_instrument_settings: '[{"object": window['ScriptProcessorNode'].prototype, "instrumentedName": "ScriptProcessorNode",  ......Too long to render - see JS_INSTRUMENT_SETTINGS below'.

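The alphabetize-and-cap idea could be sketched like this, again assuming a plain dict. `format_capped`, the exact wording of the truncation notice, and the upper-cased section heading are assumptions made for illustration:

```python
def format_capped(settings: dict, max_len: int = 100) -> str:
    """List settings alphabetically, truncating long values and printing them in full below."""
    lines, overflow = [], []
    for key in sorted(settings):
        rendered = repr(settings[key])
        if len(rendered) > max_len:
            notice = f"(too long to render, see {key.upper()} below)"
            lines.append(f"  {key}: {rendered[:max_len]}... {notice}")
            overflow.append(f"{key.upper()}:\n{rendered}")
        else:
            lines.append(f"  {key}: {rendered},")
    if overflow:
        # A blank line separates the key list from the full-length values.
        lines += [""] + overflow
    return "\n".join(lines)


# Hypothetical example values.
print(format_capped({"display_mode": "headless", "browser": True,
                     "js_instrument_settings": "x" * 300}))
```

The short settings stay scannable at a glance, while nothing is lost because every capped value still appears in full at the end.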
englehardt commented 4 years ago

> This is exacerbated by the new generic JS instrumentation - so the value for 14 is a large JSON blob.

This was fixed by #733, but it's still just a small improvement on a bad design.