mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0

Dataset Description #76

Open ShilpaSangappa opened 5 years ago

ShilpaSangappa commented 5 years ago

We need a section describing each column of the dataset. Even a single-line description of each field would be very helpful for somebody who is starting to work with the dataset. Most multivariate datasets come with a description of each field.

For example, see https://archive.ics.uci.edu/ml/datasets/cardiotocography# where the section "Attribute Information" describes each attribute/column.

Overscripted dataset Attributes: ['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4', 'argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments', 'arguments_n_keys', 'call_id', 'call_stack', 'file_name', 'func_name', 'in_crawl_list', 'in_iframe', 'in_stripped_crawl_list', 'location', 'locations_len', 'operation', 'script_url', 'symbol', 'time_stamp', 'value_1000', 'value_len']

Descriptions for the above dataset attributes need to be added.

birdsarah commented 5 years ago

There are field descriptions for the raw data in the schema here: https://github.com/mozilla/overscripted/blob/master/data_prep/raw_data_schema.template

The additional fields in_crawl_list and in_stripped_crawl_list are present because I processed this data from my own copy of the data. These fields can be ignored for now.

The other additional fields could be documented in a README in the data_prep folder if you want to.
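As a possible starting point for such a README, here is a minimal Python sketch that turns the attribute list from this issue into a stub with one line per field. The function name `readme_stub` and the `TODO` placeholder text are illustrative assumptions, not something that exists in the repository, and the actual descriptions would still have to be written by hand.

```python
# Attribute names copied from the list at the top of this issue.
ATTRIBUTES = [
    "argument_0", "argument_1", "argument_2", "argument_3", "argument_4",
    "argument_5", "argument_6", "argument_7", "argument_8", "arguments",
    "arguments_n_keys", "call_id", "call_stack", "file_name", "func_name",
    "in_crawl_list", "in_iframe", "in_stripped_crawl_list", "location",
    "locations_len", "operation", "script_url", "symbol", "time_stamp",
    "value_1000", "value_len",
]

# Fields that, per the comment above, can be ignored for now.
IGNORABLE = {"in_crawl_list", "in_stripped_crawl_list"}


def readme_stub(attributes, ignorable=IGNORABLE):
    """Return one 'name: TODO ...' line per field that still needs a description."""
    lines = []
    for name in attributes:
        if name in ignorable:
            continue
        lines.append(f"{name}: TODO add a one-line description")
    return "\n".join(lines)


if __name__ == "__main__":
    print(readme_stub(ATTRIBUTES))
```

The printed stub could be pasted into a README in the data_prep folder and filled in field by field.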

14Richa commented 5 years ago

Here is an example to help understand this better. Below is an actual row from one of the parquet files. I have removed the columns that had no value or were redundant.

location: https://www.syracuse.edu/about/
operation: get
script_col: 1402
script_line: 95
script_url: https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL
symbol: window.navigator.userAgent
time_stamp: 2017-12-16 01:27:46.738
value: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0

Now if we go to the location, we can actually see the call being made when we look at the HTML. Finally, the value being passed is a browser identification of some sort. You can see here that the value field is actually a common user agent string. Hope this helps :)

[Image: syracuse]
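To make the example row easier to read, here is a minimal Python sketch that pairs each column name with its value. The names and values are copied from the row in this comment; the helpers `looks_like_user_agent` and `describe` are hypothetical and exist only for illustration.

```python
# Column names and values copied from the example row in this comment.
COLUMNS = [
    "location", "operation", "script_col", "script_line",
    "script_url", "symbol", "time_stamp", "value",
]
VALUES = [
    "https://www.syracuse.edu/about/",
    "get",
    "1402",
    "95",
    "https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL",
    "window.navigator.userAgent",
    "2017-12-16 01:27:46.738",
    "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
]

# One row of the dataset as a mapping from column name to value.
row = dict(zip(COLUMNS, VALUES))


def looks_like_user_agent(value):
    """Hypothetical heuristic: most browser user agent strings start with 'Mozilla/'."""
    return value.startswith("Mozilla/")


def describe(row):
    """Summarize the row in one sentence."""
    return (
        f"The script {row['script_url']} performed a '{row['operation']}' "
        f"on {row['symbol']} while {row['location']} was loaded."
    )


if __name__ == "__main__":
    print(describe(row))
    print("value looks like a user agent:", looks_like_user_agent(row["value"]))
```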

birdsarah commented 5 years ago

Note that you cannot see the "call" being made when you look at the html. You can see the request to load the script.

In the context of this dataset, "call" means individual calls to individual JavaScript APIs that are made by the script, in this case googletagmanager.js
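To illustrate the distinction, here is a minimal Python sketch in which a script is loaded once but accounts for several recorded JavaScript API calls. The first record uses values from the example row earlier in this thread; the other records, and the function name `calls_per_script`, are made up for illustration.

```python
from collections import Counter

GTM = "https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL"

# Each record stands for one JavaScript API call made by a script.
calls = [
    # Values taken from the example row in this thread.
    {"script_url": GTM, "symbol": "window.navigator.userAgent", "operation": "get"},
    # Hypothetical records, invented for this sketch.
    {"script_url": GTM, "symbol": "window.document.cookie", "operation": "get"},
    {"script_url": "https://example.com/analytics.js",
     "symbol": "window.navigator.userAgent", "operation": "get"},
]


def calls_per_script(records):
    """Count the recorded API calls for each script URL."""
    return Counter(record["script_url"] for record in records)


counts = calls_per_script(calls)
distinct_scripts = len(counts)         # scripts that were loaded
total_api_calls = sum(counts.values())  # individual JS API calls they made

if __name__ == "__main__":
    print(distinct_scripts, "scripts made", total_api_calls, "API calls")
```

In this sketch, loading a script is one event, while each row in the dataset corresponds to one API call made by that script.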

birdsarah commented 5 years ago

It is unfortunate that the Medium blog post confuses this issue. Here's a piece of the discussion between @aSquare14 and me about this on the Gitter chat on Mar 11:

@aSquare14: I was reading the blog post which is mentioned in the Readme. And I have a question.
"Given the set of pages making calls to session replay providers, we also looked into the consistency of SSL usage across these calls. Interestingly, the majority of such calls were made over HTTPS (75.7%), and 49.9% of the pages making these calls were accessed over HTTPS." What's the difference between the calls being made over HTTPS and accessed over HTTPS? I'm a little confused.

@birdsarah: .... this sentence is unclear. Usually when I'm talking about this dataset, the "calls" I'm referring to are JS API calls. Those calls have no relation to HTTP/HTTPS; that is just how a script is loaded. Having looked at the blog post again, the previous paragraph says "checked for calls to script URLs", so in this context (and for this whole section) "calls" appears to mean accessing resources.
The Medium blog post has a commenting facility, so feel free to add a clarifying comment for future readers. This happened because a variety of authors contributed to the post. But it is a clear change in the use of the word "calls", which was introduced earlier to mean JS calls.

I initially suggested that @aSquare14 post a comment on Medium, but perhaps some clarification in the main README is in order.
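One way to read the two percentages in the quoted passage is that one refers to how the script was requested and the other to how the page itself was accessed. The Python sketch below checks both for a record, assuming that `script_url` holds the script address and `location` holds the page address, as in the example row earlier in this thread. The second record and the function names are hypothetical, and this is an interpretation rather than the blog post's actual method.

```python
from urllib.parse import urlparse


def is_https(url):
    """Return True if the URL uses the https scheme."""
    return urlparse(url).scheme == "https"


def https_summary(records):
    """Return a (page_over_https, script_over_https) pair for each record."""
    return [
        (is_https(record["location"]), is_https(record["script_url"]))
        for record in records
    ]


records = [
    # Values taken from the example row in this thread.
    {"location": "https://www.syracuse.edu/about/",
     "script_url": "https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL"},
    # Hypothetical record: a page served over HTTP that loads a script over HTTPS.
    {"location": "http://example.com/",
     "script_url": "https://example.org/replay.js"},
]

if __name__ == "__main__":
    for page_ok, script_ok in https_summary(records):
        print("page over HTTPS:", page_ok, "| script over HTTPS:", script_ok)
```

Neither check says anything about JS API calls, which are made by the script after it has been loaded.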

birdsarah commented 5 years ago

@mlopatka - should we edit the Medium blog post?

mlopatka commented 5 years ago

@birdsarah I've updated the blog post to align our phrasing. Good catch.