aliamcami commented 5 years ago

The question that prompted this analysis is: "Are all big values valid JSON?"

Overview

All of the largest values are valid JSON, but they make up a very small percentage of the whole dataset.

Most of the data has a small value_len

(mean = 1356 for the 10% sample)

95.58% of the data has a value_len smaller than the mean
4.42% is bigger than the mean
9.35% are valid JSON

Values above the mean:

61.54% are NOT valid JSON
38.46% are valid JSON

Values that are 1 standard deviation (std) above the mean

(std = 26310 for 10% sample):

0.11% are NOT valid JSON
99.88% are valid JSON
The bigger the value, the greater the chance that it is valid JSON

Values 4 std above the mean

100% are valid JSON
The biggest non-JSON value has a length of 104653

The top 46745 values by value_len are all valid JSON; that is 9.35% of the filtered sample (value_len > mean) and 0.41% of the original 10% sample.
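The percentages above could be recomputed with a check along these lines. This is a minimal sketch, not the notebook's code: `is_valid_json` and `share_valid_json_above` are hypothetical helpers, the sample values are made up, and it assumes "valid JSON" means the string parses to an object or an array.

```python
import json
from statistics import mean, pstdev


def is_valid_json(value):
    """Return True if value parses as a JSON object or array.

    Assumption: bare scalars such as "123" or "true" are not counted,
    because they would make many short strings look like JSON.
    """
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return False
    return isinstance(parsed, (dict, list))


def share_valid_json_above(values, threshold):
    """Percentage of values longer than threshold that are valid JSON."""
    selected = [v for v in values if len(v) > threshold]
    if not selected:
        return None  # nothing above the threshold
    return 100 * sum(is_valid_json(v) for v in selected) / len(selected)


# Made-up sample standing in for the `value` column.
values = [
    "a",
    "1",
    "xy",
    '{"k": 1}',
    "plain text",
    '{"items": [1, 2, 3], "name": "example"}',
]
lengths = [len(v) for v in values]
mu = mean(lengths)
sigma = pstdev(lengths)

print("above mean:", share_valid_json_above(values, mu))
print("above mean + 1 std:", share_valid_json_above(values, mu + sigma))
```

Whether scalars should count as valid JSON is a judgement call; changing the `isinstance` check is enough to try the other definition.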

aliamcami commented 5 years ago

I was questioned by @birdsarah:

"what are your next questions? i'm keen to see from you what questions this work has thrown up for you - are there groups / themes to these questions? if you were concerned about tracking / privacy what would you look at next?"

So I organized my questions into groups/themes, and this is what I came up with:

About JSONs:

Are the JSON values always from the same location or related domains?
Is there a set of location domains that always produces JSON?
Do the JSON values follow a structural pattern? What pattern?
What data does the JSON hold? Is there any pattern in the content?
Do they contain nested JSON? CSS? HTML? JavaScript? This would need a recursive study of the JSON properties.
Is a JSON's structure for a single script_url domain always the same?
Is every JSON with the same structure produced by the same script_url domain?
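One way to approach the structure questions is to reduce each JSON value to a "signature" that keeps keys and types but drops the actual data, then group signatures by domain. The sketch below is only an illustration: `structure_signature` and `signatures_by_domain` are hypothetical names, the rows are made up, and it assumes the domain has already been extracted from script_url.

```python
import json
from collections import defaultdict


def structure_signature(parsed):
    """Describe the shape of parsed JSON, ignoring the actual values."""
    if isinstance(parsed, dict):
        fields = ",".join(
            f"{key}:{structure_signature(parsed[key])}" for key in sorted(parsed)
        )
        return "{" + fields + "}"
    if isinstance(parsed, list):
        # Collapse a list to the set of shapes of its items.
        inner = sorted({structure_signature(item) for item in parsed})
        return "[" + "|".join(inner) + "]"
    return type(parsed).__name__


def signatures_by_domain(rows):
    """Map each domain to the set of JSON structures seen for it."""
    seen = defaultdict(set)
    for domain, value in rows:
        try:
            parsed = json.loads(value)
        except ValueError:
            continue  # not JSON, skip this row
        seen[domain].add(structure_signature(parsed))
    return dict(seen)


# Made-up (domain, value) pairs.
rows = [
    ("example.com", '{"id": 1, "tags": ["a", "b"]}'),
    ("example.com", '{"tags": [], "id": 7}'),
    ("example.org", '{"user": {"name": "x"}}'),
    ("example.org", "not json"),
]
for domain, signatures in signatures_by_domain(rows).items():
    print(domain, len(signatures), sorted(signatures))
```

A domain with exactly one signature always produces the same structure under this definition. Note that an empty list and a list of strings count as different shapes here, which may or may not be what the analysis wants.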

General

I think some of the things here may call for a crawler investigation or just wiki reading, since someone may have already described and explained them. I just need to find, read and understand that material.

Are there other valid data types, like HTML or CSS, in the value column, or just JSON?
Where do the values come from? What are they used for?

Small: value_len < mean

What are the small values?
Do the smaller values have any pattern?
What is the majority data type?

Medium: mean < value_len < (mean + std)

How many rows are there in the intersection of “no JSON” and “everything is JSON”?
What are they? Are they from a specific script_url domain? Or related domains?

Big: value_len > (mean + std)

What are the big non-JSON values?
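The three size groups could be assigned with a small helper like the one below. It is a sketch under assumptions: `size_bucket` is a hypothetical name, and the boundary cases (value_len exactly equal to the mean or to mean + std) are not defined above, so here they are placed in "medium".

```python
def size_bucket(value_len, mean, std):
    """Classify a value length as small, medium or big.

    Assumption: lengths exactly on a boundary fall into "medium".
    """
    if value_len < mean:
        return "small"
    if value_len <= mean + std:
        return "medium"
    return "big"


# Mean and std reported for the 10% sample.
MEAN = 1356
STD = 26310

for length in (200, 5000, 104653):
    print(length, size_bucket(length, MEAN, STD))
```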

Security and data sharing:

Does the value column contain any JavaScript? Nested JavaScript?
Do the scripts in the dataset contain known malicious behaviors?
Can they collect data that threatens users' privacy?

"if you were concerned about tracking / privacy what would you look at next?"

I would love to analyze the JavaScript more deeply, but that’s a whole other area of knowledge. I think I can study common patterns of privacy intrusion and malicious behavior in JavaScript and try to correlate them with the scripts present in the dataset. This would be similar to the analysis done in the Medium article on cryptocoin mining scripts.

Statistical knowledge / coincidence:

The mean of the original 10% sample is pretty similar to the std of the sample taken after filtering for values above the mean

Why?
Is it a coincidence?
Is it always like this?
Is it a statistical pattern?

aliamcami commented 5 years ago

Thank you for your incredible review. I updated the PR to WIP and I'll leave it that way until the following are ready:

Study and implement how to best plot the requested graphs
Make a readme with those questions
Clean up the notebook

About the hardcoded values: I left them hardcoded to avoid recalculating them every time I start the notebook, since that takes quite some time for me. Should I save them to a file then? Or keep variables holding the hardcoded values? Or leave them to be calculated every time?

About the follow-up questions: should I open a new PR for each of them, or add to this one when I start to tackle them?

birdsarah commented 5 years ago

"I actually left them hardcoded to eliminate the need to recalculate them every time I started the notebook"

I understand. There are trade-offs.

If you continue hard coding, you can still reduce it: there are perhaps one or two locations where you need to set that value. In places where you are just writing text to document your result, use string formatting to print the text you want with the value embedded.
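As a rough illustration of that suggestion, the values can be set in one place and embedded in the documentation text with an f-string. The variable names are hypothetical; the numbers are the mean and std reported earlier in this thread for the 10% sample.

```python
# Set once, near the top of the notebook (hypothetical variable names).
value_len_mean = 1356
value_len_std = 26310

# Text that documents the result reuses the variables instead of repeating the numbers.
summary = (
    f"Mean value_len: {value_len_mean}. "
    f"Values more than 1 std above the mean are longer than "
    f"{value_len_mean + value_len_std}."
)
print(summary)
```

If the data changes, only the two assignments need updating and every sentence that mentions them stays consistent.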

The downside of hard coding is that when you get new data you need to remember to update those values, and people coming new to run your code may not know where the number came from (which data, which field).

I would suggest it's better not to hard code, but to save a derived dataset, e.g. the data with only values greater than the mean. Then you can start again from that point, half way down your code, with those values. And yes, as you said, if necessary you could save a file with the values stored. That way you can run the notebook and repopulate from fresh data easily. There are lots of judgement calls here and no right answers, just thinking through trade-offs of maintainability and readability. While you don't generally check in data, I think small datasets (e.g. the means) can be checked in (and I often do this).
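A minimal sketch of the "save a file with the values stored" option, assuming the statistics fit in a small JSON file. `load_or_compute_stats`, `compute_value_len_stats` and the file name are hypothetical, and the lengths are made up; in the notebook the compute step would be the slow pass over the real data.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean, pstdev


def compute_value_len_stats(lengths):
    """Stand-in for the expensive computation in the real notebook."""
    return {"mean": mean(lengths), "std": pstdev(lengths), "rows": len(lengths)}


def load_or_compute_stats(path, compute):
    """Load cached statistics from path, or compute them and save them there."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    stats = compute()
    path.write_text(json.dumps(stats, indent=2))
    return stats


lengths = [3, 12, 8, 2500, 40]  # made-up value_len sample

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    stats_file = Path(tmp) / "value_len_stats.json"
    # First call computes and writes the file.
    first = load_or_compute_stats(stats_file, lambda: compute_value_len_stats(lengths))
    # Second call finds the file and skips the computation.
    second = load_or_compute_stats(stats_file, lambda: compute_value_len_stats(lengths))
    print(first)
    print(first == second)
```

Deleting the file forces a recalculation from fresh data, and the file itself is small enough to check in.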

Hope this helps. Again, no right answers here. Just craft.

On March 26, 2019 9:32:47 PM CDT, Camila Oliveira notifications@github.com wrote:

View it on GitHub: https://github.com/mozilla/overscripted/pull/81#issuecomment-476941605

aliamcami commented 5 years ago

@birdsarah I have included the following:

Graphs visualising the previous analysis
- (notebook updated to: "isJson_Quantitative_Comparasion.ipynb")
Research/analysis on how the location domain correlates to the value column
- (new notebook named: "isJson_correlation_domain_and_value.ipynb")
Readme with the future questions and overview update
Notebook cleanup

birdsarah commented 5 years ago

I still want to see the plot with x axis = value_len (really, this value_len or lower) and y axis = % valid JSON.

This would ideally be across all values, not just those above the mean.
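The data for such a plot could be built roughly as follows. This is a sketch, not the notebook's implementation: `cumulative_json_share` and `is_valid_json` are hypothetical helpers, the sample is made up, and "valid JSON" is assumed to mean a string that parses to an object or array.

```python
import json


def is_valid_json(value):
    """Assumption: only JSON objects and arrays count as valid JSON."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return False
    return isinstance(parsed, (dict, list))


def cumulative_json_share(values):
    """Return (value_len, % valid JSON among values of that length or lower)."""
    by_length = sorted(values, key=len)
    points = []
    valid = 0
    for count, value in enumerate(by_length, start=1):
        valid += is_valid_json(value)
        # Emit one point per distinct length, at the last value of that length.
        is_last = count == len(by_length)
        if is_last or len(by_length[count]) != len(value):
            points.append((len(value), 100 * valid / count))
    return points


# Made-up sample standing in for the `value` column.
sample = ["a", "[]", "xx", "{}x", '{"a":1}']
for value_len, percent in cumulative_json_share(sample):
    print(value_len, round(percent, 2))
```

The first element of each pair is the x value and the second is the y value, so the list can be handed to any plotting library as a line plot. With heavily skewed lengths, a logarithmic x axis is probably easier to read.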

aliamcami commented 5 years ago

Thank you for the amazing review, I have a better idea of what to do (and how). Thank you!

mozilla / overscripted

Issue 22: huge values are valid JSON #81
