mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0

[WIP] Issue #22 - Data Wrangling #77

Closed · aliamcami closed 5 years ago

aliamcami commented 5 years ago

Pull request related to issue #22

Goal

Identify what the "huge" values are. But how big is "huge"? Since I cannot answer that exactly, my solution was to find the most common and biggest values and start analysing from there.

Approach

Given the limitations of my machine and my knowledge, I had to find a way to drastically reduce the data into something I could actually work with.

The first thing I did was look at the biggest values in the raw data. I had quite a bit of trouble getting this, for reasons of memory and processing power, but I did.
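
A minimal sketch of what this first pass might look like, assuming the crawl data is a parquet dataset with a `value` column (the path here is made up); dask keeps memory use manageable by working through the files in chunks:

```python
import dask.dataframe as dd

# Load only the column we need, lazily.
df = dd.read_parquet('data/crawl.parquet', columns=['value'])

# Length of each value string, then the ten largest lengths.
lengths = df['value'].str.len()
print(lengths.nlargest(10).compute())
```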

I noticed that even when the whole value was different, the beginning of it was usually something common, like the name of the service.

I decided to use this "first" word as the key to group by, and I named it "domain". By doing so I could count the occurrences to see which ones are the most common and analyse those.
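
One way to derive that key (a sketch, not the PR's actual code): the keys in the table below all end with a ".", which suggests taking everything up to and including the first period, though the exact rule used isn't shown here.

```python
# Grouping key ("domain"): everything up to and including the first
# period. This is an assumption based on the trailing '.' visible on
# the keys in the Top 10 table; rows with no period come out as NaN.
df['value_domain'] = df['value'].str.extract(r'^(.*?\.)', expand=False)
```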

Filtering

Since the problem is the big values, the smaller ones are not interesting here, so we can filter them out.

Even before the group by, I filtered out every single row whose value was 2000 characters or fewer. That reduced quite a bit of the data I had to work with at the group by step.

I was still left with too many rows after the group by, so I decided that any "domain" without at least one occurrence of 5000 or more characters was not worth my time for this specific task, and filtered those out as well. Both filters are sketched below.
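
Roughly, the two filters could look like this. A pandas sketch, assuming `df` (or a sample of it) fits in memory at this point, e.g. after a `.compute()` on the dask frame above; the thresholds are the ones described in the text:

```python
# Precompute value lengths once; both filters use them.
df['value_len'] = df['value'].str.len()

# Filter 1: drop every row of 2000 characters or fewer.
big = df[df['value_len'] > 2000]

# Filter 2: keep only domains with at least one 5000+ character value.
max_len = big.groupby('value_domain')['value_len'].max()
keep = max_len[max_len >= 5000].index
big = big[big['value_domain'].isin(keep)]
```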

Data Wrangling

At this point I was left with 220 groups, keyed by 'domain', each with new aggregate data: the mean, standard deviation, minimum, and maximum value length in characters, plus an occurrence count.

With this new data I could sort by count and see which domains had the most occurrences. Since the smaller values were already filtered out, the top ones are the ones that interest me most for this specific task.
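
The per-domain statistics in the table below can be produced with a groupby aggregation along these lines (again a sketch under the assumptions above, not the PR's actual code):

```python
# Per-domain statistics over value length, sorted by occurrence count.
stats = (big.groupby('value_domain')['value_len']
            .agg(['mean', 'std', 'min', 'max', 'count'])
            .sort_values('count', ascending=False))
print(stats.head(10))
```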

Top 10

| value_domain | mean | std | min | max | count |
| --- | ---: | ---: | ---: | ---: | ---: |
| {"ScribeTransport". | 4128.59 | 1406.46 | 2001 | 7211 | 93409 |
| {"ins-today-sId". | 5037.69 | 14446.52 | 2002 | 87748 | 60426 |
| {"criteo_pt_cdb_metrics_expires". | 9529.66 | 53326.72 | 2003 | 692032 | 47543 |
| font-face{font-family. | 162363.28 | 172503.75 | 2634 | 648067 | 45059 |
| {"CLOUDFLARE. | 514484.07 | 634151.12 | 4356 | 3253324 | 42660 |
| {"__qubitUACategorisation". | 64927.71 | 105887.48 | 2018 | 368966 | 40003 |
| Na9BL8mAQgqyMAy1zxOlJg$0. | 2236.68 | 178.84 | 2001 | 3312 | 37945 |
| 935971. | 3726.06 | 396.41 | 3248 | 4695 | 33010 |
| {"insdrSV". | 4026.30 | 12823.05 | 2002 | 191041 | 32981 |
| 834540. | 2218.71 | 216.20 | 2001 | 2864 | 32117 |

Cloudflare

The one that I found most interesting was the 'CLOUDFLARE' group because, out of the top 10, it has by far the biggest min and max.

I decided to take a look at the raw values for the biggest "cloudflare" appearances, and I identified that they are JSONs for (probably) some kind of configuration, or just organized scraped data. For example, most have some kind of font definition or Google Analytics script.

For example:

And many more. The "value" field also contains the response for each of these requests, which is probably why it is so big.
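
For anyone wanting to reproduce the inspection, something like this would do it (continuing the hypothetical `big` DataFrame from the sketches above):

```python
import json

# Peek at the top-level structure of the largest CLOUDFLARE values.
cf = big[big['value_domain'].str.contains('CLOUDFLARE', na=False)]
for raw in cf.nlargest(3, 'value_len')['value']:
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        print('not valid JSON')
        continue
    if isinstance(parsed, dict):
        print(list(parsed)[:10])  # first few top-level keys
```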

Next steps

I think it would be good to verify what the other top ones are. But from a very initial and superficial look, I'm guessing that they are also some sort of configuration JSON.

aliamcami commented 5 years ago

My (personal) Limitations

I had some (lots of) limitations and difficulties in dealing with this large data set.

birdsarah commented 5 years ago

Review done in chat.