Goal
Identify the "huge" values. But how huge is "huge"? Since I cannot answer that exactly, my approach was to find the most common large values and start analysing from there.
Approach
Given my machine and knowledge limitations, I had to find a way to drastically reduce the data into something I could actually work with.
The first thing I did was look at the biggest values in the raw data. I had quite a bit of trouble doing this because of memory and processing power, but I managed.
I noticed that even when the whole value was different, its beginning was usually something common, such as the name of the service.
I decided to use this first word as the key to group by, and I named it "domain". By doing so, I could count the occurrences to see which ones are the most common and analyse those.
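As a rough illustration, extracting that grouping key could look like the sketch below. This is a minimal sketch, not the exact code used: it assumes pandas, a DataFrame with a `value` column, and a whitespace-based split; the file name and the exact tokenisation rule are my assumptions.

```python
import pandas as pd

# Minimal sketch: assume the raw data is already available as a DataFrame
# with a "value" column (file name and loading strategy are placeholders).
df = pd.read_csv("values_sample.csv")

# Length of each value, used later for filtering and the statistics.
df["value_len"] = df["value"].str.len()

# "domain": the first word of the value, e.g. '{"ScribeTransport"' or
# 'font-face{font-family'. The exact split rule here is an assumption.
df["value_domain"] = df["value"].str.split().str[0]
```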
Filtering
Since the problem is the big values, the smaller ones are not interesting here, so we can filter them out.
Even before the group by, I filtered out every row whose value is 2000 characters or shorter. That considerably reduced the amount of data I had to handle in the group-by step.
I was still left with too many rows after the group by, so I decided that any "domain" without at least one occurrence of at least 5000 characters was not worth my time for this specific task, and filtered those out as well.
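In pandas terms, the two filters amount to something like this. The thresholds are the ones described above; the frame and column names are carried over from the previous (assumed) sketch.

```python
# Keep only rows whose value is longer than 2000 characters.
big = df[df["value_len"] > 2000]

# Drop every domain that never reaches 5000 characters: compute the max
# value_len per domain and keep only rows from domains at or above 5000.
domain_max = big.groupby("value_domain")["value_len"].transform("max")
big = big[domain_max >= 5000]
```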
Data Wrangling
At this point I was left with 220 groups, categorized by 'domain' and with new data for each:
- standard deviation of value_len;
- mean value_len in the group;
- min value_len found in that group;
- max value_len found in that group;
- count of how many occurrences make up that group.

With this new data I could sort by count and see which domains had the most occurrences, and since the smaller values had already been filtered out, the top ones are the ones that interest me most for this specific task.
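Concretely, the per-domain statistics and the sort by count could be produced roughly like this (again just a sketch on top of the assumed `big` frame from the filtering step):

```python
# Aggregate value_len per domain and order the groups by how many
# occurrences each one has.
stats = (
    big.groupby("value_domain")["value_len"]
    .agg(["mean", "std", "min", "max", "count"])
    .sort_values("count", ascending=False)
)

# The ten most frequent domains among the large values.
print(stats.head(10))
```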
Top 10
| value_domain | mean | std | min | max | count |
| --- | ---: | ---: | ---: | ---: | ---: |
| {"ScribeTransport" | 4128.59 | 1406.46 | 2001 | 7211 | 93409 |
| {"ins-today-sId" | 5037.69 | 14446.52 | 2002 | 87748 | 60426 |
| {"criteo_pt_cdb_metrics_expires" | 9529.66 | 53326.72 | 2003 | 692032 | 47543 |
| font-face{font-family | 162363.28 | 172503.75 | 2634 | 648067 | 45059 |
| {"CLOUDFLARE | 514484.07 | 634151.12 | 4356 | 3253324 | 42660 |
| {"__qubitUACategorisation" | 64927.71 | 105887.48 | 2018 | 368966 | 40003 |
| Na9BL8mAQgqyMAy1zxOlJg$0 | 2236.68 | 178.84 | 2001 | 3312 | 37945 |
| 935971 | 3726.06 | 396.41 | 3248 | 4695 | 33010 |
| {"insdrSV" | 4026.30 | 12823.05 | 2002 | 191041 | 32981 |
| 834540 | 2218.71 | 216.20 | 2001 | 2864 | 32117 |
Cloudflare
The one I found most interesting was the 'cloudflare' group because, out of the top 10, it has by far the biggest min and max.
I decided to take a look at the raw values for the biggest "cloudflare" appearances and found that they are JSONs for (probably) some kind of configuration, or just organized scraped data. For example, most contain some kind of font definition or a Google Analytics script.
Among many other things, the "value" field also contains the response for each of these requests, which is probably why it is so big.
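For reference, pulling those raw values out for a manual look could be done along these lines. The domain prefix is taken from the table above, and the five-row / 500-character preview is an arbitrary choice of mine, not the exact procedure used.

```python
# Grab the rows whose domain starts with the Cloudflare prefix seen in the
# table above, then look at the few largest raw values.
cf = big[big["value_domain"].str.startswith('{"CLOUDFLARE')]
biggest = cf.sort_values("value_len", ascending=False).head(5)

for value in biggest["value"]:
    print(value[:500])  # preview only the first 500 characters of each
```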
Next steps
I think it would be good to verify what the other top ones are. From a very initial and superficial look, I'm guessing that they are also some sort of configuration JSON.
I had some (lots of) limitations and difficulties in dealing with this large data set.
First of all, I'm a complete beginner at data analysis and I had absolutely no idea how to deal with it (I had to study a lot and I'm pretty sure I'm doing some weird and stupid things, but how else can I learn, right?).
Second, the computer I have right now could not handle data that big; even the samples were too large.
Third (the solution), I got a virtual machine from Vultr, but due to budget limitations I could not get one "that" good, just good enough for some basic work. The machine spec is 2 vCPU, 4 GB RAM, 80 GB SSD.
Pull request related to issue #22