Goal
Identify the "huge" values. But how huge is "huge"? Since I cannot answer that exactly, my approach was to find the most common large values and start analysing from there.
Approach
Given my machine and knowledge limitations, I had to find a way to drastically reduce the data into something I could actually work with.
The first thing I did was look at the biggest values in the raw data. I had quite a bit of trouble doing this because of memory and processing power, but I managed.
I noticed that even when the whole value was different, its beginning was usually something common, such as the name of the service.
I decided to use this first word as the key to group by, and I named it "domain". By doing so, I could count the occurrences to see which ones are the most common and analyse those.
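As a rough illustration, extracting that grouping key could look like the sketch below. This is a minimal sketch, not the exact code used: it assumes pandas, a DataFrame with a `value` column, and a whitespace-based split; the file name and the exact tokenisation rule are my assumptions.

```python
import pandas as pd

# Minimal sketch: assume the raw data is already available as a DataFrame
# with a "value" column (file name and loading strategy are placeholders).
df = pd.read_csv("values_sample.csv")

# Length of each value, used later for filtering and the statistics.
df["value_len"] = df["value"].str.len()

# "domain": the first word of the value, e.g. '{"ScribeTransport"' or
# 'font-face{font-family'. The exact split rule here is an assumption.
df["value_domain"] = df["value"].str.split().str[0]
```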
Filtering
Since the problem is the big values, the smaller ones are not interesting here, so we can filter them out.
Even before the group by, I filtered out every row whose value is 2000 characters or shorter. That considerably reduced the amount of data I had to handle in the group-by step.
I was still left with too many rows after the group by, so I decided that any "domain" without at least one occurrence of at least 5000 characters was not worth my time for this specific task, and filtered those out as well.
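In pandas terms, the two filters amount to something like this. The thresholds are the ones described above; the frame and column names are carried over from the previous (assumed) sketch.

```python
# Keep only rows whose value is longer than 2000 characters.
big = df[df["value_len"] > 2000]

# Drop every domain that never reaches 5000 characters: compute the max
# value_len per domain and keep only rows from domains at or above 5000.
domain_max = big.groupby("value_domain")["value_len"].transform("max")
big = big[domain_max >= 5000]
```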
Data Wrangling
At this point I was left with 220 groups, categorized by 'domain' and with new data for each:
- standard deviation of value_len;
- mean value_len in the group;
- min value_len found in that group;
- max value_len found in that group;
- count of how many occurrences make up that group.

With this new data I could sort by count and see which domains had the most occurrences, and since the smaller values had already been filtered out, the top ones are the ones that interest me most for this specific task.
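Concretely, the per-domain statistics and the sort by count could be produced roughly like this (again just a sketch on top of the assumed `big` frame from the filtering step):

```python
# Aggregate value_len per domain and order the groups by how many
# occurrences each one has.
stats = (
    big.groupby("value_domain")["value_len"]
    .agg(["mean", "std", "min", "max", "count"])
    .sort_values("count", ascending=False)
)

# The ten most frequent domains among the large values.
print(stats.head(10))
```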
Top 10
| value_domain | mean | std | min | max | count |
| --- | ---: | ---: | ---: | ---: | ---: |
| {"ScribeTransport" | 4128.59 | 1406.46 | 2001 | 7211 | 93409 |
| {"ins-today-sId" | 5037.69 | 14446.52 | 2002 | 87748 | 60426 |
| {"criteo_pt_cdb_metrics_expires" | 9529.66 | 53326.72 | 2003 | 692032 | 47543 |
| font-face{font-family | 162363.28 | 172503.75 | 2634 | 648067 | 45059 |
| {"CLOUDFLARE | 514484.07 | 634151.12 | 4356 | 3253324 | 42660 |
| {"__qubitUACategorisation" | 64927.71 | 105887.48 | 2018 | 368966 | 40003 |
| Na9BL8mAQgqyMAy1zxOlJg$0 | 2236.68 | 178.84 | 2001 | 3312 | 37945 |
| 935971 | 3726.06 | 396.41 | 3248 | 4695 | 33010 |
| {"insdrSV" | 4026.30 | 12823.05 | 2002 | 191041 | 32981 |
| 834540 | 2218.71 | 216.20 | 2001 | 2864 | 32117 |
Cloudflare
The one I found most interesting was the 'cloudflare' group because, out of the top 10, it has by far the biggest min and max.
I decided to take a look at the raw values for the biggest "cloudflare" appearances and found that they are JSONs for (probably) some kind of configuration, or just organized scraped data. For example, most contain some kind of font definition or a Google Analytics script.
Among many other things, the "value" field also contains the response for each of these requests, which is probably why it is so big.
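For reference, pulling those raw values out for a manual look could be done along these lines. The domain prefix is taken from the table above, and the five-row / 500-character preview is an arbitrary choice of mine, not the exact procedure used.

```python
# Grab the rows whose domain starts with the Cloudflare prefix seen in the
# table above, then look at the few largest raw values.
cf = big[big["value_domain"].str.startswith('{"CLOUDFLARE')]
biggest = cf.sort_values("value_len", ascending=False).head(5)

for value in biggest["value"]:
    print(value[:500])  # preview only the first 500 characters of each
```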
Next steps
I think it would be good to verify what the other top ones are. From a very initial and superficial look, I'm guessing that they are also some sort of configuration JSON.
I had some (lots of) limitations and difficulties in dealing with this large data set.
First of all, I'm a complete beginner at data analysis and I had absolutely no idea how to deal with it (I had to study a lot and I'm pretty sure I'm doing some weird and stupid things, but how else can I learn, right?).
Second, the computer I have right now could not handle data that big; even the samples were too large.
Third (the solution), I got a virtual machine from Vultr, but due to budget limitations I could not get one "that" good, just good enough for some basic work. The machine spec is 2 vCPU, 4 GB RAM, 80 GB SSD.
Pull request related to issue #22