mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0

Analysis on #34 #75

Open Soumya0803 opened 5 years ago

Soumya0803 commented 5 years ago

Hi @birdsarah, I realized I wrongly assumed that each row corresponds to a unique script and did not account for redundancy. I first need to find the total count of unique scripts and the counts of unique fingerprintjs, hs-analytics, and akam scripts. I should be using the value_1000_only dataset, which contains all the rows of the dataset but truncates the value field to its first 1000 characters, stored in a column called value_1000. I will also use df.head() instead of df.compute() to keep the output readable. I am working on these changes and on adding the other details mentioned in the first point. Thanks for reviewing.
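
For illustration, the deduplicated counting described above could look roughly like the sketch below. It uses a small in-memory pandas frame rather than the real value_1000_only data, and the column name `script_url`, the example URLs, and the substring patterns used to recognize each script family are all assumptions made for this example, not details confirmed in this thread.

```python
import pandas as pd

# Toy stand-in for the real dataset. The `script_url` column name and
# all URLs here are hypothetical; only `value_1000` is named in the thread.
df = pd.DataFrame({
    "script_url": [
        "https://a.example/fingerprint2.min.js",
        "https://a.example/fingerprint2.min.js",   # same script, second row
        "https://js.example/hs-analytics/123.js",
        "https://b.example/akam/10/abc",
        "https://b.example/akam/10/abc",           # same script, second row
        "https://c.example/app.js",
    ],
    "value_1000": ["x"] * 6,
})

# Deduplicate first so that each script is counted once, not once per row.
unique_scripts = df["script_url"].drop_duplicates()
total_unique = len(unique_scripts)

# Hypothetical substring patterns for the three script families.
patterns = {
    "fingerprintjs": "fingerprint",
    "hs-analytics": "hs-analytics",
    "akam": "/akam/",
}
unique_counts = {
    name: int(unique_scripts.str.contains(pat, regex=False).sum())
    for name, pat in patterns.items()
}

print("total unique scripts:", total_unique)
print(unique_counts)
print(df.head())  # a short preview instead of printing the full frame
```

On a dask dataframe the same `drop_duplicates()` and `.str.contains()` calls should be available, but they are lazy, so the final counts would need a `.compute()` while previews can stay with `.head()`.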