mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0

Overscripted Web: Data Analysis in the Open

The Systems Research Group (SRG) at Mozilla has created and open-sourced a dataset of publicly available information collected during a November 2017 Web crawl. We want to empower the community to explore the unseen or otherwise non-obvious series of JavaScript execution events that are triggered when a user visits a webpage, along with all the first- and third-party events that are set in motion when people retrieve content. Some preliminary insights already uncovered from this data are illustrated in this blog post. Ongoing analyses can be tracked here.

The crawl data hosted here was collected using OpenWPM, which is developed and maintained by the Mozilla Security Engineering team.

Submitting an analysis:

Accessing the Data

Each link below points to a bz2-compressed portion of the full dataset.

A small sample of the data is available in safe_dataset.sample.tar.bz2 to get a feel for the content without committing to the full download.
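As a minimal sketch of unpacking the sample after download, the standard-library `tarfile` module can read the `.tar.bz2` archive directly. The helper name `extract_sample` and the destination directory are illustrative assumptions, not part of the repository, and the layout of the files inside the archive is not assumed here.

```python
import tarfile
from pathlib import Path


def extract_sample(archive_path, dest_dir):
    """Extract a .tar.bz2 archive into dest_dir and return its member names.

    Hypothetical helper for illustration; not part of the repository.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter where this Python supports it.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return names


if __name__ == "__main__":
    # Assumes the sample archive has already been downloaded to the
    # current working directory.
    for name in extract_sample("safe_dataset.sample.tar.bz2", "sample_data"):
        print(name)
```

Listing the member names first is a cheap way to see which files the archive contains before opening any of them.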

Because the full dataset is very large, three samples that are large enough for meaningful analysis are also available. More details about the samples are available in data_prep/Sample Review.ipynb

The full dataset: once decompressed, the full Parquet data is approximately 70GB. Each compressed chunk is around 9GB. SHA256SUMS contains the checksums for all datasets, including the sample.
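Given the size of each chunk, it can be worth checking a download against SHA256SUMS before decompressing it. The sketch below assumes SHA256SUMS uses the common `sha256sum` layout (a hex digest, whitespace, then a file name on each line); that layout is an assumption and is not confirmed here. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size blocks so large chunks never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def parse_sums(sums_path):
    """Parse '<hex digest>  <file name>' lines (assumed sha256sum-style layout)."""
    expected = {}
    for line in Path(sums_path).read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            # sha256sum marks binary-mode entries with a leading '*'.
            expected[parts[1].lstrip("*").strip()] = parts[0].lower()
    return expected


def verify(sums_path, data_dir):
    """Return {file name: True/False} for every listed file present in data_dir."""
    results = {}
    for name, expected_hash in parse_sums(sums_path).items():
        candidate = Path(data_dir) / name
        if candidate.exists():
            results[name] = sha256_of(candidate) == expected_hash
    return results


if __name__ == "__main__":
    # Assumes SHA256SUMS and the downloaded archives are in the current directory.
    for name, ok in verify("SHA256SUMS", ".").items():
        print(name, "OK" if ok else "MISMATCH")
```

Files listed in SHA256SUMS but not yet downloaded are skipped rather than reported as failures, so the check also works with only the sample or a subset of the chunks.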

Refer to hello_world.ipynb to load and take a quick look at the data with pandas, Dask, and Spark.

New Contributor Tips

Glossary

Resources