The Systems Research Group (SRG) at Mozilla has created and open-sourced a data set of publicly available information that was collected by a November 2017 Web crawl. We want to empower the community to explore the unseen or otherwise non-obvious series of JavaScript execution events that are triggered once a user visits a webpage, and all the first- and third-party events that are set in motion when people retrieve content. Some preliminary insights already uncovered from this data are illustrated in this blog post. Ongoing analyses can be tracked here.
The crawl data hosted here was collected using OpenWPM, which is developed and maintained by the Mozilla Security Engineering team.
yyyy_mm_username__short-title
- the analyses directory already contains examples if this is not clear.
Each of the links below points to a bzip2-compressed portion of the full dataset.
A small sample of the data is available in safe_dataset.sample.tar.bz2 so you can get a feel for the content without committing to the full download.
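As a minimal sketch, the sample archive can be extracted with Python's standard library (the destination directory name below is just an illustration):

```python
import tarfile

# Extract the sample archive into a local directory. The archive name comes
# from the dataset listing above; the destination directory is arbitrary.
with tarfile.open("safe_dataset.sample.tar.bz2", mode="r:bz2") as archive:
    archive.extractall(path="safe_dataset_sample")
```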
Because the full dataset is very large, three samples that are large enough to support meaningful analysis are also available. More details about the samples can be found in data_prep/Sample Review.ipynb.
The full dataset. Unzipped, the full parquet data is approximately 70GB; each compressed chunk is around 9GB. SHA256SUMS contains the checksums for all datasets, including the sample.
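As a rough sketch, the checksums can also be verified from Python, assuming SHA256SUMS follows the conventional "&lt;hex digest&gt;  &lt;filename&gt;" layout:

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so even the ~9GB archives never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Check every downloaded file listed in SHA256SUMS that is present locally.
for line in Path("SHA256SUMS").read_text().splitlines():
    expected, name = line.split(maxsplit=1)
    if Path(name).exists():
        status = "OK" if sha256_of(name) == expected else "MISMATCH"
        print(f"{name}: {status}")
```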
Refer to hello_world.ipynb to see how to load and take a quick look at the data with pandas, dask, and spark.
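If you just want a first impression of a single chunk, a minimal pandas sketch looks roughly like this (the file path is a placeholder, and pandas needs pyarrow or fastparquet installed to read parquet):

```python
import pandas as pd

# Point this at one of the .parquet files you extracted; the name below is
# only a placeholder for illustration.
df = pd.read_parquet("safe_dataset_sample/part-00000.parquet")

# Quick orientation: size, column names, and the first few rows.
print(df.shape)
print(df.columns.tolist())
print(df.head())
```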
Contribute whatever you learn, whether from reading related research papers or from interacting with the community on Gitter, by submitting a Pull Request (PR) to the repository. You can submit the PR to the README on the main page or to the analyses folder README.
This is not a one-issue-per-person repo. All the questions are very open-ended, and different people may find very different and complementary things when looking at the same question.
Use a reaction emoji, rather than a comment like "sure", to acknowledge a comment; this helps keep threads clean while still letting folks know that the comment was seen.
You can ask for help and discuss your ideas on Gitter. Click here to join!
When you open an issue and work on a Pull Request related to it, add "WIP" (work in progress) to the PR title. When your PR is ready for review, remove the WIP tag. You can also request feedback on specific points while the PR is still a WIP.
Please reference your issues in the PR so that they are linked and auto-close. Refer to this.
If your OS is Ubuntu and you have trouble installing Spark with conda, refer to this link.
The dataset is very large. Even the subsets of the dataset are unlikely to fit into memory. Working with this dataset will typically require using Dask (http://dask.pydata.org/), Spark (http://spark.apache.org/) or similar tools to enable parallelized / out-of-core / distributed processing.
Please refer to the reading list for additional references and information.
This is a great tutorial to learn Pandas.
Tutorial on Jupyter Notebook.
We have used dask in some of our Jupyter notebooks. Dask gives you a pandas-like API but lets you work on data that is too big to fit in memory. It can be used on a single machine or a cluster; most analyses for this project were done on a single machine. Please start by reviewing the docs to learn more about it.
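A minimal sketch of what that looks like in practice (the glob pattern and the script_url column are assumptions about where the data lives and what the crawl schema contains):

```python
import dask.dataframe as dd

# Lazily reference every parquet chunk; nothing is read into memory yet.
ddf = dd.read_parquet("safe_dataset/*.parquet")

# Dask builds a task graph; .compute() triggers the actual (parallel) work.
# For example, the 20 most frequently seen script URLs:
top_scripts = ddf["script_url"].value_counts().nlargest(20).compute()
print(top_scripts)
```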
This will help you get started with Git. For visual thinkers, this tutorial can be a good start.
Other Dask resources: overview video and cheatsheet.
Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads. We use findspark to set up Spark. You can learn more about it here.
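A rough sketch of that setup, assuming a local Spark installation that findspark can locate (the app name, memory setting, and data path are placeholders):

```python
import findspark
findspark.init()  # adds the local Spark installation to sys.path

from pyspark.sql import SparkSession

# Start a local Spark session; the driver memory setting is illustrative only.
spark = (
    SparkSession.builder
    .appName("overscripted-exploration")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)

# Read the parquet chunks (path is a placeholder) and inspect the schema.
df = spark.read.parquet("safe_dataset/*.parquet")
df.printSchema()
print(df.count())
```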