leoz0214 / Parkrun-Data-Scraper

Scrape the historical summary table for a particular Parkrun event and output detailed statistics, with the ability to export to various file types.
MIT License

Parkrun Data Scraper

Parkrun is a free, weekly, timed 5k running event, based mostly in the UK and Ireland but with events worldwide.

Previously, the Parkrun website displayed various statistics that provided insights on events, such as course records and event popularity. Unfortunately, most of these have been removed from the website, apparently to promote the inclusivity of Parkrun. This has frustrated some advanced or curious runners who wish to investigate these factors before taking part in an event.

Nonetheless, most of the statistics are still deducible by processing the summary table of past events. This simple project takes the HTML of that page, parses it, and generates detailed statistics and graphs accordingly. Notably, the project goes a step further than the website previously did by computing additional metrics.

Also, Parkrun has anti-bot measures in place, which this program circumvents to a reasonable degree. To minimise the burden, a semi-manual procedure can be followed instead. It involves no programmatic requests, so it is virtually immune to anti-bot mechanisms. This is the fallback in case the automated scraping strategy fails.

Anyways, here is a list of features that this program provides:

Requirements and Installation

The program is somewhat compartmentalised so unsatisfied requirements do not automatically mean complete unusability. Note the following:

The program can be run in two ways, either through an EXE file or through Python directly. Both methods are covered below.

EXE

A Windows EXE has been built and is available in the Releases section of this project. Note that this only works on Windows, so macOS users must unfortunately run the program through Python. Similarly, if there are security issues with running the EXE, Python will be the only choice.

  1. Download the EXE from the Releases section of this project.
  2. Run the EXE. If successful, the GUI will launch, ready for use (see next section for guide).

Python

The program is written purely in Python, as it only performs simple, small-scale data scraping and processing.

Anyways, to run through Python, the following must be noted:

Provided the above is satisfied, follow these steps to run through Python:

  1. Download the project folder, ensuring all files are available.
  2. Run the src/main.py file. If successful, the GUI will launch, ready for use (see guide).

More technical users may adapt the code to meet their unique situations, which is permitted.

Usage Guide

Now that the program has been set up, the functionality of each section is explained below. Note that minor details may be omitted, so feel free to explore the program to find out how it works in greater depth.

URL Input

This is the automated scraping section where you can simply input the event URL and data will be fetched automatically and subsequently parsed. As already mentioned, Chrome must be installed on the computer for this input method to work. If this is not the case, refer to File Input for the alternative method.

The screen should look like this:

  1. Enter the URL of the Parkrun event to analyse. Validation is included to ensure robustness, but there is some leniency in the input. The following inputs are accepted:
  2. Click 'Start', and the data collection will occur followed by stats output if successful. Be patient - this may take 10-20 seconds depending on Parkrun server load.
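To illustrate what lenient URL validation could look like, here is a minimal sketch using only the standard library. It is not the program's actual validation code: the function name `normalise_event_url`, the loose "host contains parkrun" check and the choice to treat the first path segment as the event name are all assumptions made for this example.

```python
from urllib.parse import urlparse


def normalise_event_url(raw: str) -> str:
    """Turn a leniently typed event URL into its event history URL.

    Hypothetical helper for illustration only.
    """
    text = raw.strip()
    if not text:
        raise ValueError("URL is empty")
    # Tolerate a missing scheme, e.g. "www.parkrun.org.uk/bushy".
    if "://" not in text:
        text = "https://" + text
    parts = urlparse(text)
    host = parts.netloc.lower()
    # Assumption: any host containing "parkrun" is acceptable.
    if "parkrun" not in host:
        raise ValueError(f"Not a Parkrun URL: {raw!r}")
    # Assumption: the first path segment is the event name.
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError("URL does not name an event")
    event = segments[0].lower()
    return f"https://{host}/{event}/results/eventhistory/"
```

With this sketch, both `www.parkrun.org.uk/bushy` and the full event history URL resolve to the same address, while a URL on an unrelated host raises `ValueError`.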

File Input

Automated data collection may fail due to anti-bot mechanisms, missing Chrome or any other reason. In that case, a second, undetectable way of getting the data into the program is to load the web page manually, download it, and input the downloaded HTML file into the program. The file is then handled in the same way as if the HTML had been obtained programmatically.

  1. Manually navigate to the event history page for the target event e.g. https://www.parkrun.org.uk/bushy/results/eventhistory/.

  2. Right click on this page and press on 'Save As' as seen below:

  3. The file dialog should appear. Ensure the file save mode is set to 'Web Page, Complete', and then press 'Save'.

  4. Wait until the file has finished downloading, which should take no more than 10-20 seconds.

  5. You will notice a complementary folder is downloaded alongside the main HTML file. Simply ignore or delete this folder. The focus is the HTML file. Click 'Select File' in the program and select this HTML file. Provided the file is unmodified and the instructions have been followed, this should result in the data being parsed and displayed.
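For readers curious how a saved HTML file can be turned into table rows, the sketch below uses the standard library's `html.parser`. The program itself is tagged as using BeautifulSoup, so this is a stand-in rather than the project's code. The names `TableRowParser`, `parse_rows` and `load_rows` are hypothetical, and the sketch collects every `<tr>` on the page without assuming any particular column layout.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each cell, row by row, from every <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                # Join fragments and collapse runs of whitespace.
                self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr":
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html: str) -> list[list[str]]:
    """Return all table rows found in an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def load_rows(path: str) -> list[list[str]]:
    """Read a saved HTML file and return its table rows."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return parse_rows(f.read())
```

Because the parsing works on a string, the same function can serve both input methods: only the way the HTML is obtained differs.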

Output

Upon successful data parsing using either input method, the output screen will be populated with various data points relating to the event. An example output is shown below:

Data Points

The data points seen in the output screen mean the following:

Event Popularity

These metrics mainly focus on the popularity of the event - whether there are usually few participants or a lot. Some people may prefer quiet events; others may prefer larger events.
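As a rough idea of how popularity metrics can be derived from a list of finisher counts (one per past event), consider the sketch below. The function name and the specific metrics chosen are assumptions for illustration and may not match the data points the program actually reports.

```python
from statistics import mean, median


def popularity_summary(finishers: list[int]) -> dict[str, float]:
    """Summarise per-event finisher counts (hypothetical helper)."""
    if not finishers:
        raise ValueError("no events to summarise")
    return {
        "events": len(finishers),
        "mean": round(mean(finishers), 1),
        "median": median(finishers),
        "min": min(finishers),
        "max": max(finishers),
    }
```

The median is included alongside the mean because a single unusually large event (an anniversary run, for instance) can pull the mean well above a typical week.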

Competitive

These stats provide insights on the level of competitiveness of the event, designed for the most serious runners.
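Competitive metrics generally come down to comparing finish times. The sketch below assumes times appear as text in `mm:ss` or `h:mm:ss` form, which is an assumption about the data rather than something this guide specifies. All three function names are hypothetical.

```python
def time_to_seconds(text: str) -> int:
    """Convert 'mm:ss' or 'h:mm:ss' to a number of seconds."""
    parts = [int(p) for p in text.strip().split(":")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"unexpected time format: {text!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


def seconds_to_time(total: int) -> str:
    """Convert seconds back to 'mm:ss', or 'h:mm:ss' if an hour or more."""
    minutes, seconds = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"


def fastest(times: list[str]) -> str:
    """Return the quickest time from a list of time strings."""
    return min(times, key=time_to_seconds)
```

Converting to seconds first avoids the pitfall of comparing time strings directly, where `"1:02:03"` would sort before `"15:32"`.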

General

Most of these metrics can also still be seen on the website so are less interesting but the program captures them anyway for convenience.

Graphs

Whilst the data points are useful, the averages can be misleading because an event's popularity and competitiveness can change over time. For a deeper understanding of these metrics, graphs are provided to illustrate trends in the following metrics:

Hence, click the corresponding buttons in the output screen to open the graph of the selected metric against date. For example, the following displays finishers against date:

Note the strange long line around 2021. This is due to COVID, when events were paused for around a year. The large time gap between the last event before COVID and the first event after COVID produces this line. Unfortunately, this is the case for most graphs, so simply ignore it and focus on the general trend. On the other hand, it provides some insight into how COVID has affected these metrics: for example, are there more or fewer participants post-COVID than pre-COVID?
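Users adapting the code who would rather not see the long connecting line could split the series wherever consecutive events are far apart and plot each segment separately. This is not what the program does; it is only a sketch of one possible approach. The function name `split_at_gaps` and the 60-day threshold are assumptions.

```python
from datetime import date, timedelta


def split_at_gaps(points, max_gap=timedelta(days=60)):
    """Split (date, value) points into runs with no gap above max_gap.

    Hypothetical helper: returns a list of segments, each a list of
    (date, value) tuples in chronological order.
    """
    segments = []
    for when, value in sorted(points):
        if segments and when - segments[-1][-1][0] <= max_gap:
            # Close enough to the previous point: same segment.
            segments[-1].append((when, value))
        else:
            # First point, or a long pause: start a new segment.
            segments.append([(when, value)])
    return segments
```

Drawing each returned segment as its own line in matplotlib would leave the paused period blank instead of bridging it.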

Data Exportation

The data looks good in the GUI, but is rather limited nonetheless (stuck in the program). Hence, the program allows exportation of this data in either tabular or report form. This allows for external data analysis and data sharing.

Simply click the relevant save button to select the desired output format and set the save file path. Hopefully in a few seconds, the output file will have been successfully created. CSV should be fastest, followed by XLSX, then DOCX, and PDF is likely to be slowest.
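CSV is likely the fastest format because it needs nothing beyond writing plain text. A minimal sketch with the standard library's `csv` module follows. The function names and the example column headers are assumptions, not the program's actual export code.

```python
import csv


def write_csv(rows, headers, stream):
    """Write a header line and data rows to an open text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)


def export_csv(rows, headers, path):
    """Write the table to a CSV file at the given path."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        write_csv(rows, headers, f)
```

For example, `export_csv([[1, "02/10/2004", 13]], ["Event", "Date", "Finishers"], "history.csv")` would produce a two-line file that opens directly in spreadsheet software.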

Limitations

The program is simple yet effective, with clean code, but it does have limitations. Some could be fixed with further development; others not so much. These issues are non-critical, so the program is still generally functional and achieves what it should:

Disclaimer

The program is free to use and the source code can be modified as you wish. However, there is ZERO LIABILITY for damages caused by usage of the program.

IMPORTANTLY, PLEASE USE THE PROGRAM REASONABLY SUCH THAT THERE IS NO SIGNIFICANT BURDEN ON PARKRUN SERVERS. REMEMBER, THE MORE INTENSIVELY YOU USE THIS PROGRAM, THE HIGHER THE RISK OF DETECTION AND SUBSEQUENT BLOCKS.

For licensing information, see the license.