jansen88 / ufc-data

Comprehensive dataset for UFC data, compiling match history, fighter stats and betting odds
ufc web-scraping

ufc-data

📖 Contents

ℹī¸ About

The UFC (Ultimate Fighting Championship) is a global mixed martial arts (MMA) organization, hosting weekly competitive events that showcase fighters from a range of weight classes and backgrounds.

This repository contains code and resources relating to the UFC. This includes one of the most comprehensive public UFC datasets available, encompassing official match outcomes and history compiled from the UFC, fighter statistics, as well as historic betting odds.

These datasets were compiled out of personal interest, for data analysis and for testing a predictive model of match outcomes. They are also made publicly available for anyone else who finds them useful.

📁 Datasets

/data/complete_ufc_data.csv is a comprehensive UFC dataset that combines 30 years of match history (from 1994), individual fighter statistics and 9 years of historic betting odds (from Nov 2014).

Data dictionary

| Column | Sample values | Description | Source |
| --- | --- | --- | --- |
| `event_date` | `2023-09-16` | Date of UFC event | Scraped from UFC match history |
| `event_name` | `UFC Fight Night: Grasso vs. Shevchenko 2` | Name of UFC event | Scraped from UFC match history |
| `weight_class` | `Women's Flyweight` | Weight class of UFC match | Scraped from UFC match history |
| `fighter1`, `fighter2` | `Alexa Grasso`, `Valentina Shevchenko` | Fighter names; note that `fighter1` should usually be the winner of the match, as this is how the names are ordered in the official match history | Scraped from UFC match history |
| `favourite`, `underdog` | `Valentina Shevchenko`, `Alexa Grasso`, `NaN` | Names of the betting favourite and the betting underdog. Note that betting odds do not exist for older years, and that where odds do exist, there will be missing values where fighter names on the betting site and official UFC match history did not match | Scraped from historic odds on betmma.tips |
| `favourite_odds`, `underdog_odds` | `1.67`, `2.88`, `NaN` | Betting odds (decimal) | Scraped from historic odds on betmma.tips |
| `betting_outcome` | `favourite`, `underdog`, `NaN` | Whether the favourite or the underdog was the winner of the match. Provided in this format for easier querying on odds | Scraped from historic odds on betmma.tips |
| `outcome` | `fighter1`, `fighter2`, `Draw` | Match outcome; will usually be `fighter1` as this is how names are ordered in the official match history | Derived from UFC match history |
| `method` | `S-DEC`, `U-DEC`, `KO/TKO Punches` | Method of victory | Scraped from UFC match history |
| `round` | `5` | Round of victory | Scraped from UFC match history |
| `fighter1_*`, e.g. `fighter1_height`, `fighter1_dob`, `fighter1_reach`, `fighter1_sig_strikes_landed_pm`, `fighter1_takedown_avg_per15m` | | Fighter attributes for `fighter1` at time data was scraped | Derived from UFC fighter statistics |
| `fighter2_*` | | Fighter attributes for `fighter2` at time data was scraped | Derived from UFC fighter statistics |
| `events_extract_ts`, `odds_extract_ts`, `fighter_extract_ts` | `2023-09-21 02:02:55.178363` | Timestamp when dataset was scraped | |

The raw datasets (scraped from the official UFC website and betmma.tips) are also available under /data/.
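As a rough illustration of working with the odds columns in the data dictionary, the sketch below reads rows with the standard library, skips rows where odds are missing, and converts decimal odds into normalised implied probabilities. The helper names (`parse_odds`, `implied_probabilities`, `read_odds`) are hypothetical and not part of this repository, and it is an assumption that missing odds appear in the CSV as an empty field or the text `NaN`. The inline sample reuses the sample values from the table rather than reading the real file.

```python
import csv
import io


def parse_odds(value):
    """Return decimal odds as a float, or None when the value is missing.

    Assumption: missing odds are stored as an empty field or the text 'NaN'.
    """
    if value is None:
        return None
    text = value.strip()
    if text == "" or text.lower() == "nan":
        return None
    return float(text)


def implied_probabilities(favourite_odds, underdog_odds):
    """Convert a pair of decimal odds to probabilities that sum to 1."""
    raw_fav = 1.0 / favourite_odds
    raw_dog = 1.0 / underdog_odds
    total = raw_fav + raw_dog
    return raw_fav / total, raw_dog / total


def read_odds(fileobj):
    """Yield (favourite, underdog, p_favourite, p_underdog) for rows with odds."""
    for row in csv.DictReader(fileobj):
        fav = parse_odds(row.get("favourite_odds"))
        dog = parse_odds(row.get("underdog_odds"))
        if fav is None or dog is None:
            continue  # older events, or names that did not match the betting site
        p_fav, p_dog = implied_probabilities(fav, dog)
        yield row["favourite"], row["underdog"], p_fav, p_dog


# Inline sample built from the data dictionary's sample values (not the real file).
SAMPLE = """favourite,underdog,favourite_odds,underdog_odds
Valentina Shevchenko,Alexa Grasso,1.67,2.88
NaN,NaN,NaN,NaN
"""

for fav, dog, p_fav, p_dog in read_odds(io.StringIO(SAMPLE)):
    print(f"{fav}: {p_fav:.3f}  {dog}: {p_dog:.3f}")
```

To run this against the compiled dataset, pass an open handle to /data/complete_ufc_data.csv in place of the `io.StringIO` sample.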

⚒ī¸ Data extraction

🏃 Code:

✅ Features completed:

🚧 Feature backlog

📊 EDA / Data viz

Some interesting insights and visualisations are shared here:

| Insight | Visualisation |
| --- | --- |
| Historic likelihood of victory shows a strong correlation between match success and both age and average strikes landed per minute. Fighters who were younger or had superior striking output had a statistical advantage, winning ~60% of matches | (image) |
| The probability of the betting favourite winning historically rises from just above 50% to over 75% when the difference in decimal odds exceeds 2.0. This likelihood keeps increasing as the odds delta grows, with the favourite winning ~90% of matches when the delta exceeds 4.5 | (image) |
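The odds-delta insight can be checked with a small helper like the one below. This is a sketch under assumptions, not the code behind the visualisation: the function name `favourite_win_rate` is hypothetical, the delta is assumed to be `underdog_odds - favourite_odds`, and the example rows are made up purely to show the mechanics (they are not taken from the dataset).

```python
def favourite_win_rate(rows, min_delta):
    """Share of matches won by the favourite where the odds delta exceeds min_delta.

    Each row is a dict with 'favourite_odds', 'underdog_odds' (floats or None)
    and 'betting_outcome' ('favourite', 'underdog' or None).
    Returns None when no row qualifies.
    """
    wins = 0
    total = 0
    for row in rows:
        fav = row["favourite_odds"]
        dog = row["underdog_odds"]
        outcome = row["betting_outcome"]
        if fav is None or dog is None or outcome not in ("favourite", "underdog"):
            continue  # no odds recorded for this match
        # Assumption: "difference in decimal odds" means underdog minus favourite.
        if dog - fav > min_delta:
            total += 1
            if outcome == "favourite":
                wins += 1
    return wins / total if total else None


# Made-up rows, for illustration only.
example_rows = [
    {"favourite_odds": 1.5, "underdog_odds": 2.6, "betting_outcome": "favourite"},
    {"favourite_odds": 1.2, "underdog_odds": 4.5, "betting_outcome": "favourite"},
    {"favourite_odds": 1.25, "underdog_odds": 4.0, "betting_outcome": "underdog"},
    {"favourite_odds": 1.1, "underdog_odds": 7.0, "betting_outcome": "favourite"},
    {"favourite_odds": None, "underdog_odds": None, "betting_outcome": None},
]

for threshold in (0.0, 2.0, 4.5):
    print(threshold, favourite_win_rate(example_rows, threshold))
```

Returning `None` for an empty bucket keeps "no data" distinct from a genuine 0% win rate.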

🔮 Predictive model

🚧 Development of an ML model to test how well match outcomes can be predicted from fighter stats is a work in progress:

| Process | Analysis | Finding | Notebook |
| --- | --- | --- | --- |
| Feature selection | Initial GBM testing / feature selection | • Delta (of fighter1 and fighter2) features capture as much signal as individual features<br>• Highest importance features related to delta of striking stats, and surprisingly also difference in age<br>• Lowest importance features were height, reach, stance and weight class. Takedown accuracy was surprisingly less important, compared to other features e.g. takedown attempts<br>• Feature importance (all delta features): (image)<br>• SHAP values (after removing less important delta features by RFECV): (image) | `notebooks\ml experiments\20231012 Initial GBM test.ipynb` |
| Model selection | Initial GBM testing / feature selection | • Initial tests saw accuracy of 64-65%<br>• Variation in accuracy depending on hyperparameter selection, different parameters across folds - may need tuning | `notebooks\ml experiments\20231012 Initial GBM test.ipynb` |
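The delta features mentioned in the findings can be sketched as follows. This is an illustration under assumptions, not the notebook's implementation: the helper names `delta_features` and `to_example` are hypothetical, the `fighter2_*` column names are assumed to mirror the `fighter1_*` ones, and the attribute values are assumed to be already parsed as numbers. Because `fighter1` is usually the winner in this dataset, the sketch also randomly swaps the two sides so the label is not almost always the same class; that is one common approach and may differ from what the notebook does.

```python
import random

# Attribute suffixes named in the data dictionary.
STATS = ["height", "reach", "sig_strikes_landed_pm", "takedown_avg_per15m"]


def delta_features(row, stats=STATS):
    """Return {'delta_<stat>': fighter1 value minus fighter2 value}.

    A delta is None when either fighter's value is missing.
    """
    deltas = {}
    for stat in stats:
        a = row.get(f"fighter1_{stat}")
        b = row.get(f"fighter2_{stat}")
        deltas[f"delta_{stat}"] = None if a is None or b is None else a - b
    return deltas


def to_example(row, rng, stats=STATS):
    """Build (features, label) where label is 1 if the first-listed side won.

    The two sides are swapped with probability 0.5 (negating every delta and
    flipping the label). Draws and other outcomes return None.
    """
    if row["outcome"] not in ("fighter1", "fighter2"):
        return None
    features = delta_features(row, stats)
    label = 1 if row["outcome"] == "fighter1" else 0
    if rng.random() < 0.5:
        features = {k: (None if v is None else -v) for k, v in features.items()}
        label = 1 - label
    return features, label


# Made-up row, for illustration only.
example_row = {
    "fighter1_height": 180.0, "fighter2_height": 175.0,
    "fighter1_reach": 185.0, "fighter2_reach": 188.0,
    "fighter1_sig_strikes_landed_pm": 4.5, "fighter2_sig_strikes_landed_pm": 3.0,
    "fighter1_takedown_avg_per15m": None, "fighter2_takedown_avg_per15m": 1.2,
    "outcome": "fighter1",
}

print(to_example(example_row, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the swap reproducible across runs.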

🔧 Setup

Dependency management: Poetry (more actively maintained) or pip (requirements.txt exists but less frequently updated)

poetry install