ufc-data

📖 Contents

ℹ️ About

The UFC (Ultimate Fighting Championship) is a global mixed martial arts (MMA) organization, hosting weekly competitive events that showcase fighters from a range of weight classes and backgrounds.

This repository contains code and resources relating to the UFC. This includes one of the most comprehensive public UFC datasets available, encompassing official match outcomes and history compiled from the UFC, fighter statistics, as well as historic betting odds.

The purpose of compiling these datasets is for personal interest for data analysis and to test building a predictive model for match outcome on, as well as being publicly available for external interest.

📁 Datasets

/data/complete_ufc_data.csv captures a comprehensive UFC dataset uniquely combining 30 years of match history (from 1994), individual figher statistics and 9 years of historic betting odds (from Nov 2014).

Data dictionary

| Column | Sample values| Description | Source | | --- | --- | --- | --- | | `event_date` | `2023-09-16` | Date of UFC event | Scraped from UFC match history | | `event_name` | `UFC Fight Night: Grasso vs. Shevchenko 2` | Name of UFC event | Scraped from UFC match history | | `weight_class` | `Women's Flyweight` | Weight class of UFC match | Scraped from UFC match history | | `fighter1`, `fighter2` | `Alexa Grasso`, `Valentina Shevchenko` | Fighter names; note that `fighter1` should usually be the winner of the match, as this is how the names are ordered in the official match history | Scraped from UFC match history| | `favourite`, `underdog` | `Valentina Shevchenko`, `Alexa Grasso`, `NaN` | Fighter names from betting favourite and betting underdogs.

Note that betting odds do not exist for older years, and that where odds do exist, there will be missing values where fighter names on the betting site and official UFC match history did not match | Scraped from historic odds on betmma.tips| | `favourite_odds`, `underdog_odds` | `1.67`, `2.88`, `NaN` | Betting odds (decimal) | Scraped from historic odds on betmma.tips | | `betting_outcome` | `favourite`, `underdog`, `NaN` | Whether the favourite or the underdog was the winner of the match. Provided in this format for easier querying on odds | Scraped from historic odds on betmma.tips | | `outcome` | `fighter`, `fighter2`, `Draw` | Match outcome - will usually be `fighter1` as this is how names are ordered in the official match history | Derived from UFC match history| | `method` | `S-DEC`, `U-DEC`, `KO/TKO Punches`| Method of victory | Scraped from UFC match history | | `round` | `5` | Round of victory | Scraped from UFC match history | | fighter1_*
e.g., `fighter1_height`, `fighter1_dob`, `fighter1_reach`, `fighter1_sig_strikes_landed_pm`, `fighter1_takedown_avg_per15m`| | Fighter attributes for `fighter1` at time data was scraped| Derived from UFC fighter statistics | | fighter2_* | | Fighter attributes for `fighter2` at time data was scraped | Derived from UFC fighter statistics | | `events_extract_ts`, `odds_extract_ts`, `fighter_extract_ts` | `2023-09-21 02:02:55.178363 ` | Timestamp when dataset was scraped | |

The raw datasets (scraped from the official UFC website and betmma.tips are also available under /data/.

⚒️ Data extraction

🏃 Code:

To run web scraper and update match results/fighter stats/betting odds:
```
python -m ufc.scraper
```
Note that the following arguments are permitted:
- --events, --fighters, --odds, to scrape individually/multiple, rather than all. The default is to scrape all.
To run pre-processing, data cleaning, on scraped data:
```
python -m ufc.preprocessing
```

✅ Features completed:

Scrape UFC data - fighter stats, and match results
Scrape historic betting odds from betmma.tips
Pre-processing to clean data, reformat/restructure, data checks

🚧 Feature backlog

Update by appending instead of replacing all, for more efficient refreshes - fetch only new events, but update all fighter stats

📊 EDA / Data viz

Some interesting insights and visualisations are shared here:	Insight	Visualisation
Historic likelihood of victory demonstrates a strong correlation between age, and average strikes landed PM, and success in matches. Fighters with a younger age or superior striking output statistically had a competitive advantage, winning ~60% of matches
The probability of the betting favorite winning historically rises from just above 50% to over 75% when the difference in decimal odds exceeds 2.0. Moreover, this likelihood increases as the delta of odds increases, with ~90% of matches favoring the favorite when the delta exceeds 4.5.

🔮 Predictive model

🚧 Development of ML model to test how well match outcome can be predicted based on fighter stats is WiP:

Initial PoCs (GBM, logistic regression) attempting to predict match outcome from fighter attributes (had not yet scraped betting odds) saw accuracy of ~65%
This is comparable to a betting strategy of always picking the favourite (65%), which suggests that betting market sentiment may capture most information the model is currently trained on.
Significant opportunity still to iterate with further testing of features:
- Fight win streak, finish rate (knockouts, submissions)
- Derived features - durability, tag as wrester/striker/grappler etc.
- Include if fighter is favourite (if have scraped odds)
Note that MMA is a highly dynamic and unpredictable sport, frequently characterised by upsets, and that match outcome may not be consistently predictable

Process	Analysis	Finding	Notebook
Feature selection	Initial GBM testing / feature selection	• Delta (of fighter1 and fighter2) features capture as much signal as individual features • Highest importance features related to delta of striking stats, and surprisingly also difference in age • Lowest importance features were height, reach, stance and weight class. Takedown accuracy was surprisingly less important, compared to other features e.g. takedown attempts • Feature importance (all delta features) • SHAP values (after removing less important delta features by RFECV)	`notebooks\ml experiments\20231012 Initial GBM test.ipynb`
Model selection	Initial GBM testing / feature selection	• Initial tests saw accuracy of 64-65% • Variation in accuracy depending on hyperparameter selection, different parameters across folds - may need tuning	`notebooks\ml experiments\20231012 Initial GBM test.ipynb`

🔧 Setup

Dependency management: Poetry (more actively maintained) or pip (requirements.txt exists but less frequently updated)

poetry install

jansen88 / ufc-data

readme