
Organisation data Methods

The organisations included in the AI map come from three sources: organisations researching AI come from Gateway to Research, organisations that fund AI companies come from Crunchbase, and AI companies and incubators come from a proprietary dataset from Glass AI.

Gateway to Research

The Gateway to Research (GtR) data comes via Nesta's SQL database. Where possible, it is supplemented with URLs and organisation descriptions from the Crunchbase dataset (fields which are not available through GtR).

The first step is to search for projects with topic tags which we felt were relevant to AI, e.g. "Image & Vision Computing", "Robotics & Autonomy" and "Artificial Intelligence". We then find the complete set of organisations at which these projects took place, and filter these organisations using the following criteria:

The organisation is in a predefined list of organisations - which is a combination of universities listed by HESA, the list of research institutes in the UKRI eligibility list and a list of research and technology organisations (RTOs) given to us by UKRI.
The organisation received any amount of funding in the last 5 years
The organisation has at least 400 projects OR it has had a total of at least £50 million in funding
The organisation is in the UK
The organisation has longitude/latitude data

This leaves us with research organisations which are large, relevant, and recent.
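The five criteria above can be sketched as a single predicate. This is a minimal illustration in plain Python, not the project's actual code: the `ResearchOrg` record, its field names, and the reading of "last 5 years" as a comparison of calendar years are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResearchOrg:
    # Hypothetical record layout; the real GtR tables may differ.
    name: str
    in_uk: bool
    n_projects: int
    total_funding_gbp: float
    last_funded_year: Optional[int]
    lat: Optional[float]
    lon: Optional[float]


def keep_org(org, eligible_names, current_year):
    """Return True if the organisation passes all five filter criteria."""
    in_list = org.name in eligible_names
    # Assumption: "last 5 years" is approximated by calendar year.
    recent = (
        org.last_funded_year is not None
        and org.last_funded_year >= current_year - 5
    )
    # Large: at least 400 projects OR at least 50 million pounds in funding.
    large = org.n_projects >= 400 or org.total_funding_gbp >= 50_000_000
    has_coords = org.lat is not None and org.lon is not None
    return in_list and recent and large and org.in_uk and has_coords
```

Because the size test is an OR, an organisation with few projects but a large total of funding still passes.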

To supplement this data with URLs and organisation descriptions, we query the Crunchbase dataset. If a Crunchbase organisation name appears within a GtR organisation name, we add the Crunchbase organisation description and URL to the GtR record. This is only possible for about 80% of the data points.
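The name matching described here amounts to a substring test. The sketch below is illustrative only: the function name, the dictionary keys, and the case-insensitive comparison are assumptions, and the real matching may handle ties or multiple matches differently.

```python
def attach_crunchbase_fields(gtr_name, crunchbase_rows):
    """Return (description, url) from the first Crunchbase row whose name
    is contained in the GtR organisation name, or (None, None) if no row
    matches.

    crunchbase_rows: list of dicts with hypothetical keys
    'name', 'description' and 'url'.
    """
    gtr_lower = gtr_name.lower()
    for row in crunchbase_rows:
        cb_name = row["name"].strip().lower()
        # Skip empty names: an empty string is a substring of everything.
        if cb_name and cb_name in gtr_lower:
            return row["description"], row["url"]
    return None, None
```

A substring test is permissive: a short Crunchbase name can match many unrelated GtR names, so some manual checking of matches may be worthwhile.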

Crunchbase

We query the Crunchbase database via Nesta's SQL server. This data is used to find the investors of AI organisations.

We first find the organisations which are tagged with topics we felt were relevant to AI (e.g. "artificial intelligence", "augmented reality", "autonomous vehicles"). We then find all investors of these organisations, where each investor may have funded multiple AI organisations, and each AI organisation may have been funded by multiple investors. Thus, for each investor we have:

The number of AI organisations they have funded
The number of total organisations they have funded

We get the lat/long data (which Crunchbase doesn't have) for these investors using the NSPL postcode lookup.

We filter this data to only include key AI investors with the following criteria:

At least 10% of the organisations they fund are AI organisations
They have funded at least 10 organisations
The investor's address is in the UK
The "type" field for this investor is "organisation" (not "person")
The investor has longitude/latitude data
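As a rough sketch of the two steps above (counting funded organisations per investor, then filtering), assuming investments are available as (investor, organisation) pairs. The function names and argument layout are hypothetical and the code is plain Python for illustration; it is not the project's implementation.

```python
from collections import defaultdict


def investor_counts(investments, ai_orgs):
    """Map each investor to (number of AI orgs funded, total orgs funded).

    investments: iterable of (investor, organisation) pairs.
    ai_orgs: set of organisations tagged with AI-relevant topics.
    """
    funded = defaultdict(set)
    for investor, org in investments:
        funded[investor].add(org)  # sets ignore repeated funding rounds
    return {inv: (len(orgs & ai_orgs), len(orgs)) for inv, orgs in funded.items()}


def is_key_ai_investor(n_ai, n_total, in_uk, investor_type, lat, lon):
    """Apply the five criteria for a key AI investor."""
    return (
        n_total >= 10  # checked first, so the division below is safe
        and n_ai / n_total >= 0.10
        and in_uk
        and investor_type == "organisation"
        and lat is not None
        and lon is not None
    )
```

Counting distinct organisations rather than funding rounds is itself an assumption; the source does not say how repeat investments in the same organisation are treated.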

GlassAI

Our data for companies and incubators / accelerators comes from Glass AI. Through a process of scraping companies' websites and searching for AI-related keywords in the company descriptions, Glass AI provided us with a list of organisations.

If a company is also an incubator / accelerator, it is tagged as such in an 'is_incubator' field.

We get the lat/long data (which GlassAI didn't provide us with) for these companies using the NSPL postcode lookup.

The only filtering needed for this dataset was:

The company has longitude/latitude data

Merging datasets

Dataset inputs:

Dataset	Tag in final output	Has long/lat	Has city	Has postcode	Has description	Has link	Number of organisations
Gateway to Research	University / RTO	Yes	30% do	Yes	78% do (via Crunchbase)	78% do (via Crunchbase)	158
Crunchbase	Funder	Yes (via NSPL postcode lookup)	Yes	Yes	Yes	Yes	329
GlassAI	'Company' and 'Incubator / accelerators'	Yes (via NSPL postcode lookup)	None	Yes	Yes	Yes	2558

Merged dataset outputs:

Company	Funder	Incubator / accelerator	University / RTO
x	2484
x	319
x	151
x	x	65
x	x	x	8
x	x	1
x	x	1
Total	3029

The three filtered datasets are concatenated, then organisation names are cleaned so that organisations appearing in more than one of the original datasets can be merged. For example, the company CodeBase is in both the GlassAI and Crunchbase datasets.

If there is duplication, we decide which row to keep based on the following order of trust (useful if there are conflicting Links or Lat/Long values):

Trust Glass AI first - since several sources were considered to find Links and Lat/Long,
then trust GtR - since Lat/Long was given in this data,
lastly trust Crunchbase
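The trust order above can be sketched as a sort followed by a keep-first merge. Everything in this example is an assumption made for illustration: the row layout, the source labels, the rule that tags from all sources are combined, and in particular `clean_name`, since the source does not describe the actual name-cleaning rules.

```python
import re

# Lower number = more trusted source.
SOURCE_PRIORITY = {"GlassAI": 0, "GtR": 1, "Crunchbase": 2}


def clean_name(name):
    """Hypothetical cleaning: lowercase, drop punctuation, collapse spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def merge_duplicates(rows):
    """Keep one row per cleaned name, taken from the most trusted source.

    rows: list of dicts with keys 'name', 'source', 'tags' (a set),
    'link', 'lat' and 'lon'. Tags from every source are combined.
    """
    merged = {}
    for row in sorted(rows, key=lambda r: SOURCE_PRIORITY[r["source"]]):
        key = clean_name(row["name"])
        if key not in merged:
            merged[key] = {**row, "tags": set(row["tags"])}
        else:
            merged[key]["tags"] |= row["tags"]
    return list(merged.values())
```

Sorting by trust first means the link and coordinates of the surviving row always come from the most trusted source that contains the organisation.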

Adding place information

We add the 'Place' field to any data points that don't have it (which is 70% of GtR data and all of the GlassAI data) by using the postcode or lat/long data. We do this using two methods:

Query the postcode to get the city using the pgeocode Python package. We found this data source to be quite unreliable (e.g. Dulwich came up as the city) and there can be multiple city names given for the same postcode beginning. Thus, we only used the result if the city given was in a list of major cities (London, Manchester, etc.). We keep this step because it is fast, so it can be used to quickly pick the low-hanging fruit.
Query the lat/long coordinates to get the city/town using the geopy Python package. This takes longer and provides us with city, town and village names.
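The two methods combine into a fast path with a slower fallback. In the sketch below, `postcode_lookup` and `reverse_geocode` are hypothetical callables standing in for the pgeocode and geopy queries respectively; they are not functions from either package. `MAJOR_CITIES` is an illustrative subset, not the list actually used.

```python
# Illustrative subset only; the real list of major cities is longer.
MAJOR_CITIES = {"London", "Manchester"}


def find_place(postcode, lat, lon, postcode_lookup, reverse_geocode):
    """Return a place name, preferring the fast postcode lookup.

    postcode_lookup(postcode) -> city name or None  (stand-in for pgeocode)
    reverse_geocode(lat, lon) -> place name or None (stand-in for geopy)
    """
    city = postcode_lookup(postcode) if postcode else None
    # Only trust the postcode result when it names a known major city.
    if city in MAJOR_CITIES:
        return city
    if lat is None or lon is None:
        return None
    # Slower fallback: reverse-geocode the coordinates.
    return reverse_geocode(lat, lon)
```

With this shape, an unreliable postcode result such as "Dulwich" is discarded and the coordinates decide the place instead.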

Some cleaning of the place name fields is also included (e.g. converting "London Borough of Camden" to "London"). For each unique place name, we add NUTS data using the nuts-finder Python package and calculate the average lat/long coordinate of all the organisations in that place.
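The cleaning and averaging steps might look like the following. This is a sketch under stated assumptions: the single "London Borough of" rule is taken from the example above and stands in for whatever fuller set of cleaning rules was used, the function names are hypothetical, and the NUTS lookup is left out.

```python
from collections import defaultdict
from statistics import mean


def clean_place_name(name):
    """Apply one illustrative cleaning rule: London boroughs become London."""
    name = name.strip()
    if name.startswith("London Borough of"):
        return "London"
    return name


def place_centroids(orgs):
    """Average the coordinates of all organisations in each place.

    orgs: iterable of (place, lat, lon) tuples.
    Returns {cleaned place name: (mean lat, mean lon)}.
    """
    coords = defaultdict(list)
    for place, lat, lon in orgs:
        coords[clean_place_name(place)].append((lat, lon))
    return {
        place: (mean(la for la, _ in pts), mean(lo for _, lo in pts))
        for place, pts in coords.items()
    }
```

A plain arithmetic mean of latitude and longitude is a reasonable approximation over an area the size of a city, which is the scale involved here.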
