Closed mindrones closed 1 year ago
The organisations included in the AI map come from three sources: organisations researching AI come from Gateway to Research, organisations that fund AI companies come from Crunchbase, and AI companies and incubators come from a proprietary dataset from Glass AI.
The Gateway to Research (GtR) data comes via Nesta's SQL database. Where possible, it is supplemented with URLs and organisation descriptions from the Crunchbase dataset (fields which are not available through GtR).
The first step is to search for projects with topic tags that we felt were relevant to AI, e.g. "Image & Vision Computing", "Robotics & Autonomy" and "Artificial Intelligence". We then find the complete set of organisations at which these projects took place, and filter these organisations with the following criteria:
This leaves us with research organisations which are large, relevant, and recent.
To supplement this data with URLs and organisation descriptions, we query the Crunchbase dataset. If a Crunchbase organisation name is contained within a GtR organisation name, we add some of the Crunchbase data to the GtR record, namely the organisation description and the URL. This is only possible for about 80% of the data points.
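The name-containment match described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the map: the record layout, the `normalise` helper, the case-insensitive comparison and the sample organisations are all assumptions made for the example.

```python
def normalise(name):
    """Lower-case and collapse whitespace (an assumed normalisation step)."""
    return " ".join(name.lower().split())


def supplement_gtr(gtr_orgs, crunchbase_orgs):
    """Copy the Crunchbase description and URL onto each GtR organisation
    whose name contains a Crunchbase organisation name."""
    enriched = []
    for gtr in gtr_orgs:
        gtr_name = normalise(gtr["name"])
        match = None
        for cb in crunchbase_orgs:
            cb_name = normalise(cb["name"])
            # Skip empty names: an empty string is contained in every string.
            if cb_name and cb_name in gtr_name:
                match = cb
                break
        row = dict(gtr)
        row["description"] = match["description"] if match else None
        row["url"] = match["url"] if match else None
        enriched.append(row)
    return enriched


# Hypothetical records, invented purely for illustration.
gtr_orgs = [
    {"name": "University of Exampleton"},
    {"name": "Sample Research Institute"},
]
crunchbase_orgs = [
    {
        "name": "university of exampleton",
        "description": "A made-up university.",
        "url": "https://example.org",
    },
]

enriched = supplement_gtr(gtr_orgs, crunchbase_orgs)
```

Substring matching of this kind is cheap but can produce false positives when a short Crunchbase name happens to appear inside an unrelated GtR name, so matches may need manual review.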
We query the Crunchbase database via Nesta's SQL server. This data is used to find the investors of AI organisations.
We first find the organisations which are tagged with topics we felt were relevant to AI (e.g. "artificial intelligence", "augmented reality", "autonomous vehicles"). We then find all investors of these organisations, where each investor may have funded multiple AI organisations, and each AI organisation may have been funded by multiple investors. Thus, for each investor we have:
We get the lat/long data (which Crunchbase doesn't have) for these investors using the NSPL postcode lookup.
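The many-to-many relationship between investors and AI organisations can be sketched as a simple grouping step. This is only an illustration under assumed inputs: the function name, the shape of `org_topics` and `investments`, and the sample organisations and investors are hypothetical and do not reflect the Crunchbase schema.

```python
from collections import defaultdict


def investors_of_ai_orgs(org_topics, investments, ai_topics):
    """Return {investor: set of AI organisations they funded}.

    org_topics:  {organisation: list of topic tags}
    investments: iterable of (investor, organisation) pairs
    ai_topics:   set of lower-case topic tags treated as AI-relevant
    """
    # Keep organisations with at least one AI-relevant tag.
    ai_orgs = {
        org
        for org, topics in org_topics.items()
        if ai_topics & {t.lower() for t in topics}
    }
    funded = defaultdict(set)
    for investor, org in investments:
        if org in ai_orgs:
            funded[investor].add(org)
    return dict(funded)


# Hypothetical data, invented for illustration.
ai_topics = {"artificial intelligence", "augmented reality", "autonomous vehicles"}
org_topics = {
    "Org A": ["Artificial Intelligence", "Software"],
    "Org B": ["Autonomous Vehicles"],
    "Org C": ["Retail"],
}
investments = [
    ("Investor X", "Org A"),
    ("Investor X", "Org B"),
    ("Investor Y", "Org A"),
    ("Investor Z", "Org C"),
]

funded = investors_of_ai_orgs(org_topics, investments, ai_topics)
```

Here "Investor X" maps to two AI organisations, "Org A" has two investors, and "Investor Z" is dropped because it only funded a non-AI organisation.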
We filter this data to only include key AI investors with the following criteria:
Our data for companies and incubators / accelerators comes from Glass AI. By scraping companies' websites and searching for AI-related keywords in the company descriptions, Glass AI provided us with a list of organisations.
If a company is also an incubator / accelerator, it is tagged as such in an 'is_incubator' field.
We get the lat/long data (which GlassAI didn't provide us with) for these companies using the NSPL postcode lookup.
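A postcode lookup of this kind amounts to a join on a normalised postcode. The sketch below assumes the lookup has already been loaded into a dictionary mapping postcodes to (lat, long) pairs; the function names, the record layout, and the postcode and coordinates shown are made up for the example and are not taken from the NSPL file.

```python
def normalise_postcode(postcode):
    """Upper-case and strip all whitespace so 'ab1 2cd' and 'AB12CD' compare equal."""
    return "".join(postcode.upper().split())


def add_coordinates(orgs, postcode_lookup):
    """Return copies of the organisation records with 'lat' and 'long' filled in
    from the lookup, or None where the postcode is not found."""
    located = []
    for org in orgs:
        lat, long_ = postcode_lookup.get(
            normalise_postcode(org["postcode"]), (None, None)
        )
        located.append({**org, "lat": lat, "long": long_})
    return located


# Hypothetical lookup and records; the postcode and coordinates are invented.
postcode_lookup = {"AB12CD": (57.1, -2.1)}
orgs = [
    {"name": "Org A", "postcode": "ab1 2cd"},
    {"name": "Org B", "postcode": "ZZ9 9ZZ"},
]

located = add_coordinates(orgs, postcode_lookup)
```

Records whose postcode is missing from the lookup keep `None` coordinates rather than being dropped, so they can be inspected or handled separately.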
The only filtering needed for this dataset was:
Dataset inputs:
Dataset | Tag in final output | Has long/lat | Has city | Has postcode | Has description | Has link | Number of organisations |
---|---|---|---|---|---|---|---|
Gateway to Research | University / RTO | Yes | 30% do | Yes | 78% do (via Crunchbase) | 78% do (via Crunchbase) | 158 |
Crunchbase | Funder | Yes (via NSPL postcode lookup) | Yes | Yes | Yes | Yes | 329 |
GlassAI | 'Company' and 'Incubator / accelerators' | Yes (via NSPL postcode lookup) | None | Yes | Yes | Yes | 2558 |
Merged dataset outputs:
Company | Funder | Incubator / accelerator | University / RTO | Number of data points |
---|---|---|---|---|
x | 2484 | |||
x | 319 | |||
x | 151 | |||
x | x | 65 | ||
x | x | x | 8 | |
x | x | 1 | ||
x | x | 1 | ||
Total | 3029 |
The three filtered datasets are concatenated, and organisation names are then cleaned so that organisations appearing in more than one of the original datasets can be merged. For example, the company CodeBase is in both the GlassAI and Crunchbase datasets.
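One way to picture the concatenate, clean and merge step is the sketch below. The cleaning rules (lower-casing, stripping punctuation and a few company suffixes), the `SUFFIXES` set, the function names and the record layout are assumptions for illustration; the actual cleaning used for the map is not specified here. The tags attached to the CodeBase rows are also illustrative.

```python
import re

# Assumed list of suffixes to ignore when comparing names.
SUFFIXES = {"ltd", "limited", "plc", "llp"}


def clean_name(name):
    """Lower-case, replace punctuation with spaces, and drop common company suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def merge_by_name(rows):
    """Group rows that share a cleaned name, collecting the tags from each source."""
    merged = {}
    for row in rows:
        key = clean_name(row["name"])
        # The first spelling seen is kept as the display name.
        entry = merged.setdefault(key, {"name": row["name"], "tags": set()})
        entry["tags"].add(row["tag"])
    return list(merged.values())


# Concatenated rows from the filtered datasets (illustrative values).
rows = [
    {"name": "CodeBase", "tag": "Company"},
    {"name": "Codebase Ltd", "tag": "Funder"},
    {"name": "Org A", "tag": "University / RTO"},
]

merged = merge_by_name(rows)
```

This sketch only unions the tags; it does not resolve conflicting links or coordinates between duplicate rows.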
If there is duplication, we decide which rows to keep based on the following criteria (useful if there are conflicting Links or Lat/Long values):
We add the 'Place' field to any data points that don't have it (which is 70% of GtR data and all of the GlassAI data) by using the postcode or lat/long data. We do this using two methods:
Some cleaning of the place name fields is also included (e.g. converting "London Borough of Camden" to "London"). For each unique place name we find, we add NUTS data using the nuts-finder Python package and calculate the average lat/long coordinate across all the organisations in that place.
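The place-name cleaning and coordinate averaging can be sketched as below. The `PLACE_ALIASES` mapping contains only the one example given above, and the function name, record layout and coordinates are hypothetical. The NUTS lookup is left out of the sketch.

```python
from collections import defaultdict

# Only the example conversion mentioned in the text; the full mapping is not shown here.
PLACE_ALIASES = {"London Borough of Camden": "London"}


def place_centroids(orgs):
    """Return {place: (mean lat, mean long)} after applying the place-name aliases."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # lat sum, long sum, count
    for org in orgs:
        place = PLACE_ALIASES.get(org["place"], org["place"])
        acc = totals[place]
        acc[0] += org["lat"]
        acc[1] += org["long"]
        acc[2] += 1
    return {
        place: (lat_sum / n, long_sum / n)
        for place, (lat_sum, long_sum, n) in totals.items()
    }


# Hypothetical organisations with invented coordinates.
orgs = [
    {"name": "Org A", "place": "London Borough of Camden", "lat": 51.0, "long": -0.25},
    {"name": "Org B", "place": "London", "lat": 52.0, "long": -0.75},
    {"name": "Org C", "place": "Elsewhere", "lat": 55.0, "long": -3.0},
]

centroids = place_centroids(orgs)
```

A plain arithmetic mean of latitude and longitude is a reasonable approximation when all the organisations in a place are close together, as they are within a single town or city.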
We need to split /methodology into subpages as we did for /feedback and /info. Actual content will be provided by Liz and Sam in separate PRs.