nestauk / dsp_waifinder

This interactive map shows entities operating in the AI industry in the UK. Made in collaboration with UKRI.
https://waifinder.iuk.ktn-uk.org/
MIT License

ui v2: add `methodology` pages #182

Closed mindrones closed 1 year ago

mindrones commented 1 year ago

We need to split /methodology into subpages as we did for /feedback and /info.

Actual content will be provided by Liz and Sam in separate PRs

lizgzil commented 1 year ago

Organisation data Methods

The organisations included in the AI map come from three sources: organisations researching AI come from Gateway to Research, organisations that fund AI companies come from Crunchbase, and AI companies and incubators come from a proprietary dataset from Glass AI.

Gateway to Research

The Gateway to Research (GtR) data comes via Nesta's SQL database. Where possible, it is also supplemented by urls and organisation descriptions from the Crunchbase dataset (fields which are not available through GtR).

The first step is searching for projects with certain topic tags which we felt were relevant to AI, e.g. "Image & Vision Computing", "Robotics & Autonomy" and "Artificial Intelligence". We then find the complete set of organisations at which these projects took place, and filter these organisations with the following criteria:

  1. The organisation is in a predefined list of organisations - which is a combination of universities listed by HESA, the list of research institutes in the UKRI eligibility list and a list of research and technology organisations (RTOs) given to us by UKRI.
  2. The organisation received any amount of funding in the last 5 years
  3. The organisation has at least 400 projects OR it has had a total of at least £50 million in funding
  4. The organisation is in the UK
  5. The organisation has longitude/latitude data

This leaves us with research organisations which are large, relevant, and recent.
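The five criteria above can be read as a single predicate applied to each organisation. The following is a minimal sketch, not the project's pipeline code: the `Org` fields, the `passes_gtr_filter` name and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Org:
    # All field names are hypothetical; the real GtR schema may differ.
    name: str
    country: str
    n_projects: int
    total_funding_gbp: float
    funding_last_5_years_gbp: float
    lat: Optional[float]
    lon: Optional[float]


def passes_gtr_filter(org: Org, allowed_names: set) -> bool:
    """Apply the five criteria from the list above, in order."""
    return (
        org.name in allowed_names                          # 1. in the predefined list
        and org.funding_last_5_years_gbp > 0               # 2. any funding in last 5 years
        and (org.n_projects >= 400
             or org.total_funding_gbp >= 50_000_000)       # 3. >= 400 projects OR >= £50m
        and org.country == "UK"                            # 4. in the UK
        and org.lat is not None and org.lon is not None    # 5. has coordinates
    )


# Made-up sample data, for illustration only.
allowed = {"University A", "Institute B"}
orgs = [
    Org("University A", "UK", 450, 10e6, 1e6, 51.5, -0.1),  # kept: many projects
    Org("Institute B", "UK", 20, 60e6, 5e5, 53.4, -2.2),    # kept: large funding
    Org("Institute B", "UK", 20, 1e6, 5e5, 53.4, -2.2),     # dropped: too small
    Org("Company C", "UK", 500, 90e6, 1e6, 52.0, -1.0),     # dropped: not in list
]
kept = [o for o in orgs if passes_gtr_filter(o, allowed)]
```

Writing criterion 3 as an explicit `or` inside parentheses keeps it from being swallowed by the surrounding `and` chain.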

To supplement this data with urls and organisation descriptions, we query the Crunchbase dataset. If a Crunchbase organisation name is contained within a GtR organisation name, we add some of the Crunchbase data to the GtR record, namely the organisation description and the url. This is only possible for about 80% of the data points.
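The containment check described here could look something like the sketch below. It is an illustration under assumptions, not the actual implementation: the function name, the record keys and the case-insensitive comparison are all hypothetical choices, and the sample records are made up.

```python
def find_crunchbase_match(gtr_name, crunchbase_records):
    """Return the first Crunchbase record whose name appears inside gtr_name.

    Comparison is case-insensitive (an assumption; the source does not say).
    Empty Crunchbase names are skipped, since "" is contained in every string.
    """
    gtr_lower = gtr_name.lower()
    for record in crunchbase_records:
        cb_name = record["name"].strip().lower()
        if cb_name and cb_name in gtr_lower:
            return record
    return None


# Made-up records; keys are hypothetical.
crunchbase = [
    {"name": "", "description": "bad row", "url": None},
    {"name": "University of Example",
     "description": "A research university.",
     "url": "https://example.org"},
]

match = find_crunchbase_match("University of Example, School of Computing", crunchbase)
no_match = find_crunchbase_match("Another Institute", crunchbase)
```

A plain substring test is permissive: a short Crunchbase name can match unrelated GtR names, so matches like these may need a manual spot check.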

Crunchbase

We query the Crunchbase database via Nesta's SQL server. This data is used to find the investors of AI organisations.

We first find the organisations which are tagged with topics we felt were relevant to AI (e.g. "artificial intelligence", "augmented reality", "autonomous vehicles"). We then find all investors of these organisations, where each investor may have funded multiple AI organisations, and each AI organisation may have been funded by multiple investors. Thus, for each investor we have:

  1. The number of AI organisations they have funded
  2. The number of total organisations they have funded
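As a rough sketch of how these two counts can be derived from a many-to-many list of funding links: the tuple layout, the names and the data below are assumptions for illustration, and counting each organisation once per investor (rather than once per funding round) is also an assumption.

```python
from collections import defaultdict

# (investor, organisation, organisation_is_ai) links; values are made up.
links = [
    ("Investor X", "Org 1", True),
    ("Investor X", "Org 2", False),
    ("Investor X", "Org 1", True),   # second round in the same organisation
    ("Investor Y", "Org 1", True),
    ("Investor Y", "Org 3", True),
]

funded = defaultdict(set)     # investor -> all organisations funded
funded_ai = defaultdict(set)  # investor -> AI organisations funded
for investor, org, is_ai in links:
    funded[investor].add(org)
    if is_ai:
        funded_ai[investor].add(org)

# Sets deduplicate repeat investments, so each organisation counts once.
counts = {
    investor: {"n_ai": len(funded_ai[investor]), "n_total": len(orgs)}
    for investor, orgs in funded.items()
}
```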

We get the lat/long data (which Crunchbase doesn't have) for these investors using the NSPL postcode lookup.

We filter this data to only include key AI investors with the following criteria:

  1. At least 10% of the organisations they fund are AI organisations
  2. They have funded at least 10 organisations
  3. The investor's address is in the UK
  4. The "type" field for this investor is "organisation" (not "person")
  5. The investor has longitude/latitude data
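The investor criteria above can be sketched as one predicate. This is only an illustration: the dictionary keys (`n_ai`, `n_total`, `country`, `type`, `lat`, `lon`) and the sample investors are hypothetical, not the Crunchbase schema.

```python
def is_key_ai_investor(inv: dict) -> bool:
    """Apply the five investor criteria listed above."""
    return (
        inv["n_total"] >= 10                                   # 2. funded at least 10 orgs
        and inv["n_ai"] / inv["n_total"] >= 0.10               # 1. at least 10% are AI orgs
        and inv["country"] == "UK"                             # 3. UK address
        and inv["type"] == "organisation"                      # 4. not a person
        and inv["lat"] is not None and inv["lon"] is not None  # 5. has coordinates
    )


# Made-up investors for illustration.
base = {"country": "UK", "type": "organisation", "lat": 51.5, "lon": -0.1}
investors = {
    "A": {**base, "n_ai": 3, "n_total": 20},                   # 15% AI: kept
    "B": {**base, "n_ai": 1, "n_total": 20},                   # 5% AI: dropped
    "C": {**base, "n_ai": 5, "n_total": 5},                    # too few orgs: dropped
    "D": {**base, "n_ai": 3, "n_total": 20, "type": "person"}, # person: dropped
}
key_investors = [name for name, inv in investors.items() if is_key_ai_investor(inv)]
```

Checking `n_total >= 10` before computing the share also avoids dividing by zero for investors with no recorded investments.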

GlassAI

Our data for companies and incubators / accelerators comes from Glass AI. Through a process of scraping companies' websites and searching for AI-related keywords in the company descriptions, Glass AI provided us with a list of organisations.

If a company is also an incubator / accelerator then it is tagged as such in an 'is_incubator' field.

We get the lat/long data (which GlassAI didn't provide us with) for these companies using the NSPL postcode lookup.

The only filtering needed for this dataset was:

  1. The company has longitude/latitude data

Merging datasets

Dataset inputs:

| Dataset | Tag in final output | Has long/lat | Has city | Has postcode | Has description | Has link | Number of organisations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gateway to Research | University / RTO | Yes | 30% do | Yes | 78% do (via Crunchbase) | 78% do (via Crunchbase) | 158 |
| Crunchbase | Funder | Yes (via NSPL postcode lookup) | Yes | Yes | Yes | Yes | 329 |
| GlassAI | 'Company' and 'Incubator / accelerators' | Yes (via NSPL postcode lookup) | None | Yes | Yes | Yes | 2558 |

Merged dataset outputs:

Company Funder Incubator / accelerator University / RTO Number of data points
x 2484
x 319
x 151
x x 65
x x x 8
x x 1
x x 1
Total 3029

The three filtered datasets are concatenated, then organisation names are cleaned so that organisations appearing in more than one of the original datasets can be merged. For example, the company CodeBase is in both the GlassAI and Crunchbase datasets.

If there is duplication, we decide which row to keep based on the following order of trust (useful if there are conflicting Links or Lat/Long values):

  1. Trust Glass AI first - since several sources were considered to find Links and Lat/Long,
  2. then trust GtR - since Lat/Long was given in this data,
  3. lastly trust Crunchbase
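A minimal sketch of this priority rule is given below. It assumes rows are dictionaries with hypothetical `name`, `source`, `tags` and `link` keys, and that the cleaned name is simply lower-cased with whitespace collapsed; the real name cleaning is likely more involved. Combining the tags of duplicate rows is also an assumption, suggested by the multi-tag rows in the merged output table.

```python
# Lower number = more trusted, following the order in the list above.
SOURCE_PRIORITY = {"GlassAI": 0, "GtR": 1, "Crunchbase": 2}


def clean_name(name: str) -> str:
    """Hypothetical cleaning: lower-case and collapse whitespace."""
    return " ".join(name.lower().split())


def deduplicate(rows):
    """Keep one row per cleaned name, taken from the most trusted source."""
    merged = {}
    for row in rows:
        key = clean_name(row["name"])
        if key not in merged:
            merged[key] = {**row, "tags": set(row["tags"])}
            continue
        kept = merged[key]
        tags = kept["tags"] | set(row["tags"])  # union of tags from all sources
        if SOURCE_PRIORITY[row["source"]] < SOURCE_PRIORITY[kept["source"]]:
            kept = {**row}                      # the more trusted row wins
        kept["tags"] = tags
        merged[key] = kept
    return list(merged.values())


# Made-up rows with conflicting links.
rows = [
    {"name": "Example Org", "source": "Crunchbase", "tags": ["Funder"],
     "link": "https://example.org/from-crunchbase"},
    {"name": "example  org", "source": "GlassAI", "tags": ["Company"],
     "link": "https://example.org/from-glass"},
]
result = deduplicate(rows)
```

Here the GlassAI link is kept because GlassAI ranks first, while the organisation ends up tagged as both Company and Funder.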

Adding place information

We add the 'Place' field to any data points that don't have it (which is 70% of GtR data and all of the GlassAI data) by using the postcode or lat/long data. We do this using two methods:

  1. Query the postcode to get the city using the pgeocode python package. We found this data source to be quite unreliable (e.g. Dulwich came up as the city), and multiple city names can be given for the same postcode prefix. Thus, we only used its result if the city given was in a list of major cities (London, Manchester etc). We keep this step since it is fast and quickly resolves the easy cases.
  2. Query the lat/long coordinates to get the city/town using the geopy python package. This takes longer and provides us with city, town and village names.
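The two methods above amount to a fast lookup with a slower fallback. The sketch below shows only that control flow: the two lookups are passed in as plain functions so that it runs without network access, and the stand-ins used here are fakes. In practice they would presumably wrap pgeocode and geopy; the function names and the short `MAJOR_CITIES` set are assumptions.

```python
MAJOR_CITIES = {"London", "Manchester"}  # illustrative subset only


def find_place(postcode, lat, lon, postcode_lookup, reverse_geocode):
    """Try the fast postcode lookup first, then fall back to lat/long."""
    if postcode:
        city = postcode_lookup(postcode)
        # Only trust the postcode result if it names a major city.
        if city in MAJOR_CITIES:
            return city
    if lat is not None and lon is not None:
        return reverse_geocode(lat, lon)
    return None


# Fake lookups standing in for the real packages (hypothetical behaviour).
def fake_postcode_lookup(postcode):
    return {"M1": "Manchester", "SE21": "Dulwich"}.get(postcode.split()[0])


def fake_reverse_geocode(lat, lon):
    return "London"


fast_hit = find_place("M1 1AA", 53.48, -2.24, fake_postcode_lookup, fake_reverse_geocode)
fallback = find_place("SE21 7AA", 51.44, -0.08, fake_postcode_lookup, fake_reverse_geocode)
missing = find_place(None, None, None, fake_postcode_lookup, fake_reverse_geocode)
```

In the second call the postcode lookup returns "Dulwich", which is not in the list of major cities, so the result comes from the lat/long step instead.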

Some cleaning of the place name fields is also included (e.g. converting "London Borough of Camden" to "London"). For each unique place name, we add NUTS data using the nuts-finder python package and calculate the average lat/long coordinate across all the organisations in that place.
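The place cleaning and per-place averaging can be sketched as follows. The NUTS lookup is left out, the `PLACE_FIXES` mapping contains only the one example given above, and the record keys and sample coordinates are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Only the mapping mentioned in the text; a real list would be longer.
PLACE_FIXES = {"London Borough of Camden": "London"}


def clean_place(place: str) -> str:
    """Map known variants onto a canonical place name."""
    return PLACE_FIXES.get(place, place)


def place_centroids(orgs):
    """Average lat/long over all organisations sharing a cleaned place name."""
    coords = defaultdict(list)
    for org in orgs:
        coords[clean_place(org["place"])].append((org["lat"], org["lon"]))
    return {
        place: (mean(lat for lat, _ in pts), mean(lon for _, lon in pts))
        for place, pts in coords.items()
    }


# Made-up organisations.
orgs = [
    {"place": "London", "lat": 51.0, "lon": -0.25},
    {"place": "London Borough of Camden", "lat": 52.0, "lon": -0.75},
    {"place": "Manchester", "lat": 53.5, "lon": -2.25},
]
centroids = place_centroids(orgs)
```

A simple arithmetic mean of coordinates is adequate for points within one town or city; it is not a general-purpose geographic centroid.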