nestauk / dsp_waifinder

This interactive map shows entities operating in the AI industry in the UK. Made in collaboration with UKRI.
https://waifinder.iuk.ktn-uk.org/
MIT License

Incorrect place #155

Closed mindrones closed 2 years ago

mindrones commented 2 years ago

As of now, searching for Edinburgh shows University of St Andrews, which seems to have been tagged with the wrong place.

org:

"name": "University of St Andrews",
"place_id": "b688061d53ead1d46b7c38bdeef82ece38e48cfe",

place:

"b688061d53ead1d46b7c38bdeef82ece38e48cfe": {
      "centroid": {
          "lat": 55.95170131666667,
          "lon": -3.19572045
      },
      "id": "b688061d53ead1d46b7c38bdeef82ece38e48cfe",
      "name": "Edinburgh",
      "region_id": "UKM75",
      "region_name": "Edinburgh, City of",
      "type": "city"
},
Screenshot 2022-07-13 at 18 32 28 Screenshot 2022-07-13 at 18 32 41
lizgzil commented 2 years ago

This issue comes straight from the Crunchbase data.

Screenshot 2022-07-25 at 11 36 22

Our pipeline for the GtR data is:

  1. Get GtR org names
  2. Match these names to the crunchbase data to get City (25% get this added) and Postcode (17% get this added)
  3. For the remaining (majority of) orgs with no City or Postcode found, use the pgeocode and geopy packages to query lat/long -> City
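
A minimal sketch of the match-then-fall-back logic in the steps above, not the actual pipeline code. The function name `resolve_city`, the lower-cased name matching, and the injected `reverse_geocode` callable are all assumptions made for illustration; in the real pipeline the fallback would be backed by pgeocode/geopy rather than the stand-in used here.

```python
from typing import Callable, Dict, Optional, Tuple


def resolve_city(
    org_name: str,
    lat: float,
    lon: float,
    crunchbase_city: Dict[str, str],
    reverse_geocode: Callable[[float, float], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (city, source) for one GtR organisation (hypothetical helper)."""
    # Step 2: try a name match against the Crunchbase lookup first.
    city = crunchbase_city.get(org_name.strip().lower())
    if city:
        return city, "crunchbase"
    # Step 3: otherwise fall back to reverse geocoding the lat/long.
    return reverse_geocode(lat, lon), "geocoder"


# Illustrative stand-ins only: not real pipeline data, not a real geocoder call.
cb_lookup = {"newcastle university": "London"}  # the wrong city reported in this issue


def fake_geocoder(lat: float, lon: float) -> Optional[str]:
    return "Newcastle upon Tyne"


matched = resolve_city("Newcastle University", 54.979811, -1.6149, cb_lookup, fake_geocoder)
unmatched = resolve_city("Some Other Org", 54.979811, -1.6149, cb_lookup, fake_geocoder)
print(matched, unmatched)
```

Returning the source alongside the city makes it easy to see later which cities came from Crunchbase and which from the lat/long query.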

So the question is - do we want to distrust all crunchbase cities and get them via the pgeocode and geopy packages instead? Or is this a one-off anomaly?

mindrones commented 2 years ago

@lizgzil this one seems to have correct coordinates but wrong place:

Org:

{
    ...
    "location": { "lat": 54.979811, "lon": -1.6149, "postcode": "NE17RH" },
    "name": "Newcastle University",
    "place_id": "471aa94f1ff710bd9215bed03da9568ce4111446",
    ...
},

Place:

    {
      "centroid": { "lat": 51.51877671281464, "lon": -0.12040122025171625 },
      "id": "471aa94f1ff710bd9215bed03da9568ce4111446",
      "name": "London",
      "region_id": "UKI31",
      "region_name": "Camden and City of London",
      "type": "city"
    },
Screenshot 2022-08-03 at 17 40 02

Selecting London orgs will select it too:

Screenshot 2022-08-03 at 17 51 56
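
A quick way to catch this kind of mismatch automatically is to compare each org's own coordinates with the centroid of the place it is tagged with. The sketch below uses the haversine formula on the two records above; the 50 km threshold and the helper name `haversine_km` are assumptions for illustration, not something the project defines.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Records copied from this comment.
org = {"name": "Newcastle University", "location": {"lat": 54.979811, "lon": -1.6149}}
place = {"name": "London", "centroid": {"lat": 51.51877671281464, "lon": -0.12040122025171625}}

MAX_KM = 50.0  # hypothetical threshold, tune to taste

distance_km = haversine_km(
    org["location"]["lat"], org["location"]["lon"],
    place["centroid"]["lat"], place["centroid"]["lon"],
)
is_suspicious = distance_km > MAX_KM
print(f"{org['name']} -> {place['name']}: {distance_km:.0f} km, suspicious={is_suspicious}")
```

For this pair the distance comes out at roughly 400 km, so the org would be flagged.
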
lizgzil commented 2 years ago

@mindrones - strange! will look into it

mindrones commented 2 years ago

@lizgzil here's another one, tagged as Milton Keynes (which seems correct judging by their website) but apparently located near Gloucester

Screenshot 2022-08-09 at 19 12 38
lizgzil commented 2 years ago

The Newcastle one is also due to Crunchbase having the wrong city.

Screenshot 2022-08-10 at 09 49 57

lizgzil commented 2 years ago

And the last one is because GtR has two options for 'transport systems catapult', one of which is Milton Keynes and the other is this Gloucestershire lat/long. When we look for the city and postcode in Crunchbase, Crunchbase only has the Milton Keynes details.

Screenshot 2022-08-10 at 09 57 35 Screenshot 2022-08-10 at 09 59 40

lizgzil commented 2 years ago

Other weird ones I found:

  1. National Oceanography Centre: the lat/long is in Southampton, which is correct. However, there is also a site in Liverpool. GtR has multiple options, but only the Southampton one links to a location. The city of Liverpool was found in the Crunchbase (CB) data.
  2. Manufacturing Technology Centre: the lat/long is in Nottingham, but CB says Liverpool.
  3. Satellite Applications Catapult: the lat/long is near Oxford, but CB says Swindon.
  4. University of Central Lancashire: CB says Burnley.
  5. Ulster University: CB says Coleraine.

Two issues going on:

  1. Linking the GtR dataset to CB to get the city and postcode doesn't always work well since the CB data has errors.
  2. Sometimes organisations genuinely have more than one location, e.g. 'transport systems catapult'

Solutions:

  1. Since linking to CB doesn't even supplement the data that much, let's scrap this step, and simply query lat/long -> city, and perhaps lat/long-> postcode if we can.
  2. TBC (I'll investigate how much this happens first)
lizgzil commented 2 years ago

Update on that 'transport systems catapult' organisation: all the AI projects are actually assigned to the Cheltenham location (not the Milton Keynes one), so it wasn't that the pipeline just picked the first one; these projects genuinely happened in Cheltenham. It looks like this company perhaps used to be based there https://opengovuk.com/company/08041919

lizgzil commented 2 years ago

The only GtR organisation that makes it through the filtering process and has multiple lat/longs for the same org name is Canterbury Christ Church University:

Canterbury Christ Church University 51.276337 1.084818 United Kingdom
Canterbury Christ Church University 51.279643 1.089364 United Kingdom
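
For reference, a check like this can be sketched as a simple group-by on the org name. This is an assumption about how one might do it, not the code that was actually run; `orgs_with_multiple_locations` is a hypothetical helper, and the rows are the two above plus the Newcastle record from earlier in this thread.

```python
from collections import defaultdict

# (org name, lat, lon) rows taken from this thread.
rows = [
    ("Canterbury Christ Church University", 51.276337, 1.084818),
    ("Canterbury Christ Church University", 51.279643, 1.089364),
    ("Newcastle University", 54.979811, -1.6149),
]


def orgs_with_multiple_locations(rows):
    """Map each org name to its distinct (lat, lon) pairs, keeping only names with more than one."""
    coords = defaultdict(set)
    for name, lat, lon in rows:
        coords[name].add((lat, lon))
    return {name: sorted(points) for name, points in coords.items() if len(points) > 1}


duplicates = orgs_with_multiple_locations(rows)
for name, points in duplicates.items():
    print(name, points)
```
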
lizgzil commented 2 years ago

The new version of the data is as correct as it can be for these 3 organisations. Transport Systems Catapult is in Cheltenham, which isn't ideal (since googling shows it's not there anymore), but what can we do about this other than hard-code it? There isn't a bug in the data pipeline; the source just needs an update.

I performed three tests to look for other bugs.

1. NSPL

Check how far the lat/long is from the lat/long found using the postcode in the NSPL lookup. (Sometimes the lat/long was found using NSPL anyway, so this only tests the cases where it wasn't.) The two longest distances are between (52.758443, -1.248217) and (52.764828, -1.22952), and between (51.441434, -0.950085) and (51.457625, -0.945636), both a 5-10 min drive. So I feel quite confident that the lat/longs and postcodes match up.
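
To put a straight-line number on those two pairs, here is a rough sketch using an equirectangular approximation, which is adequate over distances this short. It is an assumption that a straight-line distance is a useful proxy here; the drive-time estimates above were not computed this way, and `approx_distance_km` is a name made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def approx_distance_km(a, b):
    """Equirectangular approximation of the distance between two (lat, lon) points, in km."""
    lat1, lon1 = a
    lat2, lon2 = b
    mean_lat = math.radians((lat1 + lat2) / 2)
    # Scale the longitude difference by cos(latitude) so both axes are in comparable units.
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(x, y)


# The two worst pairs quoted above: (pipeline lat/long, NSPL lat/long).
pairs = [
    ((52.758443, -1.248217), (52.764828, -1.22952)),
    ((51.441434, -0.950085), (51.457625, -0.945636)),
]

distances_km = [approx_distance_km(a, b) for a, b in pairs]
for (a, b), d in zip(pairs, distances_km):
    print(a, b, f"{d:.2f} km")
```

Both pairs come out at under 2 km in a straight line, which is consistent with a short drive.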

2. geographic_data

Check how far the lat/long is from the lat/long found for this city using the geographic_data SQL table. 32 were more than a 30 min to 1 hour drive apart. Looking into some of these, sometimes it's because the place is quite large, like "Hampshire", but sometimes it's because the place name isn't unique, e.g. "Milton": 'geographic_data' has this as a place in Scotland, but our data has companies listed as Milton that are from the Miltons near Oxford and Bristol.

In terms of users interacting, if they look for 'Milton' I guess we'd want to show any data from places called Milton, even though they are far apart.

Place names where the organisation lat/long is far from the place's lat/long: ['Milton', 'Burton', 'Hampshire', 'Gillingham', 'Chesham', 'Oxfordshire', 'Hilton', 'Chilton', 'Lightwater', 'Basildon', 'Stretton', 'Nottinghamshire', 'Bilston', 'Syston', 'Norton', 'Kent', 'Oakley', 'Melton', 'Ringstead', 'Rhondda']
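
One way to separate the "non-unique name" cases from the "large area" cases is to count how many distinct areas each place name appears in. The sketch below is only an illustration: the `(place name, area)` rows are a made-up structure based on the Milton example above, not the schema of the geographic_data table.

```python
from collections import defaultdict

# Hypothetical (place name, area) rows, loosely based on the Milton example above.
place_rows = [
    ("Milton", "Scotland"),
    ("Milton", "near Oxford"),
    ("Milton", "near Bristol"),
    ("Edinburgh", "Edinburgh, City of"),
]


def ambiguous_place_names(rows):
    """Return place names that occur in more than one distinct area."""
    areas = defaultdict(set)
    for name, area in rows:
        areas[name].add(area)
    return sorted(name for name, seen in areas.items() if len(seen) > 1)


ambiguous = ambiguous_place_names(place_rows)
print(ambiguous)
```

Names flagged this way could still be shown together when a user searches for "Milton", while being excluded from the distance-based sanity check.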

3. Bing API

I queried the Bing API using the postcode and searched for the city name in the Bing output. 93% of the time the city name was somewhere in this output (an exact match). Looking at the ones which didn't match, they were really close, e.g.

Lots of these places that didn't match were quite small (I hadn't heard of them), although there were quite a few times we said "London" but Bing said "'W1G 7AJ, West End, Hertfordshire", but I'm pretty sure London is correct here.
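
The matching step itself can be sketched as a case-insensitive substring test over the returned text. This treats the Bing output as a plain string, which is an assumption (the actual response format isn't shown here); the first pair is the London example above, and the second pair is entirely made up to show a hit.

```python
def city_in_output(city: str, geocoder_output: str) -> bool:
    """True if the city name appears anywhere in the output, ignoring case."""
    return city.casefold() in geocoder_output.casefold()


def match_rate(pairs) -> float:
    """Fraction of (city, output) pairs where the city name is found in the output."""
    if not pairs:
        return 0.0
    hits = sum(city_in_output(city, output) for city, output in pairs)
    return hits / len(pairs)


pairs = [
    ("London", "W1G 7AJ, West End, Hertfordshire"),  # example from this comment: no match
    ("Edinburgh", "XX0 0XX, Edinburgh, Scotland"),   # made-up output, for illustration only
]

rate = match_rate(pairs)
print(f"{rate:.0%} of city names found in the output")
```

A substring test like this counts the London case as a miss even though the city is probably right, so the real agreement rate is likely somewhat higher than the exact-match figure.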

lizgzil commented 2 years ago

closed in #149