hmrc / eat-out-to-help-out-establishments

Apache License 2.0

Data quality #18

Open exussum12 opened 4 years ago

exussum12 commented 4 years ago

There appear to be some places with issues in their location data.

E.g. Concept catering c/o Chester golf club,CH2 8AR

Should be CH4 8AR

Bijou by the Sea,AB56 5DJ

Should be AB56 4DJ

There look to be around 50 places where the postcode is incorrect.
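A minimal sketch of how single-character typos like the two above could be flagged, assuming a set of known-valid postcodes is already loaded. The function names are hypothetical, and the small `valid` set is a toy stand-in for a real lookup (`CH4 8AS` is filler added for illustration only).

```python
def one_char_apart(a: str, b: str) -> bool:
    """True when a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def near_misses(postcode: str, valid_postcodes: set[str]) -> list[str]:
    """Return valid postcodes that are one substituted character away."""
    return sorted(p for p in valid_postcodes if one_char_apart(postcode, p))


# Toy lookup: the two corrections reported in this issue plus one filler entry.
valid = {"CH4 8AR", "AB56 4DJ", "CH4 8AS"}

print(near_misses("CH2 8AR", valid))
print(near_misses("AB56 5DJ", valid))
```

Against a full national lookup a check like this would likely return several candidates per postcode, so it can only shortlist suggestions; the address would still need to be cross-checked by hand.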

barrychocolate commented 4 years ago

I have 96 establishments where I can't match the postcode to the National Statistics Postcode Lookup.
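For anyone wanting to reproduce a count like this, here is a rough sketch of the matching step. It assumes the establishments CSV has `Name` and `Postcode` columns (an assumption, not checked against the published file) and that the lookup has been reduced to a set of postcodes in the usual "outward inward" form. The second sample row is made up.

```python
import csv
import io
import re


def normalise_postcode(raw: str) -> str:
    """Upper-case, drop all whitespace, then put one space before the 3-character inward code."""
    compact = re.sub(r"\s+", "", raw).upper()
    if len(compact) < 5:
        return compact  # too short to be a full postcode; leave it for manual review
    return f"{compact[:-3]} {compact[-3:]}"


def unmatched_rows(establishments_csv: str, valid_postcodes: set[str]) -> list[dict]:
    """Return the rows whose normalised postcode is not in the lookup set."""
    reader = csv.DictReader(io.StringIO(establishments_csv))
    return [row for row in reader
            if normalise_postcode(row["Postcode"]) not in valid_postcodes]


# Toy data: first row is taken from this issue, second row is hypothetical.
sample = """Name,Postcode
Concept catering c/o Chester golf club,CH2 8AR
Example Cafe,ab564dj
"""
valid = {"CH4 8AR", "AB56 4DJ"}

for row in unmatched_rows(sample, valid):
    print(row["Name"], row["Postcode"])
```

Normalising before comparing matters because manually entered postcodes can differ in case and spacing without actually being wrong.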

kmpoppe commented 4 years ago

@exussum12 Those entries quite clearly seem to be typos (I always wonder how that can happen in a digital process, but oh well 🤷).

@barrychocolate aren't there establishments that have "nice" postcodes that work with Royal Mail but aren't listed in the NSPL because they wouldn't normally exist?

exussum12 commented 4 years ago

@kmpoppe I agree. I wasn't sure if they can be fixed, though (sending a PR doesn't seem like it would help, as the CSV is likely the output of some other data rather than a true master source).

Some other missing postcodes actually do seem legit. For example, some of the Trafford Centre in Manchester, from memory, had a postcode I could cross-reference with Google but that didn't exist in the ONS postcode data.

barrychocolate commented 4 years ago

> @barrychocolate aren't there establishments that have "nice" postcodes that work with Royal Mail but aren't listed in the NSPL because they wouldn't normally exist?

It seems that way. I will try using OS Codepoint and see if that is any better.

Also, the reason for the typos is that, while an address lookup facility is available for those registering for the scheme, the user can also manually enter or modify an address. I suspect this is the reason for some of the data quality issues we are seeing.

kmpoppe commented 4 years ago

So, I've been fiddling around with the data a lot: there are around 450 establishments with postcodes that are invalid according to Codepoint Open. I'll go ahead and aggregate all the stuff we've got here; once #3 gets resolved in a fashion that makes it Crown Copyright or anything, I can make my crunching public ;)

barrychocolate commented 4 years ago

I tried Codepoint Open, but its biggest drawback is that it only covers GB. There are 423 establishments with a Northern Irish BT postcode that Codepoint won't match.
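One way to keep the BT entries from inflating the "invalid" count is to separate them out before matching against a GB-only dataset. A small sketch, with hypothetical function names and an illustrative BT postcode:

```python
import re


def postcode_area(postcode: str) -> str:
    """Return the leading letters of a postcode (its area), e.g. 'BT' for 'BT1 1AA'."""
    match = re.match(r"[A-Za-z]+", postcode.strip())
    return match.group(0).upper() if match else ""


def split_by_coverage(postcodes):
    """Split postcodes into (GB candidates, Northern Irish BT postcodes)."""
    gb, ni = [], []
    for pc in postcodes:
        (ni if postcode_area(pc) == "BT" else gb).append(pc)
    return gb, ni


# 'bt1 1aa' is only an illustrative value, not taken from the dataset.
gb, ni = split_by_coverage(["CH4 8AR", "bt1 1aa", "AB56 4DJ"])
print(len(gb), "to check against the GB-only data;", len(ni), "need another source")
```

The BT group can then be checked against a source that does cover Northern Ireland instead of being reported as unmatched.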

When I used the Office for National Statistics Postcode Directory (which includes terminated postcodes that businesses may still use), the match rate was better, with only 98 unmatched, so that is what I have stuck with for my project. Providing the UPRNs (where they have them) would likely resolve some of these unmatched entries.
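A rough sketch of how matches against a directory that keeps terminated postcodes could be classified. The column names `pcds` (postcode) and `doterm` (termination date, blank while the postcode is live) are assumptions about the directory layout and should be checked against its user guide; the sample rows and date value are made up.

```python
import csv
import io


def load_directory(directory_csv: str) -> dict[str, str]:
    """Map each postcode to its termination date ('' if it is still live)."""
    reader = csv.DictReader(io.StringIO(directory_csv))
    return {row["pcds"]: row["doterm"].strip() for row in reader}


def classify(postcode: str, directory: dict[str, str]) -> str:
    """Label a postcode as 'live', 'terminated', or 'unmatched'."""
    if postcode not in directory:
        return "unmatched"
    return "terminated" if directory[postcode] else "live"


# Hypothetical extract; column names are assumed, values are illustrative.
directory_sample = """pcds,doterm
CH4 8AR,
ZZ9 9ZZ,201506
"""
directory = load_directory(directory_sample)

for pc in ["CH4 8AR", "ZZ9 9ZZ", "CH2 8AR"]:
    print(pc, classify(pc, directory))
```

Keeping "terminated" as its own label makes it possible to accept those matches while still listing them separately for review.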

kmpoppe commented 4 years ago

@barrychocolate I've set up a MongoDB Cloud with the data. Would you like to work on that with what you have? Feel free to contact me directly via Twitter or Telegram, see here :-) Kai