our data being used to populate a Wikidata dataset
Linking to https://github.com/mysociety/alaveteli/issues/6535.
Some feedback from a former colleague:
I pointed a colleague at the all-authorities.csv file on WDTK (linked from https://www.whatdotheyknow.com/help/api; takes a while to download), and they noticed that there are a lot of repeated and obviously-not-right homepages for the authorities listed. e.g. Wigton Town Council (https://www.whatdotheyknow.com/body/wigton_town_council) is listed as having a homepage of http://btinternet.com/.
Perhaps calculated homepages should be excluded from the CSV download?
If a calculated homepage is in the CSV, and the CSV is downloaded and re-uploaded, the calculated homepage would become a normal entry in the homepage field.
Also, when this issue is fixed, any homepages set to e.g. gmail.com, btinternet.com, etc. should be removed. There might not be many (or any) manually set to such domains; if there are any, they probably arose via the spreadsheet download/upload mechanism described in the previous paragraph.
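As a rough illustration of that clean-up, the sketch below blanks the homepage of any CSV row whose homepage host is a known free or ISP email provider. The column name `home_page`, the `PROVIDER_DOMAINS` list, and both function names are assumptions made for the example; they are not taken from the real all-authorities.csv or from Alaveteli.

```python
import csv
import io
from urllib.parse import urlparse

# Assumed, incomplete list of free/ISP email provider domains.
PROVIDER_DOMAINS = {"gmail.com", "hotmail.com", "outlook.com", "btinternet.com"}


def is_provider_homepage(url):
    """Return True if the URL's host is one of the provider domains."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in PROVIDER_DOMAINS


def blank_provider_homepages(csv_text, column="home_page"):
    """Blank the homepage of rows pointing at a provider domain.

    `column` is a hypothetical column name. Returns the cleaned CSV text
    and the number of rows that were changed.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    cleared = 0
    for row in reader:
        if is_provider_homepage(row[column]):
            row[column] = ""
            cleared += 1
        writer.writerow(row)
    return out.getvalue(), cleared


sample = (
    "name,home_page\n"
    "Wigton Town Council,http://btinternet.com/\n"
    "Example Council,https://www.example.gov.uk/\n"
)
cleaned, cleared = blank_provider_homepages(sample)
```

URLs without a scheme (for example a bare `btinternet.com`) have no parsed host here and would be left untouched, so a real clean-up would probably need to normalise those first.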
I think we should only calculate a homepage as a default value in the form, which could then be deleted from the field should it not look sensible, rather than the current system that dynamically generates it.
We now list over 8000 parish councils on WDTK. Many (1000s) have gmail, hotmail, outlook, btinternet, etc. email addresses.
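A minimal sketch of that idea, assuming the default is derived from the domain of the authority's request email: suggest a homepage only when the domain is not a known free or ISP email provider, and otherwise leave the form field empty. The function name `suggested_homepage`, the `http://<domain>/` URL shape, and the provider list are all hypothetical and do not reflect Alaveteli's actual code.

```python
# Assumed, incomplete list of free/ISP email provider domains.
PROVIDER_DOMAINS = frozenset(
    {"gmail.com", "hotmail.com", "outlook.com", "btinternet.com"}
)


def suggested_homepage(request_email):
    """Return a default homepage for the form, or None if none is sensible.

    The value is only a suggestion for pre-filling the field; nothing is
    generated dynamically when the authority is displayed or exported.
    """
    _, sep, domain = request_email.strip().rpartition("@")
    domain = domain.lower()
    if not sep or not domain or domain in PROVIDER_DOMAINS:
        return None
    return f"http://{domain}/"
```

With this shape, an address such as `clerk@btinternet.com` yields no suggestion, while `foi@example.gov.uk` yields `http://example.gov.uk/`, which an admin can still edit or delete before saving.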
Originally posted by @RichardTaylor in https://github.com/mysociety/alaveteli/issues/427#issuecomment-892806347
We did some rough statistics on this issue on https://github.com/mysociety/whatdotheyknow-theme/issues/690 previously - it is certainly causing a data quality issue in the UK, where we have a rather surprising number of public bodies which rely on either free or ISP-provided email addresses!
Naturally, we should be surprised that our public bodies are conducting official business using free(mium) email products which are unlikely to meet the standards required by public records legislation, but at the same time, we should ensure our software doesn't unintentionally mislead people by sending them in the wrong direction.
This data quality issue creates problems for re-users of our data, such as the use case which prompted the WDTK ticket (our data being used to populate a Wikidata dataset). Given that the prevalence of non-official email addresses for public bodies is unlikely to be a UK-specific 'feature', I'd suggest it creates issues for re-users of data on other Alaveteli websites as well.