mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Prevent 'calculated homepage' from being generated for certain domains #6434

Closed mdeuk closed 1 year ago

mdeuk commented 3 years ago

Originally posted by @RichardTaylor in https://github.com/mysociety/alaveteli/issues/427#issuecomment-892806347

On a very closely related point, we should stop generating calculated home pages for domains such as googlemail.com.

Do we need a list of exceptions?

This could be assembled from a list of the most common domains used in request addresses. Presumably, after excluding .ac.uk / .nhs.uk / .gov.uk, then hotmail / aol / gmail / outlook would come top, and we could treat those specially?
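As a rough illustration of how such a list might be assembled, the sketch below counts the domains of a set of request addresses and skips the official-looking suffixes mentioned above. It is written in Python for brevity rather than as Alaveteli code; `OFFICIAL_SUFFIXES`, `email_domain` and `common_unofficial_domains` are hypothetical names, and the input is assumed to be a plain list of email address strings.

```python
from collections import Counter

# Assumption: suffixes treated as "official" and left out of the count.
OFFICIAL_SUFFIXES = (".ac.uk", ".nhs.uk", ".gov.uk")


def email_domain(address):
    """Return the lower-cased domain of an email address, or None if there is none."""
    _, sep, domain = address.strip().rpartition("@")
    domain = domain.lower()
    return domain if sep and domain else None


def common_unofficial_domains(addresses, top=10):
    """Count request-address domains, skipping official-looking suffixes."""
    counts = Counter()
    for address in addresses:
        domain = email_domain(address)
        if domain is None or domain.endswith(OFFICIAL_SUFFIXES):
            continue
        counts[domain] += 1
    # Most frequent first: these are the candidates for an exception list.
    return counts.most_common(top)
```

The output is only a candidate list; a person would still need to review it, since a frequently used domain is not necessarily a free or ISP-provided one.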

We previously gathered some rough statistics on this in https://github.com/mysociety/whatdotheyknow-theme/issues/690. It is certainly causing a data quality issue in the UK, where a rather surprising number of public bodies rely on either free or ISP-provided email addresses!

Naturally, we should be surprised that our public bodies are conducting official business using free(mium) email products which are unlikely to meet the standards required by public records legislation, but at the same time, we should ensure our software doesn't unintentionally mislead people by sending them in the wrong direction.

This data quality issue causes problems for re-users of our data, such as the use case which prompted the WDTK ticket, where our data was being used to populate a Wikidata dataset. Given that the prevalence of non-official email addresses for public bodies is unlikely to be a UK-specific 'feature', I'd suggest it causes problems for re-users of data on other Alaveteli websites as well.
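A minimal sketch of the behaviour being asked for, in Python rather than Alaveteli's own code. It assumes the calculated homepage is derived from the domain of the request email whenever no homepage has been set, and that the result takes the form `http://www.<domain>`; both are assumptions for illustration. `EXCLUDED_DOMAINS` and `calculated_home_page` are hypothetical names, and the domain list is an example, not a proposed final list.

```python
# Assumption: an illustrative exception list of free / ISP mail domains.
EXCLUDED_DOMAINS = {
    "gmail.com",
    "googlemail.com",
    "hotmail.com",
    "aol.com",
    "outlook.com",
    "btinternet.com",
}


def calculated_home_page(stored_home_page, request_email):
    """Return the stored homepage, a guessed one, or None.

    A guess is only made when the request email's domain is not on the
    exception list, so gmail.com and similar never become a "homepage".
    """
    if stored_home_page:
        return stored_home_page

    _, sep, domain = (request_email or "").strip().rpartition("@")
    domain = domain.lower()
    if not sep or not domain or domain in EXCLUDED_DOMAINS:
        return None

    # Assumption: the guessed URL format; the real format may differ.
    return f"http://www.{domain}"
```

With this guard, a body whose only contact address is on a free mail provider simply has no homepage, rather than one pointing at the mail provider.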

garethrees commented 3 years ago

our data being used to populate a Wikidata dataset

Linking to https://github.com/mysociety/alaveteli/issues/6535.

garethrees commented 3 years ago

Some feedback from a former colleague:

I pointed a colleague at the all-authorities.csv file on WDTK (linked from https://www.whatdotheyknow.com/help/api; takes a while to download), and they noticed that there are a lot of repeated and obviously-not-right homepages for the authorities listed. For example, Wigton Town Council (https://www.whatdotheyknow.com/body/wigton_town_council) is listed as having a homepage of http://btinternet.com/.

RichardTaylor commented 3 years ago

Perhaps calculated homepages should be excluded from the CSV download?

If a calculated homepage is in the CSV, and the CSV gets downloaded and re-uploaded, the calculated homepage would become a normal entry in the homepage field.
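One way to avoid that round trip, sketched in Python under stated assumptions: the export writes only the homepage that was actually stored, and leaves the column empty when there is none, so nothing calculated can be re-imported. The column names and the dict-based `bodies` input are hypothetical and do not reflect the real all-authorities.csv layout.

```python
import csv
import io


def export_bodies_csv(bodies):
    """Write bodies to CSV using only the stored homepage value.

    `bodies` is an iterable of dicts with 'name', 'request_email' and
    'home_page' keys (hypothetical shape). A missing or empty stored
    homepage is exported as an empty cell, never as a calculated guess.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "request_email", "home_page"])
    for body in bodies:
        stored = body.get("home_page") or ""
        writer.writerow([body["name"], body["request_email"], stored])
    return out.getvalue()
```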

Also, when this issue is fixed, any homepages set to e.g. gmail.com, btinternet.com, etc. should be removed. There might not be many (or any) manually set to such domains; if there are any, they probably arose via the spreadsheet download/upload mechanism described in the previous paragraph.
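A sketch of how stored homepages on such domains might be found for review, again in Python and not as Alaveteli code. The dict-based `bodies` input and the function names are hypothetical; the exception list is passed in by the caller, and a leading `www.` is ignored when comparing hosts.

```python
from urllib.parse import urlsplit


def home_page_host(url):
    """Return the lower-cased host of a homepage URL without a leading 'www.'."""
    # Tolerate values stored without a scheme, e.g. "www.example.org".
    candidate = url if "://" in url else "http://" + url
    host = urlsplit(candidate).hostname
    if host is None:
        return None
    return host[4:] if host.startswith("www.") else host


def bodies_needing_cleanup(bodies, excluded_domains):
    """Return the bodies whose stored homepage points at an excluded domain."""
    return [
        body
        for body in bodies
        if body.get("home_page")
        and home_page_host(body["home_page"]) in excluded_domains
    ]
```

This only reports candidates; whether to blank the field automatically or review each one by hand is a separate decision.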

garethrees commented 2 years ago

I think we should only calculate a homepage as a default value in the form, which could then be deleted from the field if it doesn't look sensible, rather than the current system that generates it dynamically.

garethrees commented 2 years ago

We now list over 8,000 parish councils on WDTK. Many (1000s) have gmail, hotmail, outlook, btinternet, etc. email addresses.