mysociety / whatdotheyknow-theme

The Alaveteli theme for WhatDoTheyKnow (UK)
http://www.whatdotheyknow.com/
MIT License

Add conducting relevant and proportionate searches of WhatDoTheyKnow.com to the SAR Process Document #1061

Closed RichardTaylor closed 1 year ago

RichardTaylor commented 2 years ago

The subject access request process document as linked from

https://wdtkwiki.mysociety.org/wiki/Subject_Access_Requests

currently prompts us to search the site's user database (presumably using the email address of the data subject as a key).

It doesn't prompt to do any searches of request/response/correspondence etc. content on the site.

I just processed a SAR request where, given the context, I did search the site, both via Google and via a site search.

Our approach is that policy is only a guide and we always consider the specifics of a particular case, so we could leave the policy as it is and rely on that case-by-case judgement to cover searches of the site content.

mdeuk commented 2 years ago

> I just processed a SAR request where, given the context, I did search the site, both via Google and via a site search.

We generally shouldn't need to use a third party service to tell us what data we hold.

In general terms, was there an issue related to the site search engine, or just the feeling that a different search system might produce different data?

RichardTaylor commented 2 years ago

The internal site search is generally poor https://github.com/mysociety/alaveteli/issues/1179

Also, Google does a more in-depth search. For example, I understand it searches PDFs based on optical character recognition.

On a related point, neither the site search nor a Google search would capture the content of embargoed requests. I suspect developer assistance would be required to search embargoed requests. That may be disproportionate for subject access requests where we have no reason to believe relevant information is contained within embargoed material.
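A minimal sketch of the gap described above, assuming a hypothetical `Record` type with an `embargoed` flag (none of these names come from Alaveteli): a public-facing search skips embargoed material unless someone with privileged access opts in.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a request, response or piece of correspondence."""
    ref: str         # illustrative identifier, not a real site URL
    text: str        # searchable body text
    embargoed: bool  # True if the record is hidden from public search


def search_records(records, terms, include_embargoed=False):
    """Return refs of records whose text contains any of the search terms."""
    needles = [t.casefold() for t in terms if t.strip()]
    hits = []
    for record in records:
        # Public searches (site search, Google) never see embargoed records.
        if record.embargoed and not include_embargoed:
            continue
        haystack = record.text.casefold()
        if any(needle in haystack for needle in needles):
            hits.append(record.ref)
    return hits


records = [
    Record("request-a", "Letter mentioning Jane Example", embargoed=False),
    Record("request-b", "Reply to jane@example.org", embargoed=True),
    Record("request-c", "Unrelated correspondence", embargoed=False),
]
terms = ["Jane Example", "jane@example.org"]

public_hits = search_records(records, terms)
full_hits = search_records(records, terms, include_embargoed=True)
```

Here `public_hits` would miss `request-b`, which is the case where developer assistance (the `include_embargoed=True` path in this sketch) would be needed.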

mdeuk commented 2 years ago

> The internal site search is generally poor mysociety/alaveteli#1179

Apologies, that was actually the intent of my comment - if we have to use a third party service to identify data we hold, then that does highlight a potential problem.

> Also, Google does a more in-depth search. For example, I understand it searches PDFs based on optical character recognition.

I seem to recall that this is based on Tesseract. I wonder if we could use similar functionality, perhaps via Xapian Omega, to help us better understand some of the metadata we have. There are other tools, such as Apache Tika, but these would add dependencies.
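As a rough illustration of OCR-based searching, here is a sketch that assumes the `pdftoppm` (Poppler) and `tesseract` command-line tools are installed. It is not how Google or Alaveteli does this, and the function names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def ocr_pdf(pdf_path):
    """Return a list of OCR'd text, one entry per page of the PDF.

    Assumes the pdftoppm and tesseract binaries are on PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Render each page to a PNG image at 300 dpi.
        subprocess.run(
            ["pdftoppm", "-r", "300", "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        pages = []
        for image in sorted(Path(tmp).glob("page-*.png")):
            # "stdout" as the output base makes tesseract print the text.
            result = subprocess.run(
                ["tesseract", str(image), "stdout"],
                check=True,
                capture_output=True,
                text=True,
            )
            pages.append(result.stdout)
        return pages


def pages_mentioning(pages, term):
    """Return 1-based page numbers whose text contains the term."""
    needle = term.casefold()
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.casefold()
    ]
```

The extracted page text could then be fed into whatever index the site search uses. OCR output is noisy, so a plain substring match like this would miss misread names.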

There would, of course, be other benefits to a better analysis of metadata - which could include being more readily able to identify open data via algorithms, and potentially identify 'problems' using machine learning.

> On a related point, neither the site search nor a Google search would capture the content of embargoed requests. I suspect developer assistance would be required to search embargoed requests. That may be disproportionate for subject access requests where we have no reason to believe relevant information is contained within embargoed material.

This is, indeed, a potential issue that we'd need to consider on a case by case basis.

HelenWDTK commented 1 year ago

Closing because we don't routinely search all request/response/correspondence content, as that is already publicly available to the data subject.