Closed RichardTaylor closed 1 year ago
I just processed a SAR request where, given the context, I did search the site, both via Google and via a site search.
We generally shouldn't need to use a third party service to tell us what data we hold.
In general terms, was there an issue related to the site search engine, or just the feeling that a different search system might produce different data?
The internal site search is generally poor https://github.com/mysociety/alaveteli/issues/1179
Also, Google does a more in-depth search. For example, I understand it searches PDFs based on optical character recognition.
On a related point, neither the site search nor a Google search would capture the content of embargoed requests. I suspect developer assistance would be required to search embargoed requests. That may be disproportionate for subject access requests where we have no reason to believe relevant information is contained within embargoed material.
The internal site search is generally poor mysociety/alaveteli#1179
Apologies, that was actually the intent of my comment - if we have to use a third party service to identify data we hold, then that does highlight a potential problem.
Also, Google does a more in-depth search. For example, I understand it searches PDFs based on optical character recognition.
I seem to recall that this is based on Tesseract. I wonder if we could use similar functionality, perhaps via Xapian Omega, to help us better understand some of the metadata we have. There are other tools, such as Apache Tika, but these would add dependencies.
There would, of course, be other benefits to a better analysis of metadata - which could include being more readily able to identify open data via algorithms, and potentially identify 'problems' using machine learning.
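To make the idea concrete, here is a minimal sketch of searching already-extracted text for a data subject's identifiers without relying on a third-party service. It assumes the text of each attachment has already been extracted by some OCR or text-extraction step (not shown), and the names `normalise` and `find_mentions` are hypothetical illustrations, not part of Alaveteli.

```python
import re


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so OCR line breaks don't hide matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_mentions(documents: dict[str, str], identifiers: list[str]) -> dict[str, list[str]]:
    """Return {document_id: [identifiers found]} for documents mentioning any identifier.

    `documents` maps a (hypothetical) document id to its extracted text.
    `identifiers` are strings such as a name or email address of the data subject.
    """
    wanted = [(ident, normalise(ident)) for ident in identifiers if ident.strip()]
    hits: dict[str, list[str]] = {}
    for doc_id, text in documents.items():
        haystack = normalise(text)
        found = [ident for ident, needle in wanted if needle in haystack]
        if found:
            hits[doc_id] = found
    return hits
```

This is plain substring matching, so it would miss OCR misreadings and spelling variants. A real implementation would more likely build on the existing search index.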
On a related point, neither the site search nor a Google search would capture the content of embargoed requests. I suspect developer assistance would be required to search embargoed requests. That may be disproportionate for subject access requests where we have no reason to believe relevant information is contained within embargoed material.
This is, indeed, a potential issue that we'd need to consider on a case-by-case basis.
Closing because we don't routinely search all request/response/correspondence as that is already publicly available to the data subject.
The subject access request process document, as linked from https://wdtkwiki.mysociety.org/wiki/Subject_Access_Requests, currently prompts us to do a search of the site's user database (presumably using the email address of the data subject as a key).
It doesn't prompt to do any searches of request/response/correspondence etc. content on the site.
I just processed a SAR request where, given the context, I did search the site, both via Google and via a site search.
Our approach is that policy is only a guide and that we always consider the specifics of a particular case, so we could rely on that approach to cover searches of the site content.