Consider practicality of generating alerts when Excel files containing hidden data are released

Replaces #2663, which was deleted in error:

RichardTaylor commented on 15 Jul 2015: There have been a number of cases on WhatDoTheyKnow of the accidental release of large amounts of sensitive personal information by public bodies who have not realised information was present, but hidden, in Excel documents released ( Blog post )

The suggestion of a technical approach to assist in identifying such occurrences has been made.

If it's not technically possible to actually detect if hidden data is present, perhaps potentially problematic files, could be flagged. One idea is to alert administrators and the public body concerned when an Excel document over a certain size (by file size, rows, or sheets) is included as an attachment to a response sent to an Alaveteli site. The message to the public body could say something along the lines of:

~"are you sure you meant to send x MB of data in an Excel file in response to [request subject]. There have been instances of public bodies accidentally releasing information in response to requests. Please see our detailed guidance at ... and contact us via xxxx if you are concerned you may have accidentally released more information than intended".

Perhaps the ability to search for large attachments of a certain file type be generally useful anyway?

FOIMonkey commented on 15 Jul 2015: Most breaches occur when the following happens:

Data is not hidden in any way.
Data is on a hidden spreadsheet.
Data is in hidden rows/columns.
Data is cached when the authority creates a graph/chart
Data is cached when the authority creates a pivot table
Data is cached when the authority imports data from an external file.
Xapian already happily looking inside xls and certain metadata is already discoverable by the site search eg pivot tables. It does not cope with xlsx files very well.

The easiest thing to look for is file size. Most incidents involve spreadsheets >1mb (though there have been one or two smaller than that). Of course there are also a lot of legitimate releases that are that size. As of 19 May there were 1152 spreadsheets of 1mb or larger on the site, most of these are large datasets or badly formatted excel files containing several million empty rows. The largest is 40.3mb.

RichardTaylor commented on 16 Jul 2015: Perhaps a good question to ask would be if there are any additional features which @FOIMonkey can identify that would make it easier for them (and others) to find these problematic documents.

FOIMonkey commented on 16 Jul 2015: Number one would be the ability to search via filesize.

skenaja commented on 16 Jul 2015: For xlsx (which is a effectively zipped collection of xml files), these are some of the ways to find out if data has been hidden or cached:

named ranges are defined in the xl\workbook.xml file, in \workbook\definedNames section
pivot table caches are present in xl\pivotCache*
hidden sheets are in xl\workbook.xml file, in \workbook\sheets\sheet, state="hidden" attribute
chart cached data is in xl\charts\chart*.xml in or elements
hidden columns are in xl\worksheets\sheet*.xml in \worksheet\cols\col, hidden="1" attribute
hidden rows are in xl\worksheets\sheet*.xml in \worksheet\sheetdata\rows, hidden="1" attribute

On 7 September 2022, @garethrees wrote: I think programatically identifying these seems feasible, whether that's heuristically by large file size or more by deeper inspections of sheets and cells.

What I don't think we know yet is what we want to happen when we find one.

We could tag the request (or message or attachment, though applying tags at this level doesn't currently do
much) so that we could subscribe to search alerts
We could display some sort of "requester only" UI that prompts them to alert us – or the authority – if there's an
issue
We could have it automatically email us from the app
We could have it automatically email the authority with a "you should double check this" message
Or some combination of these

On 7 September 2022, @FOIMonkey wrote:

We could have it automatically email the authority with a "you should double check this" message

I'd go for this one. It puts the onus back on them to check the responses.

If we were to be alerted, there would be false positives from genuine large releases, but these are often interesting/good site use, so this might be a good thing. I don't think we want to get into anything that's too proactive in terms of checking however. I did do a manual trawl of all the spreadsheets over 1mb in size many years back after @crowbot pulled a list of these off the system for me.

In terms of looking for other technical indicators, the system is able to look inside xls files and adds the hidden data and signs of it to the search index. This doesn't happen with xlsx files because of the way these are structured. csv files are usually fine, and we have so few ods files that we don't really need to worry about those.

On 25 January 2023, @garethrees wrote A couple of things to consider:

Right now I don't think we extract attachments until someone views the request page. This would mean in extreme cases the notification might be days, weeks or perhaps months after the response was sent. I don't know how much of an issue this actually would be in that search indexing would likely trigger extraction, but it's something we'd want to check.
This likely needs #1240 – I don't think we want to extract attachments and then parse excel files in the response ingestion cycle – and definitely not when a user first loads the page with the new response – as that would be quite resource intensive.
Do we need to consider both xls and xlsx? If we're not getting sent many xls files these days then it's probably not worth any extra effort to handle them – will want to graph the distribution of these is per year.

https://spin.atomicobject.com/2017/03/22/parsing-excel-files-ruby/ looks like a useful post on looking at the libraries that parse spreadsheets. I don't know if any of these have an easy way of checking for cached or hidden data. It might be that we just want to use our own parsing specifically for identifying these rather than parsing all the data into ruby objects.

On 25 January 2023, @garethrees wrote

Do we need to consider both xls and xlsx? If we're not getting sent many xls files these days then it's probably not worth any extra effort to handle them – will want to graph the distribution of these is per year.

xls = FoiAttachment.where("filename ILIKE '%.xls'").group("TO_CHAR(created_at, 'YYYY')").count
xlsx = FoiAttachment.where("filename ILIKE '%.xlsx'").group("TO_CHAR(created_at,
'YYYY')").count

str = CSV.generate do |csv|
csv << %w(year xls xlsx)

xls.keys.compact.each do |year|
csv << [year, xls[year], xlsx[year]]
end
end

year xls xlsx 2012 7568 1043 2013 2037 1047 2014 1403 1758 2015 964 2179 2016 934 2734 2017 1272 5018 2018 9519 12650 2019 1024 8155 2020 828 6716 2021 582 8318 2022 498 7937 2023 54 470

On 4 September 2023 @garethrees wrote: Couple of related issues:

4861 contains some notes on looking inside a docx which I assume should be similar.
3233 has some exploration of excel file contents

On 4 September 2023 @garethrees wrote:

Here's a sketch:

def xlsx_contains_hidden_data?
   return false unless extension == 'xlsx'
   spreadsheet_contains_hidden_data(xlsx_extracted_data)
end
def xlsx_extracted_data
   # hack in extracting & peeking in to xlsx
end
def xls_contains_hidden_data?
   return false unless extension == 'xls'
   spreadsheet_contains_hidden_data(xls_indexed_data)
end
def xls_indexed_data
   # or could do a full extract of the data from the file?
   incoming_message.cached_attachment_text_clipped
end
def spreadsheet_contains_hidden_data?(data)
   return true if data =~ /pivotCache/
   return true if data =~ /state="hidden"/
   return true if data =~ /hidden="1"/
   # etc...
   false
end

On the CLI, for XLS we could do:

libreoffice --headless --convert-to xlsx your_file.xls
unzip your_file.xlsx -d your_file
# grep for signs of problem, e.g:
grep -r 'hidden="1"' your_file/

I'm not quite sure how Alaveteli is extracting XLS in such a way that we can currently identify certain issues. Seems to be using strings but that didn't allow me to identify an issue on a safe sample file which I could identify by converting to xlsx and grepping.

On 23 October 2023 @mdeuk wrote:

Update: Actailly, converting to xlsx is probably the easiest here libreoffice --headless --convert-to xlsx /tmp/#{@file}

Just a query - would this persist on the "finished article" - e.g. what a user sees[1], or would it only be a temporary thing?

Footnotes

A censor rule works on legacy Microsoft formats (e.g. .doc / .xls), but doesn't work on the OpenXML variants. ↩

On 24 October 2023 @garethrees wrote: Not intending to present a converted attachment at this time. I think the censor rules working on the OpenXML variants is possible – it's just code, right?! 😅 – but that's one for #1156.

On 20 December 2023, @garethrees wrote: For reference as of time of writing

FoiAttachment.where("filename ILIKE '%.xls'").count
# => 27146
FoiAttachment.where("filename ILIKE '%.xlsx'").count
# => 67141

On 4 April 2024 @garethrees wrote: Just dropping an improved script to group by IncomingMessage#created_at rather than FoiAttachment#created_at to get a truer representation of spreadsheets received per year.

xls = FoiAttachment.joins(:incoming_message).references(:incoming_messages).where("filename
ILIKE '%.xls'").group("TO_CHAR(incoming_messages.created_at, 'YYYY')").count
xlsx = FoiAttachment.joins(:incoming_message).references(:incoming_messages).where("filename
ILIKE '%.xlsx'").group("TO_CHAR(incoming_messages.created_at, 'YYYY')").count

str = CSV.generate do |csv|
     csv << %w(year xls xlsx total)

     xls.keys.compact.each do |year|
          year_xls = xls.fetch(year, 0)
          year_xlsx = xlsx.fetch(year, 0)
          total = year_xls + year_xlsx
          csv << [year, year_xls, year_xlsx, total]
     end
end

mysociety / alaveteli

Consider practicality of generating alerts when Excel files containing hidden data are released #8231

4861 contains some notes on looking inside a docx which I assume should be similar.

3233 has some exploration of excel file contents

Footnotes