mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Excel .xlsx file corrupted #3233

Open kingqueen3065 opened 8 years ago

kingqueen3065 commented 8 years ago

The spreadsheet attached to this response https://www.whatdotheyknow.com/request/complete_non_residential_busines_90#incoming-804454 won't open when downloaded from the site, but will when extracted from the original email; it appears that Alaveteli is in some way corrupting it.

garethrees commented 8 years ago

I've just re-parsed the attachment and it seems okay now (won't work through view as html, but I can successfully open in Apple Preview / Numbers).

Possibly related to https://github.com/mysociety/alaveteli/issues/3106 or similar. Worth keeping an eye out for other cases like this.

RichardTaylor commented 7 years ago

There is another case of an Excel file which works when the whole mail message is imported into a mail application but doesn't work when downloaded via WhatDoTheyKnow in the thread at

https://www.whatdotheyknow.com/request/list_of_industrial_trade_effluen_5

garethrees commented 7 years ago

I can open both of the xlsx files in Preview on Mac, but I don't have Excel installed so can't check there. Might be worth uploading to the file store and adding a comment?

kingqueen3065 commented 7 years ago

I did a direct comparison: I downloaded the XLS spreadsheet from the WhatDoTheyKnow request page, and I then downloaded the original raw email, imported it into my mail client and then saved the spreadsheet. This resulted in two files with the same name and same file size:

161019 Whatdotheyknow com DOWNLOADED.xlsx
161019_Whatdotheyknow_com EXPORTED.xlsx

But a binary comparison reveals that the two files differ. To be precise, 7 hex values differ, at 7 hex addresses. The original values varied, but they were all replaced with the same hex value, 78.

fc /b "161019 Whatdotheyknow com DOWNLOADED.xlsx" "161019_Whatdotheyknow_com EXPORTED.xlsx"
Comparing files 161019 Whatdotheyknow com DOWNLOADED.xlsx and 161019_WHATDOTHEYKNOW_COM EXPORTED.XLSX
000AE38E: 78 37
000AE38F: 78 5A
000AE391: 78 4E
000AE392: 78 50
000AE394: 78 72
000AE395: 78 72
000AE396: 78 72
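For anyone without Windows `fc`, the same comparison can be sketched in Python. This is a minimal sketch and not part of Alaveteli; `diff_bytes` and `format_diff` are hypothetical helper names. It reports differences in the same `offset: byte byte` layout as the `fc /b` output. For reference, 0x78 is ASCII `x`, and the replaced values 37, 5A, 4E, 50 and 72 are ASCII `7`, `Z`, `N`, `P` and `r`. The bytes at 000AE390 and 000AE393 are not listed, so they were identical in both files. That pattern would be consistent with a same-length text substitution applied to the raw bytes, although the output alone does not confirm the cause.

```python
def diff_bytes(a: bytes, b: bytes):
    """Yield (offset, byte_a, byte_b) for every position where a and b differ."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            yield offset, x, y


def format_diff(a: bytes, b: bytes) -> list:
    """Return lines shaped like `fc /b` output, e.g. '000AE38E: 78 37'."""
    lines = [f"{off:08X}: {x:02X} {y:02X}" for off, x, y in diff_bytes(a, b)]
    if len(a) != len(b):
        # zip() stops at the shorter input, so report a length mismatch separately.
        lines.append(f"length differs: {len(a)} vs {len(b)}")
    return lines
```

To use it on real files, read each with `pathlib.Path(name).read_bytes()` and print the lines returned by `format_diff`.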

It is clear that Alaveteli or other software has a bug which is altering XLS files attached to emails from public authorities, and that alteration means that the file can be rejected by (some) software as corrupt. One also wonders if the bug could be occasionally resulting in alteration of a value in a spreadsheet, which would mean that the information we're publishing would not be the information the authority has provided, which is worrying.

RichardTaylor commented 7 years ago

The file "Payroll costs 2015 2016 300117.xlsx" at https://www.whatdotheyknow.com/request/staffing_costs_january_2016_to_d#incoming-935704

doesn't open in Preview or Excel on OSX when downloaded from WhatDoTheyKnow but does open in Excel when the raw email is imported into an email application and the file is obtained from there.

RichardTaylor commented 7 years ago

Files released at: https://www.whatdotheyknow.com/request/ambulance_response_times_15#incoming-938936 and https://www.whatdotheyknow.com/request/ambulance_response_times_15#incoming-938937

don't open in Excel.

Importing the raw message files into Mail app for OSX doesn't work for these messages.

RichardTaylor commented 7 years ago

Probably another example at:

https://www.whatdotheyknow.com/request/list_of_all_multi_academy_trusts_4#incoming-930343

perfunc commented 7 years ago

The zip enclosure is messed up. XLSX are basically zip files so you can use your basic ZIP tools to test:

$ unzip -t Accounting\ Officers\ with\ Trust\ address\ 08\ 12\ 16.xlsx
Archive:  /Users/perfunc/Downloads/Accounting Officers with Trust address 08 12 16.xlsx
    testing: [Content_Types].xml      OK
    testing: _rels/.rels              OK
    testing: xl/_rels/workbook.xml.rels   OK
    testing: xl/workbook.xml          OK
    testing: xl/sharedStrings.xml     OK
    testing: xl/worksheets/_rels/sheet1.xml.rels   OK
    testing: xl/theme/theme1.xml      OK
    testing: xl/styles.xml            OK
    testing: xl/worksheets/sheet1.xml   bad CRC f5380700  (should be cd8ff738)
    testing: docProps/app.xml         OK
    testing: docProps/core.xml        OK
    testing: xl/printerSettings/printerSettings1.bin   OK
    testing: docProps/custom.xml      OK
At least one error was detected in /Users/perfunc/Downloads/Accounting Officers with Trust address 08 12 16.xlsx.

Other tools report an uncompressed data size mismatch.
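The same per-member check can be run without the `unzip` binary. Below is a minimal sketch using Python's standard `zipfile` module, which verifies each member's CRC-32 when the member is read to the end. `check_zip` is a hypothetical helper name. If the central directory itself is damaged, `zipfile.ZipFile` raises `zipfile.BadZipFile` before any member is tested, and this sketch does not catch that.

```python
import io
import zipfile
import zlib


def check_zip(data: bytes) -> dict:
    """Map each member name to "OK" or to a short error description."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            try:
                with zf.open(info) as member:
                    # Reading to EOF makes zipfile compare the CRC-32.
                    while member.read(1 << 20):
                        pass
                results[info.filename] = "OK"
            except (zipfile.BadZipFile, zlib.error) as exc:
                results[info.filename] = str(exc)
    return results
```

Any entry whose value is not `"OK"` corresponds to a member that `unzip -t` would flag.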

kingqueen3065 commented 7 years ago

Another one: https://www.whatdotheyknow.com/request/401222/response/975785/attach/5/Bristol%20TM%20SX%20BCD.xlsx

garethrees commented 7 years ago

More info in https://github.com/mysociety/alaveteli/issues/3118

RichardTaylor commented 7 years ago

There's another example at:

https://www.whatdotheyknow.com/request/a_request_for_data_relating_to_s

lizconlan commented 7 years ago

Possibly related to https://github.com/rubyzip/rubyzip#modify-docx-file-with-rubyzip

RichardTaylor commented 6 years ago

Looks like another example at:

https://www.whatdotheyknow.com/request/land_rover_para_recce_erm_list#incoming-1040744

RichardTaylor commented 6 years ago

Noting some concern among WhatDoTheyKnow volunteers that this bug might be prompting public bodies to send responses outside of WhatDoTheyKnow.

One wonders if they are aware of the .xlsx mangling bug on WhatDoTheyKnow and want to avoid it, though that doesn't affect .csv...

To be fair, I get hit by that xlsx bug fairly often, and maybe they do too. It's a fair point... And it looks as if they want to send an Excel file.

Also just to add a link to #458 as it looks related.

garethrees commented 6 years ago

Thanks for continuing to handle this. It's high on our list of bugs to fix – we think https://github.com/mysociety/alaveteli/pull/4224 might be a contributing factor, so once that's reviewed we can see if it makes a difference.

lizconlan commented 6 years ago

> we think #4224 might be a contributing factor

I don't think #4224 by itself will fix it, we will probably still need to adopt this fix/workaround https://github.com/rubyzip/rubyzip#modify-docx-file-with-rubyzip

garethrees commented 6 years ago

And, they did actually send it to the site: https://www.whatdotheyknow.com/request/complete_non_residential_busines_873#incoming-1051512

But also to me ... As a follow-up, I asked why they publish PDFs and not Excel or CSV and the response is that they're worried that people will edit the data and then lie about what the authority actually published. Sigh.

turukawa commented 6 years ago

An example, with glitches in the text part of the raw email response:

https://www.whatdotheyknow.com/request/complete_non_residential_busines_778

RichardTaylor commented 6 years ago

Following the previous comment - I downloaded the raw incoming message at https://www.whatdotheyknow.com/request/complete_non_residential_busines_778#incoming-1058857

and imported it into the mail.app on OSX and the attachment opened fine.

The same was the case for the attachment at: https://www.whatdotheyknow.com/request/complete_non_residential_busines_778#incoming-1058757

I can't see any glitches in the raw email response either; there are some Microsoft formats used in the message but they're used in a routine manner which WhatDoTheyKnow handles appropriately. The "blocked" references relate to a Microsoft Outlook "feature".

RichardTaylor commented 5 years ago

Another example, this one is .xls not .xlsx

https://www.whatdotheyknow.com/request/licenced_premises_12#incoming-1274207

RichardTaylor commented 5 years ago

A user has reported Excel gave an error when the file in question at https://www.whatdotheyknow.com/request/complete_non_residential_busines_1584#incoming-1395720 was opened. The error contains:

must not contain '<'. Line 2,

and

An attribute value must not contain '<'.

which might give a hint as to how Alaveteli is corrupting the file.

The full error message:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error038160_01.xml</logFileName><summary>Errors were detected in file 'Birmingham NDR all properties Q1 EXTRACT DATE 03 07 2019.xlsx'</summary><removedParts><removedPart>Replaced Part: /xl/worksheets/sheet1.xml part with XML error.  An attribute value must not contain '&lt;'. Line 2, column 14652772.</removedPart></removedParts></recoveryLog>
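Excel's recovery log points at an XML well-formedness error inside `/xl/worksheets/sheet1.xml`. One way to look for that kind of damage without Excel is to try parsing every XML part of the workbook. The following is a minimal sketch using only the Python standard library; `malformed_xml_parts` is a hypothetical helper name, and the `.xml`/`.rels` suffix filter is an assumption about which members are XML. A part that fails its CRC check is reported as unreadable rather than parsed.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
import zlib


def malformed_xml_parts(data: bytes) -> dict:
    """Return {member name: error} for XML parts that cannot be read or parsed."""
    problems = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if not name.endswith((".xml", ".rels")):
                continue  # skip binary parts such as printerSettings1.bin
            try:
                ET.fromstring(zf.read(name))
            except ET.ParseError as exc:
                # The message includes the line and column of the failure.
                problems[name] = f"XML error: {exc}"
            except (zipfile.BadZipFile, zlib.error) as exc:
                problems[name] = f"unreadable: {exc}"
    return problems
```

An empty result means every XML part was readable and well-formed; it does not prove that Excel will accept the file.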

RichardTaylor commented 4 years ago

Corrupted .xlsx documents reported at

https://www.whatdotheyknow.com/request/complete_non_residential_busines_1733#incoming-1455963 [08.Appendix N 9986 full list September 19.xlsx]

and

https://www.whatdotheyknow.com/request/complete_non_residential_busines_1716#incoming-1452678 [attachment.xlsx]

garethrees commented 4 years ago

About the only thing I can think of is non-ASCII characters in the spreadsheet. We get that every now and then, and it does cause issues, e.g. "Erland F Sanderson Associates Ltd"

The Excel parser may be running into a wall when it hits a special character and breaking out early, leading to a badly-structured file.

I use a Python library called Unidecode to push problem characters to their nearest ASCII equivalent. I’m sure there must be something similar for Ruby.

NyashaDuri commented 4 years ago

Corrupted .xlsx document reported at:

https://www.whatdotheyknow.com/request/complete_non_residential_busines_1765#incoming-1531483 [FOI 1716905.xlsx]

gbp commented 4 years ago

> Corrupted .xlsx document reported at:
>
> https://www.whatdotheyknow.com/request/complete_non_residential_busines_1765#incoming-1531483 [FOI 1716905.xlsx]

Can't see anything wrong with this one. Opens fine for me and the file hash compared to the attachment in the raw email is the same.

turukawa commented 4 years ago

@gbp I can download but not open the file. It's corrupted. Usually you can open it from the raw email (in which case, please send it to me) when it fails like this.

gbp commented 4 years ago

> @gbp I can download but not open the file. It's corrupted. Usually you can open it from the raw email (in which case, please send it to me) when it fails like this.

@turukawa the files are identical but will forward to your email address just in case:

$ shasum *.xlsx
a7679873ddafdda939e35e52e038cbb35ca01993  FOI 1716905 - downloaded.xlsx
a7679873ddafdda939e35e52e038cbb35ca01993  FOI 1716905 - raw email.xlsx
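Matching digests mean the downloaded file is byte-for-byte the same as the attachment in the raw email, so in this case any problem opening it is already present in the file as sent. The same check can be sketched in Python for systems without `shasum`. The helper names `sha1_of_file` and `files_identical` are hypothetical.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a: str, path_b: str) -> bool:
    """True when both files have the same SHA-1 digest."""
    return sha1_of_file(path_a) == sha1_of_file(path_b)
```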

mdeuk commented 4 years ago

> > Corrupted .xlsx document reported at: https://www.whatdotheyknow.com/request/complete_non_residential_busines_1765#incoming-1531483 [FOI 1716905.xlsx]
>
> Can't see anything wrong with this one. Opens fine for me and the file hash compared to the attachment in the raw email is the same.

I can't open either version (Alaveteli processed or raw) in Excel (Mac version 16.36 20030201) without the classic error message:

"We found a problem with some content in ’FOI 1716905.xlsx’. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes."

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Repair Result to FOI 17169054.xml
Errors were detected in file ’/Users/mde/Desktop/temp1/FOI 1716905.xlsx’
Excel completed file level validation and repair. Some parts of this workbook may have been repaired or discarded.

The file seems to open okay outside of Excel (e.g. Google Sheets), therefore I'd surmise it's either a bug in the file format itself or a bug in Excel. The latter seems probable given the o365 version of Excel can't [process the file either](https://view.officeapps.live.com/op/view.aspx?src=https://www.whatdotheyknow.com/request/638161/response/1531483/attach/3/FOI%201716905.xlsx?cookie_passthrough=1).

MattK1234 commented 4 years ago

A further report of the Excel download issue at https://www.whatdotheyknow.com/request/658552/response/1559006/attach/6/NNDR%20Live%20accounts%20with%20amount%20charged%20empty%20status.xlsx?cookie_passthrough=1

for request https://www.whatdotheyknow.com/request/complete_non_residential_busines_1826#incoming-1559006

We have provided the user with a copy from the raw email, which appears to open successfully.

RichardTaylor commented 3 years ago

There is another case of an Excel file which works when the whole mail message is imported into a mail application but doesn't work when downloaded via WhatDoTheyKnow at

https://www.whatdotheyknow.com/request/698207/response/1659983/attach/5/NDR%20All%20Properties%20Q2%20extract%20date%2001%2010%202020.xlsx.xlsx?cookie_passthrough=1

RichardTaylor commented 3 years ago

There is another case of an Excel file which works when the whole mail message is imported into a mail application but doesn't work when downloaded via WhatDoTheyKnow at

https://www.whatdotheyknow.com/request/foi_2020_0068157_file_provided_c#incoming-1697111

RichardTaylor commented 3 years ago

Suspected further case

https://www.whatdotheyknow.com/request/occupancy_for_the_1920_teaching_3#comment-95315

RichardTaylor commented 3 years ago

Suspected further case

https://www.whatdotheyknow.com/request/complete_non_residential_busines_2231#incoming-1784266

RichardTaylor commented 2 years ago

We have a case of a corrupted zip file which was attached to a response at

https://www.whatdotheyknow.com/request/email_address_for_educational_vi#incoming-1955294

the file works when the raw email is opened in a mail application.

Do we want to note this here and generalise the ticket to ~"Attachment in response corrupted by Alaveteli" ?

sallytay commented 2 years ago

As noted in the Support inbox, another example of this here https://www.whatdotheyknow.com/request/complete_non_residential_busines_2388#incoming-2036662

mdeuk commented 2 years ago

A recent example is contained in https://www.whatdotheyknow.com/request/central_ticket_office_2#incoming-2074623 - the Excel files are corrupted when downloaded from the request itself (and as such, the Google viewer doesn't render them correctly); but, if downloaded from the raw email in /admin, the files themselves are fine.

garethrees commented 2 years ago

> A recent example is contained in https://www.whatdotheyknow.com/request/central_ticket_office_2#incoming-2074623 - the Excel files are corrupted when downloaded…

Interestingly I can open them in macOS' Numbers.

mdeuk commented 2 years ago

> > A recent example is contained in https://www.whatdotheyknow.com/request/central_ticket_office_2#incoming-2074623 - the Excel files are corrupted when downloaded…
>
> Interestingly I can open them in macOS' Numbers.

Yeah that's quite common - albeit, I don't think we've identified why. I assume something within the file structure which Numbers (and possibly other non Microsoft clients) doesn't care about is being broken. Perhaps it's a case of Excel expecting something to exist in a given format, and not playing ball otherwise.

turukawa commented 2 years ago

@garethrees @mdeuk I usually ask for volunteers to open in Numbers, and resave, then send to me when there's an issue. I'm going to assume whoever developed this feature initially had Numbers to test with, but not Excel. One of the earlier suggestions was an inappropriately closed XML tag after whatever redactions happen in the app.
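To illustrate why a redaction step could break a spreadsheet even when it keeps the file size unchanged, here is a toy demonstration. Everything in it is an assumption made for illustration: `EMAIL_RE` and `mask_emails` are hypothetical and are not Alaveteli's actual redaction code, and the archive is built with `ZIP_STORED` so that the text is visible in the raw bytes and the result is repeatable. It shows that overwriting bytes inside a finished zip container, rather than rewriting the member through a zip library, leaves the stored CRC-32 stale.

```python
import io
import re
import zipfile

# Hypothetical pattern for illustration; NOT Alaveteli's real masking rule.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}")


def mask_emails(raw: bytes) -> bytes:
    """Overwrite each match with b"x", keeping "@" and "." so the length is unchanged."""
    def repl(match):
        return bytes(b if b in b"@." else ord("x") for b in match.group())
    return EMAIL_RE.sub(repl, raw)


def build_archive() -> bytes:
    """Build a tiny uncompressed zip with one XML member containing an address."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("xl/sharedStrings.xml", "<t>foi@example.org</t>")
    return buf.getvalue()


def first_bad_member(raw: bytes):
    """Return the first member failing its CRC check, or None if all pass."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.testzip()


original = build_archive()
masked = mask_emails(original)  # same length, but the member's CRC no longer matches
```

Calling `first_bad_member(original)` and `first_bad_member(masked)` shows the difference. Real .xlsx parts are normally deflate-compressed, so a match there would be a chance byte sequence in the compressed stream, but the container would be invalidated in the same way.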

FOIMonkey commented 2 years ago

+1 Same thing has happened here - the file works when opened from the raw email, but won't open when downloaded from the site: https://www.whatdotheyknow.com/request/titles_owned_by_overseas_compani_2#incoming-2114604

HelenWDTK commented 1 year ago

+1 Another example today. We were alerted by a public authority who were contacted by the requester due to them being unable to access the file. They said that more than one request had been affected recently: https://www.whatdotheyknow.com/request/social_housing_and_right_to_buy_175

edit - it wouldn't open in Numbers.

mdeuk commented 1 year ago

+1, a user reports that the file at https://www.whatdotheyknow.com/request/charities_with_turnover_of_1_mil#incoming-373688 has the same issue.

Curiously, this is in the <2004 Excel format, rather than "Office Open XML", but the bug seems to manifest in the same way. The file contained in the raw email opens without issue in Excel for Mac.

HelenWDTK commented 1 year ago

+1 We've had another instance of this here: https://www.whatdotheyknow.com/request/waste_container_orders#incoming-2307356

HelenWDTK commented 2 months ago

+1 happened again today