Open kingqueen3065 opened 8 years ago
I've just re-parsed the attachment and it seems okay now (won't work through view as html, but I can successfully open in Apple Preview / Numbers).
Possibly related to https://github.com/mysociety/alaveteli/issues/3106 or similar. Worth keeping an eye out for other cases like this.
There is another case of an Excel file which works when the whole mail message is imported into a mail application but doesn't work when downloaded via WhatDoTheyKnow in the thread at
https://www.whatdotheyknow.com/request/list_of_industrial_trade_effluen_5
I can open both of the xlsx files in Preview on Mac, but I don't have Excel installed so can't check there. Might be worth uploading to the file store and adding a comment?
I did a direct comparison: I downloaded the XLS spreadsheet from the WhatDoTheyKnow request page, and I then downloaded the original raw email, imported it into my mail client and then saved the spreadsheet. This resulted in two files with the same name and same file size:
161019 Whatdotheyknow com DOWNLOADED.xlsx 161019_Whatdotheyknow_com EXPORTED.xlsx
But a binary comparison reveals that the two files differ. To be precise, 7 bytes differ, at 7 offsets. The original values varied, but in the downloaded file they were all replaced with the same hex value, 78.
fc /b "161019 Whatdotheyknow com DOWNLOADED.xlsx" "161019_Whatdotheyknow_com EXPORTED.xlsx"
Comparing files 161019 Whatdotheyknow com DOWNLOADED.xlsx and 161019_WHATDOTHEYKNOW_COM EXPORTED.XLSX
000AE38E: 78 37
000AE38F: 78 5A
000AE391: 78 4E
000AE392: 78 50
000AE394: 78 72
000AE395: 78 72
000AE396: 78 72
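For anyone without `fc` to hand, the same byte-by-byte comparison can be sketched in Python. This is only an illustration: the names `diff_bytes` and `format_diff` are made up here, and it assumes both files are the same size, as was the case above. Note that hex 78 is the ASCII code for a lowercase "x".

```python
from pathlib import Path


def diff_bytes(path_a, path_b):
    """Return (offset, byte_a, byte_b) for every position where the files differ."""
    data_a = Path(path_a).read_bytes()
    data_b = Path(path_b).read_bytes()
    if len(data_a) != len(data_b):
        # An offset-by-offset diff only makes sense for equal-length files.
        raise ValueError("files differ in length")
    return [
        (offset, a, b)
        for offset, (a, b) in enumerate(zip(data_a, data_b))
        if a != b
    ]


def format_diff(differences):
    """Render differences in the same layout as `fc /b`: offset, then both bytes."""
    return [f"{offset:08X}: {a:02X} {b:02X}" for offset, a, b in differences]
```

Calling `format_diff(diff_bytes(downloaded, exported))` with the two file paths gives one line per differing byte, with the downloaded file's value first.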
It is clear that Alaveteli or other software has a bug which is altering XLS files attached to emails from public authorities, and that alteration means that the file can be rejected by (some) software as corrupt. One also wonders if the bug could be occasionally resulting in alteration of a value in a spreadsheet, which would mean that the information we're publishing would not be the information the authority has provided, which is worrying.
The file "Payroll costs 2015 2016 300117.xlsx" at https://www.whatdotheyknow.com/request/staffing_costs_january_2016_to_d#incoming-935704
doesn't open in Preview or Excel on OSX when downloaded from WhatDoTheyKnow but does open in Excel when the raw email is imported into an email application and the file is obtained from there.
Files released at: https://www.whatdotheyknow.com/request/ambulance_response_times_15#incoming-938936 and https://www.whatdotheyknow.com/request/ambulance_response_times_15#incoming-938937
don't open in Excel.
Importing the raw message files into Mail app for OSX doesn't work for these messages.
Probably another example at:
https://www.whatdotheyknow.com/request/list_of_all_multi_academy_trusts_4#incoming-930343
The zip enclosure is messed up. XLSX files are essentially ZIP archives, so you can use standard ZIP tools to test them:
$ unzip -t Accounting\ Officers\ with\ Trust\ address\ 08\ 12\ 16.xlsx
Archive: /Users/perfunc/Downloads/Accounting Officers with Trust address 08 12 16.xlsx
testing: [Content_Types].xml OK
testing: _rels/.rels OK
testing: xl/_rels/workbook.xml.rels OK
testing: xl/workbook.xml OK
testing: xl/sharedStrings.xml OK
testing: xl/worksheets/_rels/sheet1.xml.rels OK
testing: xl/theme/theme1.xml OK
testing: xl/styles.xml OK
testing: xl/worksheets/sheet1.xml bad CRC f5380700 (should be cd8ff738)
testing: docProps/app.xml OK
testing: docProps/core.xml OK
testing: xl/printerSettings/printerSettings1.bin OK
testing: docProps/custom.xml OK
At least one error was detected in /Users/perfunc/Downloads/Accounting Officers with Trust address 08 12 16.xlsx.
Other tools report an uncompressed data size mismatch.
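The same check can be scripted. A minimal sketch using Python's standard `zipfile` module, whose `testzip()` method reads every member and returns the name of the first one with a bad CRC or header, or `None` if all are fine. The function name `first_bad_member` is made up for illustration.

```python
import zipfile


def first_bad_member(xlsx_path):
    """Return the name of the first archive member that fails its integrity check, or None."""
    with zipfile.ZipFile(xlsx_path) as archive:
        # testzip() decompresses each member and compares its CRC-32
        # against the value recorded in the archive.
        return archive.testzip()
```

For a file damaged like the one above, this would be expected to return the name of the affected part (there, `xl/worksheets/sheet1.xml`).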
More info in https://github.com/mysociety/alaveteli/issues/3118
There's another example at:
https://www.whatdotheyknow.com/request/a_request_for_data_relating_to_s
Possibly related to https://github.com/rubyzip/rubyzip#modify-docx-file-with-rubyzip
Looks like another example at:
https://www.whatdotheyknow.com/request/land_rover_para_recce_erm_list#incoming-1040744
Noting some concern among WhatDoTheyKnow volunteers that this bug might be prompting public bodies to send responses outside of WhatDoTheyKnow.
One wonders if they are aware of the .xlsx mangling bug on WhatDoTheyKnow and want to avoid it, though that doesn't affect .csv...
To be fair, I get hit by that xlsx bug fairly often, and maybe they do too. It's a fair point... And it looks as if they want to send an Excel file.
Also just to add a link to #458 as it looks related.
Thanks for continuing to handle this. It's high on our list of bugs to fix – we think https://github.com/mysociety/alaveteli/pull/4224 might be a contributing factor, so once that's reviewed we can see if it makes a difference.
we think #4224 might be a contributing factor
I don't think #4224 by itself will fix it, we will probably still need to adopt this fix/workaround https://github.com/rubyzip/rubyzip#modify-docx-file-with-rubyzip
And, they did actually send it to the site: https://www.whatdotheyknow.com/request/complete_non_residential_busines_873#incoming-1051512
But also to me ... As a follow-up, I asked why they publish PDFs and not Excel or CSV and the response is that they're worried that people will edit the data and then lie about what the authority actually published. Sigh.
An example, with glitches in the text part of the raw email response:
https://www.whatdotheyknow.com/request/complete_non_residential_busines_778
Following the previous comment - I downloaded the raw incoming message at https://www.whatdotheyknow.com/request/complete_non_residential_busines_778#incoming-1058857
and imported it into the mail.app on OSX and the attachment opened fine.
The same was the case for the attachment at: https://www.whatdotheyknow.com/request/complete_non_residential_busines_778#incoming-1058757
I can't see any glitches in the raw email response either; there are some Microsoft formats used in the message but they're used in a routine manner which WhatDoTheyKnow handles appropriately. The "blocked" references relate to a Microsoft Outlook "feature".
Another example, this one is .xls not .xlsx
https://www.whatdotheyknow.com/request/licenced_premises_12#incoming-1274207
A user has reported Excel gave an error when the file in question at https://www.whatdotheyknow.com/request/complete_non_residential_busines_1584#incoming-1395720 was opened. The error contains:
must not contain '<'. Line 2,
and
An attribute value must not contain '<'.
which might give a hint as to how Alaveteli is corrupting the file.
The full error message:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error038160_01.xml</logFileName><summary>Errors were detected in file 'Birmingham NDR all properties Q1 EXTRACT DATE 03 07 2019.xlsx'</summary><removedParts><removedPart>Replaced Part: /xl/worksheets/sheet1.xml part with XML error. An attribute value must not contain '<'. Line 2, column 14652772.</removedPart></removedParts></recoveryLog>
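To locate this kind of error without Excel, one option is to parse each XML part of the workbook directly. The following is a rough sketch using only the Python standard library; `malformed_xml_parts` is a hypothetical helper name, and it assumes the parts of interest end in `.xml` or `.rels`. If a part is damaged badly enough to fail extraction (for example a CRC mismatch), that failure is reported instead of an XML position.

```python
import zipfile
import zlib
import xml.etree.ElementTree as ET


def malformed_xml_parts(xlsx_path):
    """Map each XML part that cannot be extracted or parsed to a description of the failure."""
    problems = {}
    with zipfile.ZipFile(xlsx_path) as archive:
        for name in archive.namelist():
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                ET.fromstring(archive.read(name))
            except (zipfile.BadZipFile, zlib.error) as exc:
                # The member itself is damaged, so its XML cannot be inspected.
                problems[name] = f"could not extract: {exc}"
            except ET.ParseError as exc:
                line, column = exc.position
                problems[name] = f"XML error at line {line}, column {column}: {exc}"
    return problems
```

An empty result means every XML part was extracted and parsed cleanly; it does not prove that Excel will accept the file.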
Corrupted .xlsx documents reported at
https://www.whatdotheyknow.com/request/complete_non_residential_busines_1733#incoming-1455963 [08.Appendix N 9986 full list September 19.xlsx]
and
https://www.whatdotheyknow.com/request/complete_non_residential_busines_1716#incoming-1452678 [attachment.xlsx]
About the only thing I can think of is non-ASCII characters in the spreadsheet. We get that every now and then, and it does cause issues, e.g.
"Erland F Sanderson Associates Ltd"
The software's Excel parser may be running into a wall when it hits a special character and breaking out early, leading to a badly-structured file.
I use a Python library called Unidecode to push problem characters to their nearest ASCII equivalent. I'm sure there must be something similar for Ruby.
Corrupted .xlsx document reported at:
https://www.whatdotheyknow.com/request/complete_non_residential_busines_1765#incoming-1531483 [FOI 1716905.xlsx]
Can't see anything wrong with this one. Opens fine for me, and the hash of the downloaded file matches that of the attachment in the raw email.
@gbp I can download but not open the file. It's corrupted. Usually you can open it from the raw email (in which case, please send it to me) when it fails like this.
@turukawa the files are identical but will forward to your email address just in case:
$ shasum *.xlsx
a7679873ddafdda939e35e52e038cbb35ca01993 FOI 1716905 - downloaded.xlsx
a7679873ddafdda939e35e52e038cbb35ca01993 FOI 1716905 - raw email.xlsx
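An equivalent check in Python, for anyone without `shasum` available. This is a small sketch using the standard `hashlib` module; the function names `sha1_of` and `same_content` are illustrative only. `shasum` defaults to SHA-1, so the digests should match its output.

```python
import hashlib


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash identically, i.e. the download was not altered."""
    return sha1_of(path_a) == sha1_of(path_b)
```

Matching digests show that the site served the attachment byte-for-byte as it appears in the raw email, so any failure to open it lies elsewhere.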
I can't open either version (Alaveteli processed or raw) in Excel (Mac version 16.36 20030201) without the classic error message:
"We found a problem with some content in ’FOI 1716905.xlsx’. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes."
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
A further report of the Excel download issue at https://www.whatdotheyknow.com/request/658552/response/1559006/attach/6/NNDR%20Live%20accounts%20with%20amount%20charged%20empty%20status.xlsx?cookie_passthrough=1
for request https://www.whatdotheyknow.com/request/complete_non_residential_busines_1826#incoming-1559006
We have provided the user with a copy from the raw email, which appears to open successfully.
There is another case of an Excel file which works when the whole mail message is imported into a mail application but doesn't work when downloaded via WhatDoTheyKnow at
https://www.whatdotheyknow.com/request/foi_2020_0068157_file_provided_c#incoming-1697111
Suspected further case
https://www.whatdotheyknow.com/request/occupancy_for_the_1920_teaching_3#comment-95315
We have a case of a corrupted zip file which was attached to a response at
https://www.whatdotheyknow.com/request/email_address_for_educational_vi#incoming-1955294
the file works when the raw email is opened in a mail application.
Do we want to note this here and generalise the ticket to ~"Attachment in response corrupted by Alaveteli" ?
As noted in the Support inbox, another example of this here https://www.whatdotheyknow.com/request/complete_non_residential_busines_2388#incoming-2036662
A recent example is contained in https://www.whatdotheyknow.com/request/central_ticket_office_2#incoming-2074623 - the Excel files are corrupted when downloaded from the request itself (and as such, the Google viewer doesn't render them correctly); but, if downloaded from the raw email in /admin, the files themselves are fine.
Interestingly I can open them in macOS' Numbers.
Yeah that's quite common - albeit, I don't think we've identified why. I assume something within the file structure which Numbers (and possibly other non Microsoft clients) doesn't care about is being broken. Perhaps it's a case of Excel expecting something to exist in a given format, and not playing ball otherwise.
@garethrees @mdeuk I usually ask for volunteers to open in Numbers, and resave, then send to me when there's an issue. I'm going to assume whoever developed this feature initially had Numbers to test with, but not Excel. One of the earlier suggestions was an inappropriately closed XML tag after whatever redactions happen in the app.
+1 Same thing has happened here - the file works when opened from the raw email, but won't open when downloaded from the site: https://www.whatdotheyknow.com/request/titles_owned_by_overseas_compani_2#incoming-2114604
+1 Another example today. We were alerted by a public authority who were contacted by the requester due to them being unable to access the file. They said that more than one request had been affected recently: https://www.whatdotheyknow.com/request/social_housing_and_right_to_buy_175
edit - it wouldn't open in numbers.
+1, a user reports that the file at https://www.whatdotheyknow.com/request/charities_with_turnover_of_1_mil#incoming-373688 has the same issue.
Curiously, this is in the <2004 Excel format, rather than "Office Open XML", but the bug seems to manifest in the same way. The file contained in the raw email opens without issue in Excel for Mac.
+1 We've had another instance of this here: https://www.whatdotheyknow.com/request/waste_container_orders#incoming-2307356
+1 happened again today
The spreadsheet attached to this response https://www.whatdotheyknow.com/request/complete_non_residential_busines_90#incoming-804454 won't open when downloaded from the site, but will when extracted from the original email; it appears that Alaveteli is in some way corrupting it.