Open WilliamWDTK opened 2 years ago
Might be something that we can render through the Google Docs viewer.
https://github.com/mysociety/alaveteli/blob/0.40.0.0/app/models/foi_attachment.rb#L257-L274
Also the current icon on the WDTK view is just a question mark:
(from https://www.whatdotheyknow.com/request/the_failures_of_dumfries_and_gal#incoming-1501531)
rather than something like the Word icon:
+1 Most City of London Police responses are .odt files and it's a pain to have to download them every time
Also North Lincs Council.
Google Docs viewer doesn't work, but Microsoft's Office Live View does.
Adding this should be straightforward:
AttachmentToHTML::Adapters::OfficeLiveViewViewer
or whatever we want to call itFoiAttachment#has_office_live_view_viewer?
with the correct content type mappingsAttachmentToHTML#fallback_adapter_for
to wire up the adapter.Microsoft's Live View also seems nicer for XLSX files:
(from this request)
Google handles .csv like this (not ideal), Microsoft doesn't seem to. (From here).
I've taken the liberty of changing the title, which I hope won't be a problem…
For CSV, there's also things like this client-side JS viewer.
Prevalence of file types (according to searches of the format https://www.whatdotheyknow.com/search/filetype:EXTENSION/all
)
Extension | Count (about…) | Note | Example | Current Icon | Suggested icon improvement | Previewable? | Preview suggestion |
---|---|---|---|---|---|---|---|
csv |
48 | I think these might sometimes end up as .csv.txt , but I'm not sure how to find those… (see https://github.com/mysociety/alaveteli/issues/6480) |
This response to this request | (source) | Change to like that for .xls , but with CSV instead of X |
No | (insert suggestion here) |
txt |
8000 | This mostly seems to duplicate the request email | - | (source) | Perhaps add TXT ? |
Yes/no (please select) | (insert suggestion here) |
xls |
9700 | - | - | (source) | I don't think we need to distinguish between .xls and .xlsx |
Yes/no (please select) | (insert suggestion here) |
xlsx |
17000 | - | - | (source) | - | Yes/no (please select) | (insert suggestion here) |
doc |
67000 | - | - | (source) | - | Yes/no (please select) | (insert suggestion here) |
docx |
31000 | - | - | (source) | - | Yes/no (please select) | (insert suggestion here) |
pdf |
220000 | - | - | (source) | - | Yes/no (please select) | (insert suggestion here) |
xml |
24 | - | This unit diagram file (almost 9 MB) in response to this request or this one in some unknown format in response to this other request | (source) | Perhaps make it like that for .html , but put XML below the angled brackets, rather than HTML |
Yes/no (please select) | (insert suggestion here) |
jpg |
40000 | - | - | (source) | Include the extension below the mountains? | Yes/no (please select) | (insert suggestion here) |
png |
29000 | - | - | (source) | Include the extension below the mountains? | Yes/no (please select) | (insert suggestion here) |
switch |
191 | Our favourite! | - | (source) | An icon with a padlocked envelope? Or a frowning face(!) | Yes/no (please select) | (insert suggestion here) |
mp3 |
10 | - | This recording from this request | (source) | Soundwave? | Yes/no (please select) | (insert suggestion here) |
mp4 |
1 | - | This 8MB video demonstrating how to update the TTK app (railway ticketing) on a smartphone in response to this request to LNER | (source) | Camcorder icon, video casette or a clapperboard? | Yes/no (please select) | (insert suggestion here) |
wav |
3 | - | This announcement fragment released in response to this request to SE Trains Limited | (source) | Soundwave? | Yes/no (please select) | (insert suggestion here) |
zip |
1900 | - | This ZIP (7.6MB) released in response to the same request as above | (source) | Indicate with text ZIP |
Yes/no (please select) | (insert suggestion here) |
gz |
1 | - | This file in response to this request for web server access logs | (source) | Like .zip , but with text GZ ? |
Yes/no (please select) | (insert suggestion here) |
html |
4500 | - | - | (source) | Yes/no (please select) | (insert suggestion here) | |
tif |
342 | - | This diagram of a concourse in Gateshead released in response to this request | (source) | Yes/no (please select) | (insert suggestion here) | |
odp |
1 | - | This governance management structure diagram from this request | (source) | Like .ppt /.pptx but with text ODP |
Yes/no (please select) | (insert suggestion here) |
odt |
360 | - | - | (source) | Like .doc /.docx with ODT instead of W ? |
Yes/no (please select) | (insert suggestion here) |
ods |
49 | - | - | (source) | Like .xls /.xlsx with ODS instead of X ? |
Yes/no (please select) | (insert suggestion here) |
rar |
4 | - | This file released in response to this request to the ICO | (source) | Like with .zip , but with RAR text |
Yes/no (please select) | (insert suggestion here) |
ppt |
242 | - | - | (source) | Add text PPT |
Yes/no (please select) | (insert suggestion here) |
pptx |
295 | - | - | (source) | Add text PPT |
Yes/no (please select) | (insert suggestion here) |
rtf |
1400 | - | This request | (source) | Should be text lines? Thought I saw that icon in the code… | Yes/no (please select) | (insert suggestion here) |
pages |
1 | - | This request | (source) | Should be text lines, perhaps. | No | Neither Google nor Microsoft previewers work, though I notice if I try to open it, I can access a preview image () |
@WilliamWDTK That table is fascinating, thanks for putting it together. It's interesting to see that we're closing in on 250,000 PDFs.
There are also some other extension types that you can't use the search to find (mainly shapefiles) eg https://www.whatdotheyknow.com/request/school_catchment_areas_2#incoming-619600 or https://www.whatdotheyknow.com/request/public_rights_of_way_data_4#incoming-2023061
Google Docs viewer doesn't work, but Microsoft's Office Live View does.
Adding this should be straightforward:
1. Add an `AttachmentToHTML::Adapters::OfficeLiveViewViewer` or whatever we want to call it 2. Add `FoiAttachment#has_office_live_view_viewer?` with the correct content type mappings 3. Add a conditional to `AttachmentToHTML#fallback_adapter_for` to wire up the adapter.
We'll probably want a WDTK theme ticket here for any privacy notice changes - as we'd be presenting this in a WDTK branded window, so we need to make the user aware of the cookies being set by this service.
A workaround could perhaps be to simply link directly to it, rather than wrapping around an iframe - but that might be a bit clunky. Worth noting, perhaps, that we likely should have a similar disclosure for Google Docs viewer - which sets cookies too - and like the Microsoft variety, doesn't give any visible indication at time of loading.
I've not raised such a ticket yet, as I'm not sure how close we are to this being implemented.
I've not raised such a ticket yet, as I'm not sure how close we are to this being implemented.
No immediate plans, so this can be created when we start implementing.
Not sure if this is the appropriate ticket to raise this on, but for this attachment here, the Google preview has mucked up the formatting, while Microsoft's previewer keeps it nice.
.eml
and .msg
I think this was raised elsewhere, possibly by @FOIMonkey.
@RichardTaylor mentioned in #3338. Not sure why I thought it was @FOIMonkey.
For CSV, there's also things like this client-side JS viewer.
I'd actually quite like us to look at Datasette [^1] - the 'lite' version is entirely browser based, but it might be trivial to spin up an instance of it. It's a powerful tool for data analysis, and given much of the open data we've got in CSVs can be rather bulky, this seems a logical way to present things with minimal overhead.
The product is already used elsewhere in mySociety [^2], so it'd be interesting to look at what it'd take to have Alaveteli serve files here - do we send the right headers / flags^3 out?
[^1]: A recent claim to fame (or infamy!) for this is @simonw's ScotRail random apology generator. [^2]: e.g. https://mysociety.github.io/wdtk_authorities_list/datasets/whatdotheyknow_authorities_dataset/latest
I'd actually quite like us to look at Datasette
Me too! https://twitter.com/_gareth/status/1521918004690399232. I did spike this out, but at the time Datasette lite didn't accept a CSV URL, only an sqlite URL, so that was annoying. Seems like it would be pretty easy to add now.
commit 3ae66c5825bd3156d24f5230b5cc49e79488800e
Author: Gareth Rees <gareth@>
Date: Wed May 4 20:09:24 2022 +0100
Basically working CSV to SQL
Downloads as "foo.csv", but then renaming to "foo.sqlite" works as
expected:
sqlite> select * from foo;
1|2
3|4
diff --git a/Gemfile b/Gemfile
index 1e93ef0d6..87876c613 100644
--- a/Gemfile
+++ b/Gemfile
@@ -175,6 +175,8 @@ if rails_upgrade? && RUBY_VERSION < '3.1'
gem 'uri', '0.10.0'
end
+gem 'csv_to_sqlite3', require: 'csv_to_sqlite'
+
group :test do
gem 'fivemat', '~> 1.3.7'
gem 'webmock', '~> 3.14.0'
diff --git a/Gemfile.lock b/Gemfile.lock
index 332c72c80..4a3ac6dcb 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -169,6 +169,9 @@ GEM
crack (0.4.5)
rexml
crass (1.0.6)
+ csv_to_sqlite3 (0.1.4)
+ sqlite3 (~> 1.4, >= 1.4.2)
+ tty-progressbar
daemons (1.4.0)
dalli (3.2.2)
dante (0.2.0)
@@ -467,7 +470,9 @@ GEM
actionpack (>= 5.2)
activesupport (>= 5.2)
sprockets (>= 3.0.0)
+ sqlite3 (1.4.2)
statistics2 (0.54)
+ strings-ansi (0.2.0)
stripe (5.55.0)
syslog_protocol (0.9.2)
text (1.3.1)
@@ -478,6 +483,13 @@ GEM
thor (1.2.1)
tilt (2.0.10)
trailblazer-option (0.1.2)
+ tty-cursor (0.7.1)
+ tty-progressbar (0.18.2)
+ strings-ansi (~> 0.2)
+ tty-cursor (~> 0.7)
+ tty-screen (~> 0.8)
+ unicode-display_width (>= 1.6, < 3.0)
+ tty-screen (0.8.1)
tzinfo (2.0.4)
concurrent-ruby (~> 1.0)
uber (0.1.0)
@@ -529,6 +541,7 @@ DEPENDENCIES
capistrano (~> 2.15.0, < 3.0.0)
capybara (~> 3.37.1)
charlock_holmes (~> 0.7.7)
+ csv_to_sqlite3
dalli (~> 3.2.2)
exception_notification (~> 4.5.0)
factory_bot_rails (~> 6.2.0)
@@ -602,3 +615,6 @@ DEPENDENCIES
xapian-full-alaveteli (~> 1.4.18.1)
xml-simple (~> 1.1.9)
zip_tricks (~> 5.6.0)
+
+BUNDLED WITH
+ 2.1.4
diff --git a/app/controllers/attachments_controller.rb b/app/controllers/attachments_controller.rb
index c64174f82..a3524ca03 100644
--- a/app/controllers/attachments_controller.rb
+++ b/app/controllers/attachments_controller.rb
@@ -13,7 +13,7 @@ class AttachmentsController < ApplicationController
include ProminenceHeaders
- around_action :cache_attachments
+ around_action :cache_attachments, except: :show_as_sqlite
before_action :authenticate_attachment
before_action :authenticate_attachment_as_html, only: :show_as_html
@@ -36,6 +36,29 @@ def show
render body: body, content_type: content_type
end
+ def show_as_sqlite
+ raise StandardError unless content_type == 'text/csv'
+
+ body = @incoming_message.apply_masks(
+ @attachment.default_body,
+ @attachment.content_type
+ )
+
+ CsvToSqlite::Database.class_eval do
+ def target_dir
+ return './spec/src/' if ENV.key? "test"
+ './tmp'
+ end
+ end
+
+ tmp = "./tmp/#{params[:file_name]}"
+ File.write(tmp, body)
+ CsvToSqlite::CsvToSQLite.new([tmp]).run
+ sql_body = File.read('./tmp/data.sqlite3')
+
+ render body: sql_body, content_type: 'application/octet-stream'
+ end
+
def show_as_html
# images made during conversion (e.g. images in PDF files) are put in the
# cache directory, so the same cache code in cache_attachments above will
diff --git a/config/routes.rb b/config/routes.rb
index ed94afd31..7f5aa1471 100644
--- a/config/routes.rb
+++ b/config/routes.rb
@@ -110,6 +110,10 @@
:as => :similar_request,
:via => :get
+ match '/request/:id/response/:incoming_message_id/attach/sqlite/:part/*file_name' => 'attachments#show_as_sqlite',
+ :format => false,
+ :via => :get
+
match '/request/:id/response/:incoming_message_id/attach/html/:part/*file_name' => 'attachments#show_as_html',
:format => false,
:as => :get_attachment_as_html,
commit cfa3bfcd2022b45f2bfe76e69a332957e689f762
Author: Gareth Rees <gareth@>
Date: Wed May 4 19:33:02 2022 +0100
Use correct mime type for cached CSV attachments
Currently they fall back to application/octet-stream in
AttachmentsController#content_type
diff --git a/lib/alaveteli_file_types.rb b/lib/alaveteli_file_types.rb
index ae23ffc9f..744270254 100644
--- a/lib/alaveteli_file_types.rb
+++ b/lib/alaveteli_file_types.rb
@@ -2,6 +2,7 @@ class AlaveteliFileTypes
# To add an image, create a file with appropriate name corresponding to the
# mime type in app/assets/images/content_type/ e.g. icon_image_tiff_large.png
FileExtensionToMimeType = {
+ 'csv' => 'text/csv',
"txt" => 'text/plain',
"pdf" => 'application/pdf',
"rtf" => 'application/rtf',
I just tried two WDTK CSVs (example) and both resulted in a Datasette error:
Traceback (most recent call last):
File "/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception
File "/lib/python3.10/asyncio/tasks.py", line 234, in __step
result = coro.throw(exc)
File "/lib/python3.10/site-packages/_pyodide/_base.py", line 500, in eval_code_async
await CodeRunner(
File "/lib/python3.10/site-packages/_pyodide/_base.py", line 353, in run_async
await coroutine
File "<exec>", line 53, in <module>
File "/lib/python3.10/site-packages/pyodide/http.py", line 231, in pyfetch
url, await _jsfetch(url, to_js(kwargs, dict_converter=Object.fromEntries))
File "/lib/python3.10/asyncio/futures.py", line 284, in __await__
yield self # This tells Task to wait for completion.
File "/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
future.result()
File "/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception
JsException: TypeError: Failed to fetch
Perhaps we need to apply cfa3bfcd2022b45f2bfe76e69a332957e689f762 from the diff above to make this work.
Still getting the failure even after #7239.
Looking at the previous example WDTK attachment, it may not be parsable as CSV – they've just saved an XLS as CSV.
This example ought to work, but still fails with the same error.
Not sure if there's anything weird going on with how we're responding to the HTTP request that would cause the failure?
$ curl -I https://www.whatdotheyknow.com/request/835632/response/2024608/attach/2/FOI%20Disposals%2025%20Feb%202022.orig%201.csv
HTTP/2 200
server: nginx
date: Thu, 08 Sep 2022 08:27:22 GMT
content-type: text/csv
content-length: 40049
last-modified: Thu, 08 Sep 2022 08:23:40 GMT
etag: "9c71-5e82625980b7c"
x-url: /request/835632/response/2024608/attach/2/FOI%20Disposals%2025%20Feb%202022.orig%201.csv
age: 0
accept-ranges: bytes
Not sure if there's anything weird going on with how we're responding to the HTTP request that would cause the failure?
We don't appear to be passing the header access-control-allow-origin: *
- which is required here. I assume we'll need a sysadmin ticket?
We don't appear to be passing the header
access-control-allow-origin: *
Doh, RTFM Gareth. AIUI we can do that in Rails on a per-action basis. Don't know if there are security implications.
Updated prevalence of file types (according to searches of the format https://www.whatdotheyknow.com/search/filetype:EXTENSION/all
)
I can't remember the exact method, but here I'm just scrolling to the last page of results and seeing what the count ends up as. This may affect the values, as the count on the first page doesn't always match that on the last. Of course, this is the count of requests that have an attachment of that type, so requests with multiple attachments of one type will only count for one, and a request with attachments of multiple types will count as one for each type present.
Extension | Count (about…) | Note | Example | Current Icon | Suggested icon improvement | Previewable? | Preview suggestion |
---|---|---|---|---|---|---|---|
csv |
160 (+112) | I think these might sometimes end up as .csv.txt , but I'm not sure how to find those… (see https://github.com/mysociety/alaveteli/issues/6480) |
This response to this request | (source) | Change to like that for .xls , but with CSV instead of X |
No | (insert suggestion here) |
txt |
8600 (+600) | This mostly seems to duplicate the request email | - | (source) | Perhaps add TXT ? |
Yes/no (please select) | (insert suggestion here) |
xls |
10021 (+321) | - | - | (source) | I don't think we need to distinguish between .xls and .xlsx |
Yes/no (please select) | (insert suggestion here) |
xlsx |
19000 (+2000) | - | - | (source) | - | Yes/no (please select) | (insert suggestion here) |
doc |
72000 (+5000) | - | - | (source) | - | Yes/no (please select) | (insert suggestion here) |
docx |
36000 (+5000) | - | - | (source) | - | Yes/no (please select) | (insert suggestion here) |
pdf |
250000 (+30000) | - | - | (source) | - | Yes/no (please select) | (insert suggestion here) |
xml |
24 (+0) | - | This unit diagram file (almost 9 MB) in response to this request or this one in some unknown format in response to this other request | (source) | Perhaps make it like that for .html , but put XML below the angled brackets, rather than HTML |
Yes/no (please select) | (insert suggestion here) |
jpg |
50000 (+10000) | - | - | (source) | Include the extension below the mountains? | Yes/no (please select) | (insert suggestion here) |
png |
40000 (+11000) | - | - | (source) | Include the extension below the mountains? | Yes/no (please select) | (insert suggestion here) |
switch |
235 (+44) | Our favourite! | - | (source) | An icon with a padlocked envelope? Or a frowning face(!) | Yes/no (please select) | (insert suggestion here) |
mp3 |
10 (+0) | - | This recording from this request | (source) | Soundwave? | Yes/no (please select) | (insert suggestion here) |
mp4 |
1 (+0) | - | This 8MB video demonstrating how to update the TTK app (railway ticketing) on a smartphone in response to this request to LNER | (source) | Camcorder icon, video casette or a clapperboard? | Yes/no (please select) | (insert suggestion here) |
wav |
5 (+2) | - | This announcement fragment released in response to this request to SE Trains Limited | (source) | Soundwave? | Yes/no (please select) | (insert suggestion here) |
zip |
2006 (+106) | - | This ZIP (7.6MB) released in response to the same request as above | (source) | Indicate with text ZIP |
Yes/no (please select) | (insert suggestion here) |
gz |
1 (+0) | - | This file in response to this request for web server access logs | (source) | Like .zip , but with text GZ ? |
Yes/no (please select) | (insert suggestion here) |
html |
4983 (+483) | - | - | (source) | Yes/no (please select) | (insert suggestion here) | |
tif |
348 (+6) | - | This diagram of a concourse in Gateshead released in response to this request | (source) | Yes/no (please select) | (insert suggestion here) | |
odp |
3 (+2) | - | This governance management structure diagram from this request | (source) | Like .ppt /.pptx but with text ODP |
Yes/no (please select) | (insert suggestion here) |
odt |
558 (+198) | - | - | (source) | Like .doc /.docx with ODT instead of W ? |
Yes/no (please select) | (insert suggestion here) |
ods |
56 (+7) | - | - | (source) | Like .xls /.xlsx with ODS instead of X ? |
Yes/no (please select) | (insert suggestion here) |
rar |
4 (+0) | - | This file released in response to this request to the ICO | (source) | Like with .zip , but with RAR text |
Yes/no (please select) | (insert suggestion here) |
ppt |
245 (+3) | - | - | (source) | Add text PPT |
Yes/no (please select) | (insert suggestion here) |
pptx |
332 (+37) | - | - | (source) | Add text PPT |
Yes/no (please select) | (insert suggestion here) |
rtf |
1649 (+249) | - | This request | (source) | Should be text lines? Thought I saw that icon in the code… | Yes/no (please select) | (insert suggestion here) |
pages |
1 (+0) | - | This request | (source) | Should be text lines, perhaps. | No | Neither Google nor Microsoft previewers work, though I notice if I try to open it, I can access a preview image () |
Just noting a conversation on Slack, regarding a new stats script.
+1 csv files are likely to become more common following recent ICO guidance to authorities.
Originally suggested for Open Document Formats:
This may not be worth doing, since I'm not sure how often it happens, but it has happened at least once (although that request was, unfortunately, a SAR, and has been hidden).
.odt
,.ods
…Such files are recommended by Government.
SEO: view as HTML