mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org
Other
387 stars 195 forks source link

Generate HTML previews for more filetypes, improve icons #7027

Open WilliamWDTK opened 2 years ago

WilliamWDTK commented 2 years ago

Originally suggested for Open Document Formats:

This may not be worth doing, since I'm not sure how often it happens, but it has happened at least once (although that request was, unfortunately, a SAR, and has been hidden).

.odt, .ods

Such files are recommended by Government.

SEO: view as HTML

garethrees commented 2 years ago

Might be something that we can render through the Google Docs viewer.

https://github.com/mysociety/alaveteli/blob/0.40.0.0/app/models/foi_attachment.rb#L257-L274

WilliamWDTK commented 2 years ago

Also the current icon on the WDTK view is just a question mark:

(from https://www.whatdotheyknow.com/request/the_failures_of_dumfries_and_gal#incoming-1501531)

rather than something like the Word icon:

FOIMonkey commented 2 years ago

+1 Most City of London Police responses are .odt files and it's a pain to have to download them every time

FOIMonkey commented 2 years ago

Also North Lincs Council.

garethrees commented 2 years ago

Google Docs viewer doesn't work, but Microsoft's Office Live View does.

Adding this should be straightforward:

  1. Add an AttachmentToHTML::Adapters::OfficeLiveViewViewer or whatever we want to call it
  2. Add FoiAttachment#has_office_live_view_viewer? with the correct content type mappings
  3. Add a conditional to AttachmentToHTML#fallback_adapter_for to wire up the adapter.
WilliamWDTK commented 2 years ago

Microsoft's Live View also seems nicer for XLSX files:

Google vs Microsoft

(from this request)

Google handles .csv like this (not ideal), Microsoft doesn't seem to. (From here).

I've taken the liberty of changing the title, which I hope won't be a problem…

WilliamWDTK commented 2 years ago

For CSV, there's also things like this client-side JS viewer.

WilliamWDTK commented 2 years ago

Prevalence of file types (according to searches of the format https://www.whatdotheyknow.com/search/filetype:EXTENSION/all)

Extension Count (about…) Note Example Current Icon Suggested icon improvement Previewable? Preview suggestion
csv 48 I think these might sometimes end up as .csv.txt, but I'm not sure how to find those… (see https://github.com/mysociety/alaveteli/issues/6480) This response to this request Question mark icon (source) Change to like that for .xls, but with CSV instead of X No (insert suggestion here)
txt 8000 This mostly seems to duplicate the request email - Lines of text icon (source) Perhaps add TXT? Yes/no (please select) (insert suggestion here)
xls 9700 - - Spreadsheet icon with X (source) I don't think we need to distinguish between .xls and .xlsx Yes/no (please select) (insert suggestion here)
xlsx 17000 - - Spreadsheet icon with X (source) - Yes/no (please select) (insert suggestion here)
doc 67000 - - Lines of text icon with W (source) - Yes/no (please select) (insert suggestion here)
docx 31000 - - Lines of text icon with W (source) - Yes/no (please select) (insert suggestion here)
pdf 220000 - - Lines of text icon with PDF (source) - Yes/no (please select) (insert suggestion here)
xml 24 - This unit diagram file (almost 9 MB) in response to this request or this one in some unknown format in response to this other request Question mark icon (source) Perhaps make it like that for .html, but put XML below the angled brackets, rather than HTML Yes/no (please select) (insert suggestion here)
jpg 40000 - - Mountain with sun icon (source) Include the extension below the mountains? Yes/no (please select) (insert suggestion here)
png 29000 - - Mountain with sun icon (source) Include the extension below the mountains? Yes/no (please select) (insert suggestion here)
switch 191 Our favourite! - Binary digits icon (1101 0010 1011) (source) An icon with a padlocked envelope? Or a frowning face(!) Yes/no (please select) (insert suggestion here)
mp3 10 - This recording from this request Question mark icon (source) Soundwave? Yes/no (please select) (insert suggestion here)
mp4 1 - This 8MB video demonstrating how to update the TTK app (railway ticketing) on a smartphone in response to this request to LNER Question mark icon (source) Camcorder icon, video casette or a clapperboard? Yes/no (please select) (insert suggestion here)
wav 3 - This announcement fragment released in response to this request to SE Trains Limited Question mark icon (source) Soundwave? Yes/no (please select) (insert suggestion here)
zip 1900 - This ZIP (7.6MB) released in response to the same request as above Checquer representing zip teeth icon (source) Indicate with text ZIP Yes/no (please select) (insert suggestion here)
gz 1 - This file in response to this request for web server access logs Question mark icon (source) Like .zip, but with text GZ? Yes/no (please select) (insert suggestion here)
html 4500 - - Angled bracket with HTML icon (source) Yes/no (please select) (insert suggestion here)
tif 342 - This diagram of a concourse in Gateshead released in response to this request Mountain with sun icon with TIFF (source) Yes/no (please select) (insert suggestion here)
odp 1 - This governance management structure diagram from this request Question mark icon (source) Like .ppt/.pptx but with text ODP Yes/no (please select) (insert suggestion here)
odt 360 - - Question mark icon (source) Like .doc/.docx with ODT instead of W? Yes/no (please select) (insert suggestion here)
ods 49 - - Question mark icon (source) Like .xls/.xlsx with ODS instead of X? Yes/no (please select) (insert suggestion here)
rar 4 - This file released in response to this request to the ICO Question mark icon (source) Like with .zip, but with RAR text Yes/no (please select) (insert suggestion here)
ppt 242 - - Slide screen and stand with play symbol icon (source) Add text PPT Yes/no (please select) (insert suggestion here)
pptx 295 - - Slide screen and stand with play symbol icon (source) Add text PPT Yes/no (please select) (insert suggestion here)
rtf 1400 - This request Question mark icon (source) Should be text lines? Thought I saw that icon in the code… Yes/no (please select) (insert suggestion here)
pages 1 - This request Question mark icon (source) Should be text lines, perhaps. No Neither Google nor Microsoft previewers work, though I notice if I try to open it, I can access a preview image (preview)
FOIMonkey commented 2 years ago

@WilliamWDTK That table is fascinating, thanks for putting it together. It's interesting to see that we're closing in on 250,000 PDFs.

There are also some other extension types that you can't use the search to find (mainly shapefiles) eg https://www.whatdotheyknow.com/request/school_catchment_areas_2#incoming-619600 or https://www.whatdotheyknow.com/request/public_rights_of_way_data_4#incoming-2023061

mdeuk commented 2 years ago

Google Docs viewer doesn't work, but Microsoft's Office Live View does.

Adding this should be straightforward:

1. Add an `AttachmentToHTML::Adapters::OfficeLiveViewViewer` or whatever we want to call it

2. Add `FoiAttachment#has_office_live_view_viewer?` with the correct content type mappings

3. Add a conditional to `AttachmentToHTML#fallback_adapter_for` to wire up the adapter.

We'll probably want a WDTK theme ticket here for any privacy notice changes - as we'd be presenting this in a WDTK branded window, so we need to make the user aware of the cookies being set by this service.

A workaround could perhaps be to simply link directly to it, rather than wrapping around an iframe - but that might be a bit clunky. Worth noting, perhaps, that we likely should have a similar disclosure for Google Docs viewer - which sets cookies too - and like the Microsoft variety, doesn't give any visible indication at time of loading.

I've not raised such a ticket yet, as I'm not sure how close we are to this being implemented.

garethrees commented 2 years ago

I've not raised such a ticket yet, as I'm not sure how close we are to this being implemented.

No immediate plans, so this can be created when we start implementing.

WilliamWDTK commented 2 years ago

Not sure if this is the appropriate ticket to raise this on, but for this attachment here, the Google preview has mucked up the formatting, while Microsoft's previewer keeps it nice.

WilliamWDTK commented 2 years ago

.eml and .msg

I think this was raised elsewhere, possibly by @FOIMonkey.


@RichardTaylor mentioned in #3338. Not sure why I thought it was @FOIMonkey.

mdeuk commented 2 years ago

For CSV, there's also things like this client-side JS viewer.

I'd actually quite like us to look at Datasette [^1] - the 'lite' version is entirely browser based, but it might be trivial to spin up an instance of it. It's a powerful tool for data analysis, and given much of the open data we've got in CSVs can be rather bulky, this seems a logical way to present things with minimal overhead.

The product is already used elsewhere in mySociety [^2], so it'd be interesting to look at what it'd take to have Alaveteli serve files here - do we send the right headers / flags^3 out?

[^1]: A recent claim to fame (or infamy!) for this is @simonw's ScotRail random apology generator. [^2]: e.g. https://mysociety.github.io/wdtk_authorities_list/datasets/whatdotheyknow_authorities_dataset/latest

garethrees commented 2 years ago

I'd actually quite like us to look at Datasette

Me too! https://twitter.com/_gareth/status/1521918004690399232. I did spike this out, but at the time Datasette lite didn't accept a CSV URL, only an sqlite URL, so that was annoying. Seems like it would be pretty easy to add now.

commit 3ae66c5825bd3156d24f5230b5cc49e79488800e
Author: Gareth Rees <gareth@>
Date:   Wed May 4 20:09:24 2022 +0100

    Basically working CSV to SQL

    Downloads as "foo.csv", but then renaming to "foo.sqlite" works as
    expected:

        sqlite> select * from foo;
        1|2
        3|4

diff --git a/Gemfile b/Gemfile
index 1e93ef0d6..87876c613 100644
--- a/Gemfile
+++ b/Gemfile
@@ -175,6 +175,8 @@ if rails_upgrade? && RUBY_VERSION < '3.1'
   gem 'uri', '0.10.0'
 end

+gem 'csv_to_sqlite3', require: 'csv_to_sqlite'
+
 group :test do
   gem 'fivemat', '~> 1.3.7'
   gem 'webmock', '~> 3.14.0'
diff --git a/Gemfile.lock b/Gemfile.lock
index 332c72c80..4a3ac6dcb 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -169,6 +169,9 @@ GEM
     crack (0.4.5)
       rexml
     crass (1.0.6)
+    csv_to_sqlite3 (0.1.4)
+      sqlite3 (~> 1.4, >= 1.4.2)
+      tty-progressbar
     daemons (1.4.0)
     dalli (3.2.2)
     dante (0.2.0)
@@ -467,7 +470,9 @@ GEM
       actionpack (>= 5.2)
       activesupport (>= 5.2)
       sprockets (>= 3.0.0)
+    sqlite3 (1.4.2)
     statistics2 (0.54)
+    strings-ansi (0.2.0)
     stripe (5.55.0)
     syslog_protocol (0.9.2)
     text (1.3.1)
@@ -478,6 +483,13 @@ GEM
     thor (1.2.1)
     tilt (2.0.10)
     trailblazer-option (0.1.2)
+    tty-cursor (0.7.1)
+    tty-progressbar (0.18.2)
+      strings-ansi (~> 0.2)
+      tty-cursor (~> 0.7)
+      tty-screen (~> 0.8)
+      unicode-display_width (>= 1.6, < 3.0)
+    tty-screen (0.8.1)
     tzinfo (2.0.4)
       concurrent-ruby (~> 1.0)
     uber (0.1.0)
@@ -529,6 +541,7 @@ DEPENDENCIES
   capistrano (~> 2.15.0, < 3.0.0)
   capybara (~> 3.37.1)
   charlock_holmes (~> 0.7.7)
+  csv_to_sqlite3
   dalli (~> 3.2.2)
   exception_notification (~> 4.5.0)
   factory_bot_rails (~> 6.2.0)
@@ -602,3 +615,6 @@ DEPENDENCIES
   xapian-full-alaveteli (~> 1.4.18.1)
   xml-simple (~> 1.1.9)
   zip_tricks (~> 5.6.0)
+
+BUNDLED WITH
+   2.1.4
diff --git a/app/controllers/attachments_controller.rb b/app/controllers/attachments_controller.rb
index c64174f82..a3524ca03 100644
--- a/app/controllers/attachments_controller.rb
+++ b/app/controllers/attachments_controller.rb
@@ -13,7 +13,7 @@ class AttachmentsController < ApplicationController

   include ProminenceHeaders

-  around_action :cache_attachments
+  around_action :cache_attachments, except: :show_as_sqlite

   before_action :authenticate_attachment
   before_action :authenticate_attachment_as_html, only: :show_as_html
@@ -36,6 +36,29 @@ def show
     render body: body, content_type: content_type
   end

+  def show_as_sqlite
+    raise StandardError unless content_type == 'text/csv'
+
+    body = @incoming_message.apply_masks(
+      @attachment.default_body,
+      @attachment.content_type
+    )
+
+    CsvToSqlite::Database.class_eval do
+      def target_dir
+        return './spec/src/' if ENV.key? "test"
+        './tmp'
+      end
+    end
+
+    tmp = "./tmp/#{params[:file_name]}"
+    File.write(tmp, body)
+    CsvToSqlite::CsvToSQLite.new([tmp]).run
+    sql_body = File.read('./tmp/data.sqlite3')
+
+    render body: sql_body, content_type: 'application/octet-stream'
+  end
+
   def show_as_html
     # images made during conversion (e.g. images in PDF files) are put in the
     # cache directory, so the same cache code in cache_attachments above will
diff --git a/config/routes.rb b/config/routes.rb
index ed94afd31..7f5aa1471 100644
--- a/config/routes.rb
+++ b/config/routes.rb
@@ -110,6 +110,10 @@
         :as => :similar_request,
         :via => :get

+  match '/request/:id/response/:incoming_message_id/attach/sqlite/:part/*file_name' => 'attachments#show_as_sqlite',
+        :format => false,
+        :via => :get
+
   match '/request/:id/response/:incoming_message_id/attach/html/:part/*file_name' => 'attachments#show_as_html',
         :format => false,
         :as => :get_attachment_as_html,

commit cfa3bfcd2022b45f2bfe76e69a332957e689f762
Author: Gareth Rees <gareth@>
Date:   Wed May 4 19:33:02 2022 +0100

    Use correct mime type for cached CSV attachments

    Currently they fall back to application/octet-stream in
    AttachmentsController#content_type

diff --git a/lib/alaveteli_file_types.rb b/lib/alaveteli_file_types.rb
index ae23ffc9f..744270254 100644
--- a/lib/alaveteli_file_types.rb
+++ b/lib/alaveteli_file_types.rb
@@ -2,6 +2,7 @@ class AlaveteliFileTypes
   # To add an image, create a file with appropriate name corresponding to the
   # mime type in app/assets/images/content_type/ e.g. icon_image_tiff_large.png
   FileExtensionToMimeType = {
+    'csv' => 'text/csv',
     "txt" => 'text/plain',
     "pdf" => 'application/pdf',
     "rtf" => 'application/rtf',

I just tried two WDTK CSVs (example) and both resulted in a Datasette error:

Traceback (most recent call last):
  File "/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/lib/python3.10/asyncio/tasks.py", line 234, in __step
    result = coro.throw(exc)
  File "/lib/python3.10/site-packages/_pyodide/_base.py", line 500, in eval_code_async
    await CodeRunner(
  File "/lib/python3.10/site-packages/_pyodide/_base.py", line 353, in run_async
    await coroutine
  File "<exec>", line 53, in <module>
  File "/lib/python3.10/site-packages/pyodide/http.py", line 231, in pyfetch
    url, await _jsfetch(url, to_js(kwargs, dict_converter=Object.fromEntries))
  File "/lib/python3.10/asyncio/futures.py", line 284, in __await__
    yield self  # This tells Task to wait for completion.
  File "/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception
JsException: TypeError: Failed to fetch

Perhaps we need to apply cfa3bfcd2022b45f2bfe76e69a332957e689f762 from the diff above to make this work.

garethrees commented 2 years ago

Still getting the failure even after #7239.

Looking at the previous example WDTK attachment, it may not be parsable as CSV – they've just saved an XLS as CSV.

This example ought to work, but still fails with the same error.

Not sure if there's anything weird going on with how we're responding to the HTTP request that would cause the failure?

$ curl -I https://www.whatdotheyknow.com/request/835632/response/2024608/attach/2/FOI%20Disposals%2025%20Feb%202022.orig%201.csv
HTTP/2 200
server: nginx
date: Thu, 08 Sep 2022 08:27:22 GMT
content-type: text/csv
content-length: 40049
last-modified: Thu, 08 Sep 2022 08:23:40 GMT
etag: "9c71-5e82625980b7c"
x-url: /request/835632/response/2024608/attach/2/FOI%20Disposals%2025%20Feb%202022.orig%201.csv
age: 0
accept-ranges: bytes
mdeuk commented 2 years ago

Not sure if there's anything weird going on with how we're responding to the HTTP request that would cause the failure?

We don't appear to be passing the header access-control-allow-origin: * - which is required here. I assume we'll need a sysadmin ticket?

garethrees commented 2 years ago

We don't appear to be passing the header access-control-allow-origin: *

Doh, RTFM Gareth. AIUI we can do that in Rails on a per-action basis. Don't know if there are security implications.

WilliamWDTK commented 1 year ago

Updated prevalence of file types (according to searches of the format https://www.whatdotheyknow.com/search/filetype:EXTENSION/all)

I can't remember the exact method, but here I'm just scrolling to the last page of results and seeing what the count ends up as. This may affect the values, as the count on the first page doesn't always match that on the last. Of course, this is the count of requests that have an attachment of that type, so requests with multiple attachments of one type will only count for one, and a request with attachments of multiple types will count as one for each type present.

Extension Count (about…) Note Example Current Icon Suggested icon improvement Previewable? Preview suggestion
csv 160 (+112) I think these might sometimes end up as .csv.txt, but I'm not sure how to find those… (see https://github.com/mysociety/alaveteli/issues/6480) This response to this request Question mark icon (source) Change to like that for .xls, but with CSV instead of X No (insert suggestion here)
txt 8600 (+600) This mostly seems to duplicate the request email - Lines of text icon (source) Perhaps add TXT? Yes/no (please select) (insert suggestion here)
xls 10021 (+321) - - Spreadsheet icon with X (source) I don't think we need to distinguish between .xls and .xlsx Yes/no (please select) (insert suggestion here)
xlsx 19000 (+2000) - - Spreadsheet icon with X (source) - Yes/no (please select) (insert suggestion here)
doc 72000 (+5000) - - Lines of text icon with W (source) - Yes/no (please select) (insert suggestion here)
docx 36000 (+5000) - - Lines of text icon with W (source) - Yes/no (please select) (insert suggestion here)
pdf 250000 (+30000) - - Lines of text icon with PDF (source) - Yes/no (please select) (insert suggestion here)
xml 24 (+0) - This unit diagram file (almost 9 MB) in response to this request or this one in some unknown format in response to this other request Question mark icon (source) Perhaps make it like that for .html, but put XML below the angled brackets, rather than HTML Yes/no (please select) (insert suggestion here)
jpg 50000 (+10000) - - Mountain with sun icon (source) Include the extension below the mountains? Yes/no (please select) (insert suggestion here)
png 40000 (+11000) - - Mountain with sun icon (source) Include the extension below the mountains? Yes/no (please select) (insert suggestion here)
switch 235 (+44) Our favourite! - Binary digits icon (1101 0010 1011) (source) An icon with a padlocked envelope? Or a frowning face(!) Yes/no (please select) (insert suggestion here)
mp3 10 (+0) - This recording from this request Question mark icon (source) Soundwave? Yes/no (please select) (insert suggestion here)
mp4 1 (+0) - This 8MB video demonstrating how to update the TTK app (railway ticketing) on a smartphone in response to this request to LNER Question mark icon (source) Camcorder icon, video casette or a clapperboard? Yes/no (please select) (insert suggestion here)
wav 5 (+2) - This announcement fragment released in response to this request to SE Trains Limited Question mark icon (source) Soundwave? Yes/no (please select) (insert suggestion here)
zip 2006 (+106) - This ZIP (7.6MB) released in response to the same request as above Checquer representing zip teeth icon (source) Indicate with text ZIP Yes/no (please select) (insert suggestion here)
gz 1 (+0) - This file in response to this request for web server access logs Question mark icon (source) Like .zip, but with text GZ? Yes/no (please select) (insert suggestion here)
html 4983 (+483) - - Angled bracket with HTML icon (source) Yes/no (please select) (insert suggestion here)
tif 348 (+6) - This diagram of a concourse in Gateshead released in response to this request Mountain with sun icon with TIFF (source) Yes/no (please select) (insert suggestion here)
odp 3 (+2) - This governance management structure diagram from this request Question mark icon (source) Like .ppt/.pptx but with text ODP Yes/no (please select) (insert suggestion here)
odt 558 (+198) - - Question mark icon (source) Like .doc/.docx with ODT instead of W? Yes/no (please select) (insert suggestion here)
ods 56 (+7) - - Question mark icon (source) Like .xls/.xlsx with ODS instead of X? Yes/no (please select) (insert suggestion here)
rar 4 (+0) - This file released in response to this request to the ICO Question mark icon (source) Like with .zip, but with RAR text Yes/no (please select) (insert suggestion here)
ppt 245 (+3) - - Slide screen and stand with play symbol icon (source) Add text PPT Yes/no (please select) (insert suggestion here)
pptx 332 (+37) - - Slide screen and stand with play symbol icon (source) Add text PPT Yes/no (please select) (insert suggestion here)
rtf 1649 (+249) - This request Question mark icon (source) Should be text lines? Thought I saw that icon in the code… Yes/no (please select) (insert suggestion here)
pages 1 (+0) - This request Question mark icon (source) Should be text lines, perhaps. No Neither Google nor Microsoft previewers work, though I notice if I try to open it, I can access a preview image (preview)
mdeuk commented 1 year ago

Just noting a conversation on Slack, regarding a new stats script.

HelenWDTK commented 11 months ago

+1 csv files are likely to become more common following recent ICO guidance to authorities.

garethrees commented 11 months ago

Looks like CSV is previewable in Google Docs Viewer with some degree of success.