mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Improve document conversion #18

Closed sebbacon closed 10 years ago

sebbacon commented 13 years ago

See https://github.com/sebbacon/alaveteli/wiki/Improved-document-conversion

Covers:

Not urgent but would make things look and feel smoother

sebbacon commented 13 years ago

Some possible mitigation strategies for current approach, current bugs etc:

sebbacon commented 13 years ago

Install a more recent poppler-utils: e.g. 0.12.0 can definitely convert this to HTML, extracting the images: http://www.whatdotheyknow.com/request/13903/response/36117/attach/html/4/FOI%20beaver%20site%20species%20audit%20SNH%20review%20of%20proposal%20redact.pdf.html

We really need a "pdftk -nodrm" to remove compression from encrypted PDFs, so that emails get stripped from e.g. http://www.whatdotheyknow.com/request/14414/response/38590/attach/html/3/090807%20FOI.pdf.html

... this misses a whole page out (someone emailed us) http://www.whatdotheyknow.com/request/unredacted_expense_claims_for_jo#incoming-49674

sebbacon commented 13 years ago

Worth doing View as HTML ourselves for .docx, .ppt, .tif (covered now by Google Docs)

View as HTML for .txt requested

sebbacon commented 13 years ago

Failed to detect that attachments are emails and decode them: http://www.whatdotheyknow.com/request/malicious_communication_act#incoming-12964

sebbacon commented 13 years ago

When indexing .docx do you need to index docProps/custom.xml and docProps/app.xml as well as word/document.xml ? (thread on xapian-discuss does so)
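
To make the question above concrete, here is a minimal sketch of pulling indexable text out of those three parts of a .docx (which is a zip archive). It is written in Python purely for illustration and is not Alaveteli's indexing code; the function name `docx_index_text` and the `DOCX_PARTS` tuple are hypothetical. Whether the two docProps parts are worth indexing is exactly the open question, so the part list is a parameter.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# The three parts named in the comment above. Trim this tuple if the
# docProps metadata turns out not to be worth indexing.
DOCX_PARTS = ("word/document.xml", "docProps/app.xml", "docProps/custom.xml")


def docx_index_text(data: bytes, parts=DOCX_PARTS) -> dict:
    """Return {part name: text} for each listed part present in the archive."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        for part in parts:
            if part not in names:
                continue  # e.g. docProps/custom.xml is often absent
            root = ET.fromstring(archive.read(part))
            # itertext() walks every text node, whatever its namespace.
            # Limitation: text nodes are joined with spaces, so a word
            # split across two runs comes out as two tokens.
            chunks = (text.strip() for text in root.itertext())
            out[part] = " ".join(chunk for chunk in chunks if chunk)
    return out
```

Returning the text per part, rather than one concatenated string, leaves it open to weight or skip the metadata parts when feeding the indexer.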

sebbacon commented 13 years ago

Consider using odt2txt or unoconv http://www-verimag.imag.fr/~moy/opendocument/

sebbacon commented 13 years ago

VSD files: vsdump. Example in zip file http://www.whatdotheyknow.com/request/dog_control_orders#incoming-3510, doing "file" on RESPONSE/Internal documents/Briefing with Contact Islington/Contact Islington Flowchart Jul 08.vsd for the content type

sebbacon commented 13 years ago

Search for other file extensions that we have now and look for ones we could and should be indexing (call IncomingMessage.find_all_unknown_mime_types to find them; it needs updating to work in clumps, as all requests won't load into RAM at once now)
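
The "in clumps" idea could look roughly like the sketch below: fetch a bounded batch, tally it, drop it, and move on, so memory use depends on the clump size rather than the total number of requests. This is Python for illustration only, not the real IncomingMessage.find_all_unknown_mime_types; `count_unknown_types`, `fetch` and `lookup` are hypothetical names, with `fetch(offset, limit)` standing in for a paged database query that returns attachment filenames.

```python
from collections import Counter
from typing import Callable, List, Optional


def count_unknown_types(
    fetch: Callable[[int, int], List[str]],
    lookup: Callable[[str], Optional[str]],
    size: int = 1000,
) -> Counter:
    """Tally file extensions for which `lookup` finds no MIME type.

    Only one clump of at most `size` filenames is held in memory at a time.
    """
    unknown = Counter()
    offset = 0
    while True:
        clump = fetch(offset, size)
        if not clump:
            break  # ran off the end of the data
        for name in clump:
            if lookup(name) is None:
                # Files without a dot are counted under the empty extension.
                ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
                unknown[ext] += 1
        offset += len(clump)
    return unknown
```

Sorting the resulting counter by frequency would show which unknown extensions are common enough to be worth indexing first.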

sebbacon commented 13 years ago

Render HTML alternative rather than text (so tables look good) e.g.: http://www.whatdotheyknow.com/request/parking_policy

sebbacon commented 13 years ago

These attachment.bin files should come out as winmail.dat and be parsed by existing TNEF code. For some reason though TMail doesn't get the right content-type out of them. Not sure why. http://www.whatdotheyknow.com/request/acting_up_in_a_higher_rank
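
One possible workaround, sketched here as an assumption rather than a diagnosis of the TMail problem, is to sniff the payload instead of trusting the declared content-type: TNEF streams start with the 32-bit signature 0x223E9F78, stored little-endian. The sketch is Python for illustration only; `looks_like_tnef` and `effective_attachment` are hypothetical helpers, not part of Alaveteli or TMail.

```python
# TNEF signature 0x223E9F78 as it appears on disk (little-endian bytes).
TNEF_SIGNATURE = bytes.fromhex("789f3e22")


def looks_like_tnef(payload: bytes) -> bool:
    """True if the payload starts with the TNEF magic number."""
    return payload[:4] == TNEF_SIGNATURE


def effective_attachment(filename: str, content_type: str, payload: bytes):
    """Return the (filename, content_type) to use for an attachment.

    If the bytes are TNEF, override whatever the mail parser reported so
    the attachment is routed to the TNEF decoder.
    """
    if looks_like_tnef(payload):
        return "winmail.dat", "application/ms-tnef"
    return filename, content_type
```

Content sniffing only papers over the symptom; it would still be worth finding out why the content-type is lost in the first place.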

sebbacon commented 13 years ago

Make HTML attachments have view as HTML :) http://www.whatdotheyknow.com/request/enforced_medication#incoming-7395

sebbacon commented 13 years ago

Knackered view as HTML: http://www.whatdotheyknow.com/request/1385/response/5483/attach/html/3/Response%20465.2008.pdf.html

sebbacon commented 13 years ago

Some other pdftohtml bugs (fix them or file about them):

 http://www.whatdotheyknow.com/request/sale_of_public_land#incoming-8146
 http://www.whatdotheyknow.com/request/childrens_database_compliance_wi#incoming-8088
 http://www.whatdotheyknow.com/request/3326/response/7701/attach/html/2/Scan001.PDF.pdf.html
 http://www.whatdotheyknow.com/request/risk_log#incoming-8090 (bad tables)
 http://www.whatdotheyknow.com/request/4635/response/11248/attach/html/4/FOI%20request.pdf.html (bad table)

Orientation wrong: http://www.whatdotheyknow.com/request/3153/response/7726/attach/html/2/258850.pdf.html

Bug in wvHtml, segfaults when converting this: http://www.whatdotheyknow.com/request/subject_access_request_guide_sar#incoming-10242

Images aren't coming out here http://www.whatdotheyknow.com/request/33682/response/83455/attach/html/3/100428%20Reply%201519%2010.doc.html

Doesn't correctly detect the doc type of a few garbage results in this list: http://www.whatdotheyknow.com/search/UWE

sebbacon commented 13 years ago

Multi-page .tif files are hard for people to view; consider automatically separating out the pages as separate links (to .png files or whatever): http://www.whatdotheyknow.com/request/windsor_maidenhead_council_commo#incoming-1910

Heck, may as well give thumbnails of all images, indeed all docs, while you're at it :)

hsenag commented 12 years ago

Another knackered HTML conversion: http://www.whatdotheyknow.com/request/registered_pharmacists_prescribi#incoming-245446

hsenag commented 12 years ago

Just to emphasise that tables that really need the HTML alternative are quite common, e.g.:

 http://www.whatdotheyknow.com/request/it_support_services_1295#incoming-258044
 http://www.whatdotheyknow.com/request/it_support_services_347#incoming-258014
 http://www.whatdotheyknow.com/request/it_support_services_1236#incoming-257000

TomSteinberg commented 10 years ago

Replaced by #1529 #1528 #1527 - if we've missed any substantive issues please make new specific tickets @hsenag