Closed sebbacon closed 10 years ago
Some possible mitigation strategies for current approach, current bugs etc:
Install more recent poppler-utils, e.g. 0.12.0 can definitely convert this to HTML, extracting the images: http://www.whatdotheyknow.com/request/13903/response/36117/attach/html/4/FOI%20beaver%20site%20species%20audit%20SNH%20review%20of%20proposal%20redact.pdf.html
Really need a "pdftk -nodrm" to remove compression from encrypted PDFs, so that it strips emails from e.g. http://www.whatdotheyknow.com/request/14414/response/38590/attach/html/3/090807%20FOI.pdf.html
... this misses a whole page out (someone emailed us) http://www.whatdotheyknow.com/request/unredacted_expense_claims_for_jo#incoming-49674
Worth doing View as HTML ourselves for .docx, .ppt, .tif (covered now by Google Docs)
View as HTML for .txt requested
Failed to detect attachments are emails and decode them: http://www.whatdotheyknow.com/request/malicious_communication_act#incoming-12964
When indexing .docx do you need to index docProps/custom.xml and docProps/app.xml as well as word/document.xml ? (thread on xapian-discuss does so)
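A minimal sketch of what indexing all three parts could look like, in Python rather than the app's own code. It assumes the .docx is a plain zip archive containing the parts named above; the helper name `docx_index_text` is hypothetical. Text nodes are joined with spaces, so a word split across runs may be broken up; a real indexer would probably want to join within paragraphs instead.

```python
import io
import zipfile
from xml.etree import ElementTree

# Parts named in the note above; app.xml and custom.xml may be absent.
DOCX_PARTS = ("word/document.xml", "docProps/app.xml", "docProps/custom.xml")


def docx_index_text(data: bytes) -> str:
    """Return whitespace-normalised text from the listed parts of a .docx."""
    chunks = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        for part in DOCX_PARTS:
            if part not in names:
                continue  # optional part missing, skip it
            root = ElementTree.fromstring(archive.read(part))
            # itertext() yields every text node, ignoring the tags.
            chunks.append(" ".join(root.itertext()))
    # Collapse runs of whitespace left over from the XML layout.
    return " ".join(" ".join(chunks).split())
```

The returned string could then be handed to whatever feeds the search index.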
Consider using odt2txt or unoconv http://www-verimag.imag.fr/~moy/opendocument/
VSD files: vsdump. Example in zip file http://www.whatdotheyknow.com/request/dog_control_orders#incoming-3510 doing file "RESPONSE/Internal documents/Briefing with Contact Islington/Contact Islington Flowchart Jul 08.vsd" content type
Search for other file extensions that we have now and look for ones we could and should be indexing (call IncomingMessage.find_all_unknown_mime_types to find them; it needs updating to work in clumps, as all requests won't load in RAM now)
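The "in clumps" idea can be sketched generically as below. This is an illustration in Python, not the existing IncomingMessage code: `in_clumps` and `count_unknown_extensions` are hypothetical names, and the input is assumed to be any iterable of attachment filenames that can be streamed. In the Rails app itself, ActiveRecord's batch finders would probably be the natural way to get the same effect.

```python
import os
from collections import Counter
from itertools import islice


def in_clumps(items, size):
    """Yield lists of at most `size` items without loading all of `items`."""
    iterator = iter(items)
    while True:
        clump = list(islice(iterator, size))
        if not clump:
            return
        yield clump


def count_unknown_extensions(filenames, known, clump_size=1000):
    """Count extensions not in `known`, holding one clump in memory at a time."""
    counts = Counter()
    for clump in in_clumps(filenames, clump_size):
        for name in clump:
            ext = os.path.splitext(name)[1].lower()
            if ext and ext not in known:
                counts[ext] += 1
    return counts
```

Only one clump is held in memory at once, so the total number of requests no longer matters.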
Render HTML alternative rather than text (so tables look good) e.g.: http://www.whatdotheyknow.com/request/parking_policy
These attachment.bin files should come out as winmail.dat and be parsed by existing TNEF code. For some reason though TMail doesn't get the right content-type out of them. Not sure why. http://www.whatdotheyknow.com/request/acting_up_in_a_higher_rank
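One possible workaround, sketched in Python: ignore the declared content type and sniff the first four bytes for the TNEF signature. This assumes the attachment.bin files really are TNEF streams, which start with the 32-bit value 0x223E9F78 stored little-endian; the function names are hypothetical and this is not how TMail or the existing TNEF code works.

```python
import struct

# TNEF streams begin with this 32-bit signature, stored little-endian.
TNEF_SIGNATURE = 0x223E9F78


def looks_like_tnef(data: bytes) -> bool:
    """Return True if `data` starts with the TNEF signature."""
    if len(data) < 4:
        return False
    (signature,) = struct.unpack("<I", data[:4])
    return signature == TNEF_SIGNATURE


def effective_filename(filename: str, data: bytes) -> str:
    """Treat a mislabelled attachment.bin as winmail.dat when it sniffs as TNEF."""
    if filename.lower() == "attachment.bin" and looks_like_tnef(data):
        return "winmail.dat"
    return filename
```

Files renamed this way could then be passed to the existing TNEF parsing path.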
Make HTML attachments have view as HTML :) http://www.whatdotheyknow.com/request/enforced_medication#incoming-7395
Some other pdftohtml bugs (fix them or file about them):
http://www.whatdotheyknow.com/request/sale_of_public_land#incoming-8146
http://www.whatdotheyknow.com/request/childrens_database_compliance_wi#incoming-8088
http://www.whatdotheyknow.com/request/3326/response/7701/attach/html/2/Scan001.PDF.pdf.html
http://www.whatdotheyknow.com/request/risk_log#incoming-8090 (bad tables)
http://www.whatdotheyknow.com/request/4635/response/11248/attach/html/4/FOI%20request.pdf.html (bad table)
Orientation wrong: http://www.whatdotheyknow.com/request/3153/response/7726/attach/html/2/258850.pdf.html
Bug in wvHtml, segfaults when converting this: http://www.whatdotheyknow.com/request/subject_access_request_guide_sar#incoming-10242
Images aren't coming out here http://www.whatdotheyknow.com/request/33682/response/83455/attach/html/3/100428%20Reply%201519%2010.doc.html
Doesn't correctly detect the doc type of a few garbage results in this list: http://www.whatdotheyknow.com/search/UWE
.tif files are hard for people to view as multi page, consider automatically separating out the pages as separate links (to .png files or whatever) http://www.whatdotheyknow.com/request/windsor_maidenhead_council_commo#incoming-1910 Heck, may as well give thumbnails of all images, indeed all docs while you're at it :)
Another knackered HTML conversion: http://www.whatdotheyknow.com/request/registered_pharmacists_prescribi#incoming-245446
Just to emphasise that tables that really need the HTML alternative are quite common, e.g.:
http://www.whatdotheyknow.com/request/it_support_services_1295#incoming-258044
http://www.whatdotheyknow.com/request/it_support_services_347#incoming-258014
http://www.whatdotheyknow.com/request/it_support_services_1236#incoming-257000
Replaced by #1529 #1528 #1527 - if we've missed any substantive issues please make new specific tickets @hsenag
See https://github.com/sebbacon/alaveteli/wiki/Improved-document-conversion
Covers:
Not urgent but would make things look and feel smoother