Closed by simonw 5 years ago
This could also be handled by a plugin.
Possible plugin direction:
import filetype  # https://pypi.org/project/filetype/
from datasette import hookimpl


@hookimpl(trylast=True)
def render_cell(value):
    # Only handle binary values; returning None lets other plugins take over
    if isinstance(value, bytes):
        info = repr(value)
        # May still want to truncate this on table view (but not on row page)
        guess = filetype.guess(value)
        if guess is not None:
            # Need jinja2 markup here for \n to display
            info = "Guess: mime={}, extension={}\n\n{}".format(
                guess.mime, guess.extension, info
            )
        return info
    return None
What are some other interesting tricks we can use to make binary data a bit more interesting to look at?
https://martin.varela.fi/2017/09/09/simple-binary-data-visualization/ has some really clever visualization tricks - probably a bit much for this plugin though. See also https://codisec.com/binary-visualization-explained/
https://github.com/tryexceptpass/perceptio is some much simpler code for rendering an image for a binary.
Another cheap trick is the equivalent of the Unix strings command - https://stackoverflow.com/questions/6804582/extract-strings-from-a-binary-file-in-python
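A minimal sketch of that strings trick in Python, assuming "string" means a run of at least four printable ASCII bytes. The function name `extract_strings` and the `min_length` default are my own choices for illustration, not anything from the linked answer:

```python
import re
from typing import List


def extract_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII found in data, like Unix `strings`.

    Assumption: "printable" means bytes 0x20 (space) through 0x7e (tilde).
    """
    # Build a bytes pattern such as rb"[\x20-\x7e]{4,}"
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


# The first 16 bytes of the od dump further down in this thread
print(extract_strings(b"\x00\x00@\x00\x00\x00\x055XTSF\x00\x00\x00\x01"))
# → ['5XTSF']
```

The lone `@` is dropped because it is shorter than `min_length`, which is what keeps the output from filling up with noise.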
This is quite nice:
$ od -c /tmp/Thumb64Segment_11.data | head -n 10
0000000 \0 \0 @ \0 \0 \0 005 5 X T S F \0 \0 \0 001
0000020 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
*
0010000 025 030 . 377 026 032 . 377 027 033 - 377 031 035 0 377
0010020 032 036 1 377 036 5 377 037 ! 8 377 036 " : 377
0010040 036 8 377 $ : 377 ! & ; 377 $ * ? 377
0010060 ' - ? 377 % * < 377 % , > 377 - 3 E 377
0010100 6 ; M 377 : @ O 377 = C R 377 @ G V 377
0010120 @ I X 377 < B Q 377 8 @ N 377 8 @ P 377
0010140 : C T 377 ; C U 377 : C V 377 9 C W 377
Here's a rough Python equivalent: http://code.activestate.com/recipes/579120-data_dumppy-like-the-unix-od-octal-dump-command/
New idea: show essentially this but differentiate the escape sequences in some way. Maybe wrap them in <code> or put the non-escape sequences in bold?
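Here is one way the <code> idea could look, as a rough sketch rather than what the plugin necessarily does. It follows the od -c conventions (named escapes such as \0 and \n, three-digit octal for everything else non-printable). The name `render_bytes_html` is hypothetical, and a real `render_cell` hook would probably also need to mark the result as safe markup so the tags are not escaped:

```python
import html

# Bytes that `od -c` prints as named escapes
NAMED_ESCAPES = {
    0: "\\0", 7: "\\a", 8: "\\b", 9: "\\t",
    10: "\\n", 11: "\\v", 12: "\\f", 13: "\\r",
}


def render_bytes_html(data: bytes) -> str:
    """Render bytes od -c style, wrapping escape sequences in <code> tags."""
    parts = []
    for byte in data:
        if 0x20 <= byte <= 0x7E:
            # Printable ASCII: show the character itself, HTML-escaped
            parts.append(html.escape(chr(byte)))
        elif byte in NAMED_ESCAPES:
            parts.append("<code>{}</code>".format(NAMED_ESCAPES[byte]))
        else:
            # Everything else: three-digit octal, as od -c does
            parts.append("<code>{:03o}</code>".format(byte))
    return " ".join(parts)


print(render_bytes_html(b"\x00@\x055<\xff"))
# → <code>\0</code> @ <code>005</code> 5 &lt; <code>377</code>
```

Swapping the `<code>` wrapper for `<strong>` around the printable branch instead would give the "non-escape sequences in bold" variant.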
I'm going to call this datasette-render-binary: https://github.com/simonw/datasette-render-binary
Shipped 0.1 of the plugin! I'm pretty happy with this display format:
If you don't mind calling out to Java, then Apache Tika is able to tell you what a load of "binary stuff" is, plus render it to XHTML where possible.
There's a Python wrapper around the Apache Tika server, but for a more typical Datasette use case you'd probably just want to grab the Tika CLI jar and call it with --detect and/or --xhtml to process the unknown binary blob.
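For illustration, calling the Tika CLI jar from Python might look something like the sketch below. The jar location (`tika-app.jar`) and the helper names `tika_command` and `run_tika` are assumptions made for the example; it also assumes `java` is on the PATH, and it writes the blob to a temporary file rather than relying on any particular stdin behaviour:

```python
import os
import subprocess
import tempfile

TIKA_JAR = "tika-app.jar"  # assumption: path to a downloaded Tika CLI jar


def tika_command(path, mode="--detect", jar=TIKA_JAR):
    """Build the argument list for one Tika CLI invocation."""
    if mode not in ("--detect", "--xhtml"):
        raise ValueError("mode must be --detect or --xhtml")
    return ["java", "-jar", jar, mode, path]


def run_tika(blob: bytes, mode="--detect", jar=TIKA_JAR) -> str:
    """Write blob to a temporary file, run Tika on it, return its stdout."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # No file extension, so Tika has to go on the content alone
        path = os.path.join(tmpdir, "blob")
        with open(path, "wb") as f:
            f.write(blob)
        result = subprocess.run(
            tika_command(path, mode, jar), capture_output=True, check=True
        )
    return result.stdout.decode("utf-8", errors="replace").strip()
```

Starting a JVM per blob is slow, so anything built on this would want to cache results, for example keyed on a hash of the blob.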
Calling out to Tika does make me a little nervous, but that's why Datasette has plugins! A plugin that calls Tika (and caches the results) could be really interesting.
In #442 we suppressed rendering of binary data:
It turns out there is one use-case where displaying binary data is useful: when you're poking around looking at random SQLite databases you find in ~/Library trying to figure out what they are for.
So, a mechanism for opting in to ugly display of binary data again would be useful.