simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Option to display binary data #506

Closed simonw closed 5 years ago

simonw commented 5 years ago

In #442 we suppressed rendering of binary data:

[Screenshot: many-photos-tables__RKAlbumVersion_albumId_RidIndex__36_rows]

It turns out there is one use-case where displaying binary data is useful: when you're poking around looking at random SQLite databases you find in ~/Library trying to figure out what they are for.

So, a mechanism for opting in to ugly display of binary data again would be useful.

simonw commented 5 years ago

This could also be handled by a plugin.

simonw commented 5 years ago

Possible plugin direction:

from datasette import hookimpl
import filetype  # https://pypi.org/project/filetype/

@hookimpl(trylast=True)
def render_cell(value):
    if isinstance(value, bytes):
        info = repr(value)
        # May still want to truncate this on table view (but not on row page)
        guess = filetype.guess(value)
        if guess is not None:
            # Need jinja2 markup here for \n to display
            info = "Guess: mime={}, extension={}\n\n{}".format(
                guess.mime, guess.extension, info
            )
        return info

    return None

simonw commented 5 years ago

What are some other interesting tricks we can use to make binary data a bit more interesting to look at?

https://martin.varela.fi/2017/09/09/simple-binary-data-visualization/ has some really clever visualization tricks - probably a bit much for this plugin though. See also https://codisec.com/binary-visualization-explained/

https://github.com/tryexceptpass/perceptio is some much simpler code for rendering an image for a binary.
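As a rough illustration of the "render an image for a binary" idea, here is a minimal sketch that maps each byte to one grayscale pixel and emits a binary PGM (P5) image using only the standard library. This is not taken from perceptio or either article above; the function name `bytes_to_pgm` and the default width of 64 pixels are assumptions made for the example.

```python
import math


def bytes_to_pgm(data: bytes, width: int = 64) -> bytes:
    """Encode data as a binary PGM (P5) image, one grayscale pixel per byte."""
    if not data:
        raise ValueError("no data to render")
    height = math.ceil(len(data) / width)
    # Pad the final row with zero (black) pixels so the image is rectangular.
    pixels = data.ljust(width * height, b"\x00")
    header = b"P5\n%d %d\n255\n" % (width, height)
    return header + pixels
```

Writing the result to a `.pgm` file should give something most image viewers can open. A plugin would more likely want PNG output, which needs an extra dependency.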

simonw commented 5 years ago

Another cheap trick is the equivalent of the Unix strings command - https://stackoverflow.com/questions/6804582/extract-strings-from-a-binary-file-in-python
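A minimal sketch of that idea, assuming it is enough to pull out runs of printable ASCII (the real `strings` command has more options, and the Stack Overflow answers may take a different approach). The name `extract_strings` and the default minimum length of 4 are illustrative choices.

```python
import re


def extract_strings(data: bytes, min_length: int = 4) -> list:
    """Return runs of printable ASCII that are at least min_length bytes long."""
    # Printable ASCII here means 0x20 (space) through 0x7e (~).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]
```

For example, `extract_strings(b"\x00\x01XTSF\x00ab\x00hello world\xff")` keeps `XTSF` and `hello world` and drops the two-byte run `ab`.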

simonw commented 5 years ago

This is quite nice:

$ od -c /tmp/Thumb64Segment_11.data | head -n 10
0000000   \0  \0   @  \0  \0  \0 005   5   X   T   S   F  \0  \0  \0 001
0000020   \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0010000  025 030   . 377 026 032   . 377 027 033   - 377 031 035   0 377
0010020  032 036   1 377 036       5 377 037   !   8 377 036   "   : 377
0010040  036       8 377       $   : 377   !   &   ; 377   $   *   ? 377
0010060    '   -   ? 377   %   *   < 377   %   ,   > 377   -   3   E 377
0010100    6   ;   M 377   :   @   O 377   =   C   R 377   @   G   V 377
0010120    @   I   X 377   <   B   Q 377   8   @   N 377   8   @   P 377
0010140    :   C   T 377   ;   C   U 377   :   C   V 377   9   C   W 377

Here's a rough Python equivalent: http://code.activestate.com/recipes/579120-data_dumppy-like-the-unix-od-octal-dump-command/
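For reference, a simplified sketch of an `od -c` style dump. It is not the ActiveState recipe and does not try to reproduce the exact column spacing of any particular `od` build: it prints a 7-digit octal offset, 16 bytes per row in 4-character cells, C-style escapes or 3-digit octal for non-printable bytes, and a `*` for runs of repeated rows. The names `format_byte` and `od_c` are made up for the example.

```python
# C-style escapes for the control bytes that have a conventional short form.
_ESCAPES = {0: r"\0", 7: r"\a", 8: r"\b", 9: r"\t",
            10: r"\n", 11: r"\v", 12: r"\f", 13: r"\r"}


def format_byte(b: int) -> str:
    """Render one byte as an escape, a printable character, or 3-digit octal."""
    if b in _ESCAPES:
        return _ESCAPES[b]
    if 0x20 <= b <= 0x7E:
        return chr(b)
    return "%03o" % b


def od_c(data: bytes, width: int = 16) -> list:
    """Return the lines of a simplified od -c style dump of data."""
    lines = []
    previous = None
    collapsed = False
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Collapse full rows identical to the previous one into a single "*".
        if chunk == previous and len(chunk) == width:
            if not collapsed:
                lines.append("*")
                collapsed = True
            continue
        collapsed = False
        previous = chunk
        cells = "".join(format_byte(b).rjust(4) for b in chunk)
        lines.append("%07o%s" % (offset, cells))
    # Like od, finish with the total length as an octal offset.
    lines.append("%07o" % len(data))
    return lines
```

`print("\n".join(od_c(blob)))` gives output in the same spirit as the `od` listing above.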

simonw commented 5 years ago

[Screenshot: 3C9CCDBA-F346-47CB-BFEC-964B0426E728]

New idea: show essentially this but differentiate the escape sequences in some way. Maybe wrap them in <code> or put the non-escape sequences in bold?
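One way the `<code>` variant could look, as a sketch only (this is not necessarily how datasette-render-binary ended up doing it). Printable ASCII is passed through HTML-escaped, and every other byte, plus the backslash itself to avoid ambiguity, becomes a `\xNN` escape wrapped in `<code>`. The function name `highlight_escapes` is hypothetical.

```python
import html


def highlight_escapes(data: bytes) -> str:
    """Return HTML with non-printable bytes shown as <code>\\xNN</code>."""
    parts = []
    for b in data:
        # 0x5c is the backslash; escape it too so the output is unambiguous.
        if 0x20 <= b <= 0x7E and b != 0x5C:
            parts.append(html.escape(chr(b)))
        else:
            parts.append("<code>\\x%02x</code>" % b)
    return "".join(parts)
```

Inside a `render_cell` hook the resulting string would presumably need to be wrapped in a Jinja2/markupsafe `Markup` object so that Datasette does not escape the tags again.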

simonw commented 5 years ago

I'm going to call this datasette-render-binary: https://github.com/simonw/datasette-render-binary

simonw commented 5 years ago

Shipped 0.1 of the plugin! I'm pretty happy with this display format:

[Screenshot: many-photos-tables__RKFaceCrop__58_rows]

Gagravarr commented 5 years ago

If you don't mind calling out to Java, then Apache Tika is able to tell you what a load of "binary stuff" is, plus render it to XHTML where possible.

There's a Python wrapper around the Apache Tika server, but for a more typical Datasette use case you'd probably just want to grab the Tika CLI jar and call it with --detect and/or --xhtml to process the unknown binary blob.
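A rough sketch of the CLI route, assuming `java` is on the PATH, that the jar lives at a path you supply (`tika-app.jar` below is a placeholder), and that the Tika CLI reads the document from standard input when no file name is given. Only `--detect` is used here. The `run` parameter is an addition for this example so the subprocess call can be swapped out in tests.

```python
import subprocess


def detect_with_tika(blob: bytes, jar_path: str = "tika-app.jar",
                     run=subprocess.run) -> str:
    """Ask the Tika CLI for the MIME type of blob, passed on standard input."""
    result = run(
        ["java", "-jar", jar_path, "--detect"],  # jar_path is a placeholder
        input=blob,
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if java/Tika fails
    )
    return result.stdout.decode("utf-8").strip()
```

Starting a JVM per cell would be slow, which is one reason a plugin built on this would want to cache results or talk to a long-running Tika server instead.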

simonw commented 5 years ago

Calling out to Tika does make me a little nervous, but that's why Datasette has plugins! A plugin that calls Tika (and caches the results) could be really interesting.
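The caching part could be as simple as keying results on a hash of the blob. A minimal in-memory sketch, where `DetectionCache` is a hypothetical helper and `detect` is any callable that takes bytes and returns a string (for example, a function that shells out to Tika):

```python
import hashlib


class DetectionCache:
    """Memoize an expensive bytes -> str detector, keyed on SHA-256 of the input."""

    def __init__(self, detect):
        self._detect = detect
        self._results = {}

    def __call__(self, blob: bytes) -> str:
        # Hash the content so large blobs are not kept alive as dict keys.
        key = hashlib.sha256(blob).hexdigest()
        if key not in self._results:
            self._results[key] = self._detect(blob)
        return self._results[key]
```

A real plugin would probably want the cache to persist across restarts, for example in a SQLite table, but the lookup logic would stay the same.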