simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0
9.61k stars 690 forks source link

?sort=colname~numeric to sort by by column cast to real #894

Open simonw opened 4 years ago

simonw commented 4 years ago

If a text column actually contains numbers, being able to "sort by column, treated as numeric" would be really useful.

Probably depends on column actions enabled by #690

simonw commented 4 years ago

This needs querystring parameter design. Some options:

simonw commented 4 years ago

I could even let plugins define new sort types. Imagine a plugin that enables this:

?_sort_date_desc=mycol - where it knows how to handle specific date formats, or even uses dateutil.parser.parse.

simonw commented 4 years ago

Sorting by date when the column has a junk date format in it is such a column need it should maybe ship in Datasette by default - though I've been trying to avoid adding heavy dependencies like dateutil if I can get away with it.

simonw commented 4 years ago

This does feel like a weird plugin hook just because there aren't really THAT many different use-cases that plugins could solve. The ones I can think of are:

Could this work if I just allow _sort_clause=?

One possible solution for the no-arbitrary-SQL case: users can define sort orders in metadata.json/yml. So if you want to enable sort-by-distance without enabling arbitrary SQL you could add something like this:

databases:
  mydb:
    tables:
      museums:
        sort_clause: bm25(fts)
simonw commented 4 years ago

There's something interesting about figuring out which sort options should be offered in the column actions menu.

Two options:

simonw commented 4 years ago

The simplest solution would be for Python code to scan all of the visible values on the current page and show the column action for "sort by this numeric" based purely on that. I already do that in the JavaScript for "are there any blank values in the first page?" here:

https://github.com/simonw/datasette/blob/4f7c0ebd85ccd8c1853d7aa0147628f7c1b749cc/datasette/static/table.js#L106-L118

simonw commented 4 years ago

I think the first version of this feature involves implementing ?_sort_numeric=col and ?_sort_numeric_desc=col plus the JavaScript to detect if those values should be shown in the column actions menu.

One question: how to reflect that this is happening in the current sort UI. This menu here for example:

fixtures__compound_three_primary_keys__1_001_rows_where_sorted_by_pk3_descending

And this interface: how should it indicate that a text is currently sorted numerically v.s. sorted alphabetically, and allow the user to switch from one to the other?

fixtures__sortable__201_rows_where_sorted_by_sortable_with_nulls_descending
simonw commented 4 years ago

For the "Sort by X" select menu case... I could automatically expand that menu to contain extra options for "Sort numerically by X" for each TEXT column in the table. That's a pretty good option.

For the action cog menu, I can add the extra options to the cog menu - and rely on the fact that the title of the page will say "Sorted numerically by colname descending".

simonw commented 4 years ago

This is enough of a design to build a working prototype.

simonw commented 4 years ago

The Sort by <select> menu needs a rethink: https://github.com/simonw/datasette/blob/4f7c0ebd85ccd8c1853d7aa0147628f7c1b749cc/datasette/templates/table.html#L80-L87

If it's going to include sort by numeric options it needs a different format - since ?_sort=colname needs to also support ?_sort=numeric_colname - but shouldn't clash with a weird column already named numeric_colname.

Can I come up with a value syntax for this that is guaranteed not to clash with a weirdly named existing column?

I think so. I could use {"type": "numeric", "column": "mycolumn"} as the value - then if a column has a name that is itself valid JSON (weird but possible) the _sort= value would be {"type": "default", "column": "{\"type\": \"numeric\", \"name\": \"column\"}"}.

simonw commented 4 years ago

Simpler option: ?_sort= column values look like this:

simonw commented 4 years ago

Even better solution: use URL encoding in the parameter details. This is consistent with how ?_next= tokens work, e.g. ?_next=0.291861560261786%2Ce%2Cj.

So the format can be:

For most columns this will look like: ?_sort=score$numeric

For columns with a $ in their name it will be ?_sort=score%24hasdollar$numeric

Problem: both $ and , are usually URL encoded anyway. I need a character which isn't encoded by default, so that I can use its encoded form to show it is part of the column name and its un-encoded form to split the cast indicator.

_ is a candidate here - not encoded by default, but can be encoded as %5F.

The other unreserved non-alphanumeric characters are -, ., _, ~.

Of these, ~ is least likely to show up in a column name. So I'll use that.

simonw commented 4 years ago

Urgh this isn't going to work. %7E~%7E gets decoded as ~~~ so I wouldn't be able to tell the difference.

I could use double-percentage-encoding here instead. I feel like there's a simpler solution that I'm missing (and that may well be in use within Datasette already, I'm not doing great thinking this morning).

simonw commented 4 years ago

Much easier solution: if the suffix is ~numeric then treat it as the column name sorted numerically. If the suffix is missing OR the suffix is ~default, sort without casting. Only add the ~default suffix if the column name itself contains at least one ~ symbol.

Using ~ because it doesn't need to be URL-encoded.

simonw commented 4 years ago

Relevant code: https://github.com/simonw/datasette/blob/4f7c0ebd85ccd8c1853d7aa0147628f7c1b749cc/datasette/views/table.py#L485-L510

simonw commented 4 years ago

Something to watch out for: "" empty strings cast to 0.0:

select cast("100" as real), "100", cast(null as real), cast("" as real)

cast("100" as real) "100" cast(null as real) cast("" as real)
100.0 100   0.0

https://latest.datasette.io/fixtures?sql=select+cast%28%22100%22+as+real%29%2C+%22100%22%2C+cast%28null+as+real%29%2C+cast%28%22%22+as+real%29

simonw commented 4 years ago

cast(nullif(colname, '') as real) can fix this - it will treat '' the same as null.

simonw commented 4 years ago

Also need to rethink this template logic that decides if to show a column as sorted or not: https://github.com/simonw/datasette/blob/4f7c0ebd85ccd8c1853d7aa0147628f7c1b749cc/datasette/templates/_table.html#L10-L14

simonw commented 4 years ago

Prototype so far:

diff --git a/datasette/views/table.py b/datasette/views/table.py
index ea11a51..d61f8bd 100644
--- a/datasette/views/table.py
+++ b/datasette/views/table.py
@@ -497,17 +497,32 @@ class TableView(RowTableShared):
         if sort and sort_desc:
             raise DatasetteError("Cannot use _sort and _sort_desc at the same time")

+        def parse_sort(sort):
+            if "~" in sort:
+                if sort.endswith("~default"):
+                    col = sort.rsplit("~", 1)[0]
+                    return col, escape_sqlite(col)
+                elif sort.endswith("~numeric"):
+                    col = sort.rsplit("~", 1)[0]
+                    return col, "cast(nullif({}, '') as real)".format(escape_sqlite(col))
+                else:
+                    return sort, escape_sqlite(sort)
+            else:
+                return sort, escape_sqlite(sort)
+
         if sort:
-            if sort not in sortable_columns:
-                raise DatasetteError("Cannot sort table by {}".format(sort))
+            sort_column, sort_clause = parse_sort(sort)
+            if sort_column not in sortable_columns:
+                raise DatasetteError("Cannot sort table by {}".format(sort_column))

-            order_by = escape_sqlite(sort)
+            order_by = sort_clause

         if sort_desc:
-            if sort_desc not in sortable_columns:
-                raise DatasetteError("Cannot sort table by {}".format(sort_desc))
+            sort_column, sort_clause = parse_sort(sort_desc)
+            if sort_column not in sortable_columns:
+                raise DatasetteError("Cannot sort table by {}".format(sort_column))

-            order_by = "{} desc".format(escape_sqlite(sort_desc))
+            order_by = "{} desc".format(sort_clause)

         from_sql = "from {table_name} {where}".format(
             table_name=escape_sqlite(table),
simonw commented 3 years ago

Maybe ?_sort_numeric=col and ?_sort_numeric_desc=col would be better here.

simonw commented 3 years ago

I could add these sorting links to the cog menu for any TEXT columns.