simonw / datasette-edit-schema

Datasette plugin for modifying table schemas
Apache License 2.0
16 stars 0 forks source link

Suggest type conversions for TEXT columns #13

Closed simonw closed 1 year ago

simonw commented 4 years ago

It would be great to suggest things like "the category column is all numbers but is currently TEXT, you could convert it to INTEGER"

simonw commented 4 years ago

Here's a query that can do that:

select
  'contains_non_integer' as col
from
  sortable
where
  cast(cast(sortable AS INTEGER) AS TEXT) != sortable
limit
  1

This will return 0 results if every column is an integer - it will shortcut the query and return a result as soon as it finds a non-integer column.

Example for a column containing only integers: https://latest.datasette.io/fixtures?sql=select%0D%0A++%27contains_non_integer%27+as+col%0D%0Afrom%0D%0A++sortable%0D%0Awhere%0D%0A++cast%28cast%28sortable+AS+INTEGER%29+AS+TEXT%29+%21%3D+sortable%0D%0Alimit%0D%0A++1

And for a column that instead contains floating point: https://latest.datasette.io/fixtures?sql=select%0D%0A++%27contains_non_integer%27+as+col%2C+*%0D%0Afrom%0D%0A++sortable%0D%0Awhere%0D%0A++cast%28cast%28sortable_with_nulls+AS+INTEGER%29+AS+TEXT%29+%21%3D+sortable_with_nulls%0D%0Alimit%0D%0A++1

simonw commented 4 years ago

Maybe sqlite-utils should grow utility functions for guessing types in this way.

simonw commented 4 years ago

Idea: return the first detected value that is not an integer. Then code can see if that value is not-a-float - if it's obviously not-a-float we don't have to run the float detection on it.

simonw commented 4 years ago

Running this query against my dogsheep-beta index is a good test, because the key column there starts with 25,000 numeric tweet IDs before getting the first non-integer GitHub commit. It takes 421ms.

select
  'contains_non_integer' as col, *
from
  search_index
where
  cast(cast(key AS INTEGER) AS TEXT) != key
limit
  1
simonw commented 4 years ago

This method works for detecting integers but doesn't work for floats - because cast(cast(key AS FLOAT) AS TEXT) run against e.g. 415 produces 415.0.

simonw commented 4 years ago

Useful thread on StackOverflow: https://stackoverflow.com/questions/32528759/how-to-check-if-a-value-is-a-number-in-sqlite

simonw commented 3 years ago

This would be really useful, especially as a prompt that displays at the top of the table page (via a fetch() so as not to delay page loading).

Every time I demonstrate CSV import to someone I have to give them a little spiel about how important it is to ensure the types are correct - this would be a good way to improve that.

simonw commented 3 years ago

In https://github.com/simonw/sqlite-utils/issues/282 I added the sqlite-utils insert --detect-types mechanism, but it doesn't use any of the clever SQL from this thread - it instead wraps the ingested CSV data in a not-yet-documented TypeTracker class and uses that to figure out what the types should be.

So there's no sqlite-utils Python library solution for detecting types for an existing table yet.

simonw commented 3 years ago

Could I use a custom sqlite3 aggregate Python function for this? Might be a bit slow.

simonw commented 1 year ago

I'm not going to do this. The new example display from here solves the same problem: