simonw / datasette-edit-schema

Datasette plugin for modifying table schemas
Apache License 2.0

Ability to set a primary key #1

Closed: simonw closed this issue 1 year ago

simonw commented 4 years ago

Sometimes a table will have been imported that has a primary key but is currently being treated as a rowid table.

An option for picking a column as a primary key would be useful - but ideally it would validate that the selected column does not contain duplicate values.

It could also suggest a primary key, using a time-constrained query to check each column for all-unique not-null values.
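
SQLite has no per-query timeout in SQL itself, but Python's sqlite3 progress handler can abort a statement once a deadline passes. A minimal sketch of one way to time-constrain the suggestion query (execute_with_time_limit is a made-up helper, not part of Datasette):

import sqlite3
import time

def execute_with_time_limit(conn, sql, seconds=0.5):
    # The handler runs every 1000 SQLite VM instructions; a truthy
    # return value aborts the statement with OperationalError
    deadline = time.monotonic() + seconds
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        return None  # query ran out of time - suggest nothing
    finally:
        conn.set_progress_handler(None, 1000)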

simonw commented 3 years ago

.transform(pk=...) can handle this, refs #11.
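
For reference, the sqlite-utils call looks like this (database path, table and column names are illustrative):

import sqlite_utils

db = sqlite_utils.Database("data.db")
# Rebuilds the table so id becomes the primary key instead of rowid
db["customers"].transform(pk="id")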

simonw commented 3 years ago

I wonder if this will even need extra validation for unique values, or if I can instead trust .transform() to raise an exception which can be caught and displayed to the user.
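
If the chosen column does contain duplicates, the row-copying step inside .transform() should violate the new PRIMARY KEY constraint. A sketch of catching that (assuming the error surfaces as sqlite3.IntegrityError; print stands in for however the plugin would report it):

import sqlite3
import sqlite_utils

db = sqlite_utils.Database("data.db")
try:
    db["customers"].transform(pk="email")
except sqlite3.IntegrityError as ex:
    # e.g. "UNIQUE constraint failed: ..." - show this to the user
    print(f"Cannot use that column as a primary key: {ex}")

Since .transform() performs its work inside a transaction, a failure should leave the original table intact.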

simonw commented 1 year ago

Realized that this feature isn't any use if you can't set a primary key!

simonw commented 1 year ago

Here's a challenge: if the table is small enough I can run a test on each column to see if it only includes unique values (and no null values) and, if so, suggest it as a potential primary key.

But... if the table is large, that becomes prohibitively expensive. So what do I do there?

I think I give people the option to set a primary key from a list of ALL columns, then throw an error on the next page if they pick a column that isn't actually unique.
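
A sketch of that branching logic (the row-count threshold is invented):

SMALL_TABLE_THRESHOLD = 10_000  # hypothetical cut-off

def is_unique_not_null(conn, table, column):
    # count(*) counts all rows; count(col) skips nulls;
    # count(distinct col) skips nulls and duplicates
    sql = (
        "select count(*) = count([{col}]) "
        "and count(*) = count(distinct [{col}]) "
        "from [{table}]"
    ).format(col=column, table=table)
    return bool(conn.execute(sql).fetchone()[0])

def pk_column_options(conn, table, columns):
    count = conn.execute("select count(*) from [{}]".format(table)).fetchone()[0]
    if count <= SMALL_TABLE_THRESHOLD:
        # Small table: cheap enough to test every column up front
        return [c for c in columns if is_unique_not_null(conn, table, c)]
    # Large table: offer every column, validate after the user picks one
    return list(columns)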

simonw commented 1 year ago

This SQL query can do a single table scan to figure out the distinct counts:

select
  count(*) as _count,
  count(distinct id) as 'distinct.id',
  count(distinct first_name) as 'distinct.first_name',
  count(distinct last_name) as 'distinct.last_name',
  count(distinct email) as 'distinct.email'
from
  Customers

It needs to maintain a set for each column though, which is why I don't think it should run on tables with millions of rows.

[Screenshot: the distinct-count query and its results]
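
That query can be generated for any list of columns - a hypothetical helper:

def distinct_counts_sql(table, columns):
    # One table scan: total row count plus a distinct count per column
    selects = ["count(*) as _count"] + [
        "count(distinct [{c}]) as 'distinct.{c}'".format(c=c) for c in columns
    ]
    return "select\n  " + ",\n  ".join(selects) + "\nfrom\n  [{}]".format(table)

Calling distinct_counts_sql("Customers", ["id", "first_name", "last_name", "email"]) reproduces the query above.
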
simonw commented 1 year ago

Ideally this wouldn't run against any column containing at least one value longer than some threshold, maybe 256 bytes. Not sure how best to do that though.

simonw commented 1 year ago
select
  count(*) as _count,
  max(length(id)) as 'maxlen.id',
  max(length(first_name)) as 'maxlen.first_name',
  max(length(last_name)) as 'maxlen.last_name',
  max(length(email)) as 'maxlen.email'
from
  Customers

I could run that just against the text columns. It shouldn't be too expensive - it's a table scan, but SQLite only has to track the highest value seen so far, not every value the way a count(distinct) does.
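
Finding which columns are text is one PRAGMA away. A sketch (treating any declared type containing TEXT as text is a simplification of SQLite's type affinity rules):

def text_columns(conn, table):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute("PRAGMA table_info([{}])".format(table)).fetchall()
    return [
        name
        for _cid, name, coltype, *_rest in rows
        if "TEXT" in (coltype or "").upper()
    ]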

simonw commented 1 year ago

My criteria for a potential primary key column:

- no null values
- every value unique
- no value longer than the length threshold

simonw commented 1 year ago
select
  count(*) as _count,
  max(length(id)) as 'maxlen.id',
  max(length(first_name)) as 'maxlen.first_name',
  max(length(last_name)) as 'maxlen.last_name',
  max(length(email)) as 'maxlen.email',
  sum(case when id is null then 1 else 0 end) as 'nulls.id',
  sum(case when first_name is null then 1 else 0 end) as 'nulls.first_name',
  sum(case when last_name is null then 1 else 0 end) as 'nulls.last_name',
  sum(case when email is null then 1 else 0 end) as 'nulls.email'
from
  Customers;

I can run that query first - it's a full table scan, but it gives me back the columns that are worth running the more expensive count(distinct) against.

[Screenshot: the maxlen/nulls query and its results]
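
A sketch of that filtering step, building the combined query for a given column list and keeping only the survivors (MAX_VALUE_LENGTH and the function name are inventions):

MAX_VALUE_LENGTH = 256  # the threshold floated earlier in this thread

def pk_candidates(conn, table, columns):
    # Build the combined maxlen/nulls query for the given columns
    selects = ["count(*) as _count"]
    for c in columns:
        selects.append("max(length([{c}])) as 'maxlen.{c}'".format(c=c))
        selects.append(
            "sum(case when [{c}] is null then 1 else 0 end) as 'nulls.{c}'".format(c=c)
        )
    sql = "select " + ", ".join(selects) + " from [{}]".format(table)
    cursor = conn.execute(sql)
    stats = dict(zip([d[0] for d in cursor.description], cursor.fetchone()))
    # Keep only columns with no nulls and no oversized values
    return [
        c for c in columns
        if stats["nulls." + c] == 0
        and (stats["maxlen." + c] or 0) <= MAX_VALUE_LENGTH
    ]
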
simonw commented 1 year ago

Demo:

[Animated demo: pk-demo]