Closed simonw closed 1 year ago
.transform(pk=...) can handle this, refs #11.
I wonder if this will even need extra validation for unique values, or if I can instead trust .transform() to raise an exception which can be caught and displayed to the user.
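A minimal stdlib sketch of that failure mode (not sqlite-utils' actual implementation): rebuilding the rows into a new table that declares the chosen primary key raises `sqlite3.IntegrityError` when the column contains duplicates, which is exactly the kind of exception that could be caught and shown to the user. The table and column names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, name text)")
# duplicate id values, so id cannot be a primary key
conn.executemany("insert into t values (?, ?)", [(1, "a"), (1, "b")])

# transform-style rebuild: copy rows into a new table with the desired pk
conn.execute("create table t_new (id integer primary key, name text)")
try:
    conn.execute("insert into t_new select id, name from t")
except sqlite3.IntegrityError as ex:
    # the duplicate id values violate the new PRIMARY KEY constraint
    print("cannot use id as primary key:", ex)
```

The aborted insert leaves `t_new` untouched, so the original table can be kept as-is while the error is reported.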
Realized that this feature isn't any use if you can't set a primary key!
Here's a challenge: if the table is small enough I can run a test on each column to see if it only includes unique values (and no null
values) and, if so, suggest it as a potential primary key.
But... if the table is large, that becomes prohibitively expensive. So what do I do there?
I think I give people the option to set a primary key showing them ALL columns, and then throw an error on the next page if they pick a column that isn't actually unique.
This SQL query can do a single table scan to figure out the distinct counts:
select
count(*) as _count,
count(distinct id) as 'distinct.id',
count(distinct first_name) as 'distinct.first_name',
count(distinct last_name) as 'distinct.last_name',
count(distinct email) as 'distinct.email'
from
Customers
It needs to maintain a set for each column though, which is why I don't think it should run on tables with millions of rows.
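That query can be generated for any table from its column list. A sketch using `PRAGMA table_info` (the helper name is hypothetical):

```python
import sqlite3

def distinct_counts_sql(conn, table):
    # Build the single-scan query counting distinct values in every column
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info([{table}])")]
    selects = ["count(*) as _count"] + [
        f"count(distinct [{col}]) as 'distinct.{col}'" for col in cols
    ]
    return "select\n  " + ",\n  ".join(selects) + f"\nfrom\n  [{table}]"
```

Any column whose `distinct.*` count equals `_count` is a uniqueness candidate.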
Ideally this wouldn't run on any column that has at least one row that is longer than some threshold, maybe 256 bytes. Not sure how best to do that though.
select
count(*) as _count,
max(length(id)) as 'maxlen.id',
max(length(first_name)) as 'maxlen.first_name',
max(length(last_name)) as 'maxlen.last_name',
max(length(email)) as 'maxlen.email'
from
Customers
I could run that just against the text columns. It shouldn't be too expensive - it's a table scan, but SQLite only has to track the highest value seen so far, not every value as it would for a distinct count.
My criteria for a potential primary key column: every value is unique and there are no null values. This query combines the length and null checks in a single scan:
select
count(*) as _count,
max(length(id)) as 'maxlen.id',
max(length(first_name)) as 'maxlen.first_name',
max(length(last_name)) as 'maxlen.last_name',
max(length(email)) as 'maxlen.email',
sum(case when id is null then 1 else 0 end) as 'nulls.id',
sum(case when first_name is null then 1 else 0 end) as 'nulls.first_name',
sum(case when last_name is null then 1 else 0 end) as 'nulls.last_name',
sum(case when email is null then 1 else 0 end) as 'nulls.email'
from
customers;
I can run that query first - it's a full table scan, but it will give me back the columns that I should run the more expensive count distinct on.
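Putting the two passes together, a hypothetical sketch: run the cheap maxlen/nulls scan first, drop columns with nulls or over-long values, then run the expensive distinct counts only on the survivors. The helper name and the 256-byte threshold are the ones floated above, not settled API:

```python
import sqlite3

MAX_LEN = 256  # assumed threshold for "too long to be a sensible pk"

def candidate_pk_columns(conn, table):
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info([{table}])")]
    selects = ["count(*) as _count"]
    for c in cols:
        selects.append(f"max(length([{c}])) as 'maxlen.{c}'")
        selects.append(f"sum(case when [{c}] is null then 1 else 0 end) as 'nulls.{c}'")
    cur = conn.execute("select " + ", ".join(selects) + f" from [{table}]")
    row = dict(zip([d[0] for d in cur.description], cur.fetchone()))
    total = row["_count"]
    # cheap first pass: drop columns with any nulls or over-long values
    survivors = [
        c for c in cols
        if row[f"nulls.{c}"] == 0 and (row[f"maxlen.{c}"] or 0) <= MAX_LEN
    ]
    if not survivors:
        return []
    # expensive second pass: count distinct only on the surviving columns
    distincts = conn.execute(
        "select "
        + ", ".join(f"count(distinct [{c}])" for c in survivors)
        + f" from [{table}]"
    ).fetchone()
    return [c for c, d in zip(survivors, distincts) if d == total]
```

Columns that come back from this are safe to suggest; anything else would fail the uniqueness check on transform anyway.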
Sometimes a table will have been imported that has a primary key but is currently being treated as a rowid table. An option for picking a column as a primary key would be useful - but ideally it would validate that the selected column does not have duplicate values.
It could also suggest a primary key, using a time-constrained query to check each column for all-unique not-null values.
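One way to time-constrain that check in Python's sqlite3 is a progress handler that cancels the query once a budget expires; a truthy return from the handler aborts the statement with OperationalError. A sketch (the helper name and budget value are illustrative):

```python
import sqlite3
import time

def time_limited_fetchone(conn, sql, budget_ms=100):
    # The handler fires every 1000 SQLite opcodes; returning True
    # cancels the running query, raising sqlite3.OperationalError.
    deadline = time.monotonic() + budget_ms / 1000
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(sql).fetchone()
    except sqlite3.OperationalError:
        return None  # out of time; caller treats the column as "unknown"
    finally:
        conn.set_progress_handler(None, 0)
```

With this, the suggestion step can probe each column and simply skip any whose uniqueness check doesn't finish inside the budget, falling back to showing all columns.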