Closed by simonw 2 years ago
Columns are slightly different:
URL,TEXT,WIDTH,HEIGHT,similarity,punsafe,pwatermark,AESTHETIC_SCORE,hash,__index_level_0__
sqlite3 laion-aesthetic-6pls.db '
CREATE TABLE IF NOT EXISTS images (
  [url] TEXT,
  [text] TEXT,
  [width] INTEGER,
  [height] INTEGER,
  [similarity] FLOAT,
  [punsafe] FLOAT,
  [pwatermark] FLOAT,
  [aesthetic] FLOAT,
  [hash] TEXT,
  [__index_level_0__] INTEGER
);'
for filename in *.parquet; do
  parquet-tools csv "$filename" | sqlite3 -csv laion-aesthetic-6pls.db ".import --skip 1 '|cat -' images"
done
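For reference, here is a minimal Python sketch of what that import loop does, using only the standard library. It assumes the CSV column order matches the header shown above and that `AESTHETIC_SCORE` lands in the `aesthetic` column by position. The sample row is made up for illustration, and `create_images_table` and `import_csv` are hypothetical helper names, not part of any tool used here.

```python
import csv
import io
import sqlite3

# Column order follows the CSV header; names follow the CREATE TABLE statement.
COLUMNS = [
    "url", "text", "width", "height", "similarity",
    "punsafe", "pwatermark", "aesthetic", "hash", "__index_level_0__",
]


def create_images_table(conn):
    """Create the images table with the same schema as the sqlite3 command."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS images (
            [url] TEXT, [text] TEXT, [width] INTEGER, [height] INTEGER,
            [similarity] FLOAT, [punsafe] FLOAT, [pwatermark] FLOAT,
            [aesthetic] FLOAT, [hash] TEXT, [__index_level_0__] INTEGER
        )
        """
    )


def import_csv(conn, fileobj):
    """Insert every data row from fileobj, skipping the header line."""
    reader = csv.reader(fileobj)
    next(reader, None)  # same effect as `.import --skip 1`
    placeholders = ", ".join("?" for _ in COLUMNS)
    cursor = conn.executemany(
        f"INSERT INTO images VALUES ({placeholders})", reader
    )
    conn.commit()
    return cursor.rowcount


# Made-up sample data, for illustration only.
sample = io.StringIO(
    "URL,TEXT,WIDTH,HEIGHT,similarity,punsafe,pwatermark,"
    "AESTHETIC_SCORE,hash,__index_level_0__\n"
    "https://example.com/a.jpg,A red barn,640,480,0.31,0.01,0.02,6.4,abc123,0\n"
)
conn = sqlite3.connect(":memory:")
create_images_table(conn)
imported = import_csv(conn, sample)
```

Because the insert is positional, the mismatch between the CSV header names and the table column names does not matter, which is also why the shell version can skip the header row.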
Then:
time sqlite-utils enable-fts laion-aesthetic-6pls.db images text
sqlite-utils enable-fts laion-aesthetic-6pls.db images text 81.36s user 6.00s system 99% cpu 1:28.14 total
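As a rough illustration of what enabling full-text search on the `text` column amounts to, here is a sketch that builds an FTS5 index by hand. It assumes the local SQLite build includes FTS5, and it assumes an index table named `images_fts`, which I believe is the sqlite-utils default but have not confirmed here. The rows and the `search` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (url TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO images (url, text) VALUES (?, ?)",
    [
        # Made-up rows, for illustration only.
        ("https://example.com/1.jpg", "A watercolor painting of a lighthouse"),
        ("https://example.com/2.jpg", "Portrait photo of a cat"),
    ],
)

# External-content FTS5 table that indexes images.text without copying it.
conn.execute(
    "CREATE VIRTUAL TABLE images_fts USING fts5(text, content='images')"
)
# Populate the index from the rows that already exist.
conn.execute(
    "INSERT INTO images_fts (rowid, text) SELECT rowid, text FROM images"
)


def search(conn, query):
    """Return URLs of images whose text matches the FTS5 query."""
    sql = """
        SELECT images.url
        FROM images
        JOIN images_fts ON images.rowid = images_fts.rowid
        WHERE images_fts MATCH ?
        ORDER BY rank
    """
    return [row[0] for row in conn.execute(sql, (query,))]
```

Calling `search(conn, "lighthouse")` returns the first URL only. Building the index over the full table is the part that took about a minute and a half above.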
Search works. I'm going to re-run this and then push it to the server:
sqlite-utils laion-aesthetic-6pls.db '
with counts as (
select domain_id, count(*) as c from images group by domain_id
)
update domain
set image_count = counts.c
from counts
where id = counts.domain_id
'
% time sqlite-utils extract laion-aesthetic-6pls.db images domain
sqlite-utils extract laion-aesthetic-6pls.db images domain 53.39s user 92.59s system 79% cpu 3:03.63 total
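The idea behind that extract step can be sketched in plain SQL driven from Python. This assumes `images` already has a text `domain` column (its creation is not shown in these notes) and uses made-up rows. It is a simplified approximation: as far as I understand it, sqlite-utils also drops the original text column by rebuilding the table, which this sketch leaves out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (url TEXT, domain TEXT)")
conn.executemany(
    "INSERT INTO images (url, domain) VALUES (?, ?)",
    [
        # Made-up rows, for illustration only.
        ("https://a.example/1.jpg", "a.example"),
        ("https://a.example/2.jpg", "a.example"),
        ("https://b.example/1.jpg", "b.example"),
    ],
)

# 1. Create a lookup table holding each distinct domain once.
conn.execute(
    "CREATE TABLE domain (id INTEGER PRIMARY KEY, domain TEXT UNIQUE)"
)
conn.execute(
    "INSERT INTO domain (domain) "
    "SELECT DISTINCT domain FROM images ORDER BY domain"
)

# 2. Add a foreign key column to images and point it at the lookup rows.
conn.execute(
    "ALTER TABLE images ADD COLUMN domain_id INTEGER REFERENCES domain(id)"
)
conn.execute(
    "UPDATE images SET domain_id = "
    "(SELECT id FROM domain WHERE domain.domain = images.domain)"
)

# Join back through the lookup table to confirm nothing was lost.
pairs = conn.execute(
    "SELECT images.url, domain.domain FROM images "
    "JOIN domain ON domain.id = images.domain_id ORDER BY images.url"
).fetchall()
```

Storing each domain string once and referencing it by integer ID is what makes the per-domain `group by domain_id` counts cheap.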
I broke the server by deleting files there, and now I can't scp them back again because the server won't run so there is nothing to SSH to.
Need to deploy it fresh. Previously I used:
datasette publish fly \
--app laion-aesthetic \
--volume-name datasette \
--extra-options "-i /data/data.db \
--inspect-file /data/inspect.json \
--setting sql_time_limit_ms 10000 \
--setting suggest_facets 0 \
--setting allow_download 0" -m metadata.yml
To get it running again I'll do this:
datasette publish fly \
--app laion-aesthetic \
--volume-name datasette
Then I'll scp up the laion-aesthetic-6pls.db file, then ssh in and run datasette inspect laion-aesthetic-6pls.db > /data/inspect.json, then run:
datasette publish fly \
--app laion-aesthetic \
--volume-name datasette \
--install datasette-json-html \
--extra-options "-i /data/laion-aesthetic-6pls.db --inspect-file /data/inspect.json --setting sql_time_limit_ms 10000 --setting suggest_facets 0 --setting allow_download 0" \
-m metadata.yml
Where metadata.yml looks like this:
databases:
  laion-aesthetic-6pls:
    tables:
      domain:
        label_column: domain
It's on the server now:
sqlite-utils add-column laion-aesthetic-6pls.db domain image_counts integer
sqlite-utils laion-aesthetic-6pls.db '
with counts as (
select domain_id, count(*) as c from images group by domain_id
)
update domain
set image_counts = counts.c
from counts
where id = counts.domain_id
'
datasette inspect laion-aesthetic-6pls.db > inspect.json
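The count update above relies on `UPDATE ... FROM`, which needs SQLite 3.33 or later. A sketch of a more portable variant with a correlated subquery follows, using made-up rows. One behavioural difference to be aware of: this version writes 0 for a domain with no images, where the `UPDATE ... FROM` form would leave that row untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domain (id INTEGER PRIMARY KEY, domain TEXT)")
conn.execute("CREATE TABLE images (url TEXT, domain_id INTEGER)")
conn.executemany(
    "INSERT INTO domain (id, domain) VALUES (?, ?)",
    [(1, "a.example"), (2, "b.example")],  # made-up rows
)
conn.executemany(
    "INSERT INTO images (url, domain_id) VALUES (?, ?)",
    [
        ("https://a.example/1.jpg", 1),
        ("https://a.example/2.jpg", 1),
        ("https://b.example/1.jpg", 2),
    ],
)

# Equivalent of `sqlite-utils add-column ... domain image_counts integer`.
conn.execute("ALTER TABLE domain ADD COLUMN image_counts INTEGER")

# Correlated subquery: count the images that reference each domain row.
conn.execute(
    "UPDATE domain SET image_counts = "
    "(SELECT count(*) FROM images WHERE images.domain_id = domain.id)"
)

counts = dict(conn.execute("SELECT domain, image_counts FROM domain"))
```

With the sample rows, `counts` maps `a.example` to 2 and `b.example` to 1.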
Now running that deploy:
datasette publish fly \
--app laion-aesthetic \
--volume-name datasette \
--install datasette-json-html \
--extra-options "-i /data/laion-aesthetic-6pls.db --inspect-file /data/inspect.json --setting sql_time_limit_ms 10000 --setting suggest_facets 0 --setting allow_download 0" \
-m metadata.yml
Actually needs this metadata:
databases:
  laion-aesthetic-6pls:
    tables:
      domain:
        label_column: domain
  laion-aesthetic-6pls-2:
    allow: false
  data:
    allow: false
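Datasette also accepts metadata as JSON, so the same configuration can be generated from a script. The sketch below assumes that `laion-aesthetic-6pls-2` and `data` are sibling entries under `databases`, which is my reading of the metadata above: the `allow: false` blocks hide the other database files on the volume.

```python
import json

# Assumed nesting: three database entries, two of them denied to everyone.
metadata = {
    "databases": {
        "laion-aesthetic-6pls": {
            "tables": {"domain": {"label_column": "domain"}}
        },
        "laion-aesthetic-6pls-2": {"allow": False},
        "data": {"allow": False},
    }
}

# Serialize to a string that could be saved as metadata.json.
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

The output could be written to a file and passed with `-m metadata.json` in place of metadata.yml.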
Next step: celebrity counts from:
We realized we've been using v1 of the training data, but Stable Diffusion is trained on v2.
Going to try this instead: https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus