simonw / laion-aesthetic-datasette

Use Datasette to explore LAION improved_aesthetics_6plus training data used by Stable Diffusion

Try improved_aesthetics_6plus instead #7

Closed simonw closed 2 years ago

simonw commented 2 years ago

We realized we've been using v1 of the training data, but Stable Diffusion is trained on v2.

Going to try this instead: https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus

simonw commented 2 years ago

Columns are slightly different:

URL,TEXT,WIDTH,HEIGHT,similarity,punsafe,pwatermark,AESTHETIC_SCORE,hash,__index_level_0__

simonw commented 2 years ago

sqlite3 laion-aesthetic-6pls.db '
CREATE TABLE IF NOT EXISTS images (
   [url] TEXT,
   [text] TEXT,
   [width] INTEGER,
   [height] INTEGER,
   [similarity] FLOAT,
   [punsafe] FLOAT,
   [pwatermark] FLOAT,
   [aesthetic] FLOAT,
   [hash] TEXT,
   [__index_level_0__] INTEGER
);'
# Convert each Parquet file to CSV and pipe it into SQLite, skipping the CSV header row
for filename in *.parquet; do
    parquet-tools csv "$filename" | sqlite3 -csv laion-aesthetic-6pls.db ".import --skip 1 '|cat -' images"
done
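
For reference, the same header-skipping import can be sketched in Python with only the standard library. This is an illustrative sketch, not what was run here: it assumes the Parquet files have already been converted to CSV, and the `import_csv` helper and the sample row are made up for the example.

```python
import csv
import io
import sqlite3

# Ten columns, in the order of the CSV header quoted earlier in this thread.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS images (
    [url] TEXT, [text] TEXT, [width] INTEGER, [height] INTEGER,
    [similarity] FLOAT, [punsafe] FLOAT, [pwatermark] FLOAT,
    [aesthetic] FLOAT, [hash] TEXT, [__index_level_0__] INTEGER
)
"""


def import_csv(conn, csv_file):
    """Insert every row of csv_file into images, skipping the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # plays the role of `.import --skip 1`
    with conn:  # one transaction per file
        conn.executemany(
            "INSERT INTO images VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", reader
        )


# Made-up sample row, for illustration only (not real LAION data)
sample = io.StringIO(
    "URL,TEXT,WIDTH,HEIGHT,similarity,punsafe,pwatermark,"
    "AESTHETIC_SCORE,hash,__index_level_0__\n"
    'https://example.com/a.jpg,"A cat, sitting",512,384,0.31,0.01,0.02,6.5,abc123,0\n'
)

conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)
import_csv(conn, sample)
row = conn.execute(
    "SELECT text, width, typeof(width), typeof(aesthetic) FROM images"
).fetchone()
print(row)
```

The CSV values arrive as strings; the column type affinities declared in the table (INTEGER, FLOAT) are what turn them into numbers on insert.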

Then:

time sqlite-utils enable-fts laion-aesthetic-6pls.db images text
sqlite-utils enable-fts laion-aesthetic-6pls.db images text  81.36s user 6.00s system 99% cpu 1:28.14 total
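
As a rough illustration of what that step sets up: a minimal sketch of an FTS5 index over the text column, assuming the Python build's SQLite includes FTS5. The `images_fts` table name and the two sample rows are assumptions for the example, and the real command may differ in its details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (url TEXT, text TEXT)")
# Made-up captions, for illustration only
conn.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [
        ("https://example.com/1.jpg", "A watercolor painting of a lighthouse"),
        ("https://example.com/2.jpg", "Portrait photo of a dog"),
    ],
)

# External-content FTS5 table: the index points back at images rows by rowid
conn.execute("CREATE VIRTUAL TABLE images_fts USING fts5(text, content='images')")
conn.execute("INSERT INTO images_fts (rowid, text) SELECT rowid, text FROM images")

# Search by looking up matching rowids in the index
matches = conn.execute(
    """
    SELECT url FROM images
    WHERE rowid IN (SELECT rowid FROM images_fts WHERE images_fts MATCH ?)
    """,
    ("lighthouse",),
).fetchall()
print(matches)
```

Because the index is populated once from existing rows, later inserts into images would not be searchable unless the index is rebuilt or kept in sync with triggers.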

simonw commented 2 years ago

Search works. I'm going to re-run this:

And push it to the server.

simonw commented 2 years ago

sqlite-utils laion-aesthetic-6pls.db '
with counts as (
  select domain_id, count(*) as c from images group by domain_id
)
update domain
  set image_count = counts.c
  from counts
  where id = counts.domain_id
'
time sqlite-utils extract laion-aesthetic-6pls.db images domain
sqlite-utils extract laion-aesthetic-6pls.db images domain  53.39s user 92.59s system 79% cpu 3:03.63 total
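
To make the two steps concrete, here is a small sketch of the idea in plain `sqlite3`: pull the domain values out into a lookup table, point `images` at it through `domain_id`, then store per-domain counts. It assumes `images` already has a `domain` column (this thread does not show how that column was populated, so the sketch derives it from the URL with `urlparse`), it uses a correlated subquery instead of `UPDATE ... FROM` so that it also runs on older SQLite versions, and the sample URLs are made up.

```python
import sqlite3
from urllib.parse import urlparse

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (url TEXT, domain TEXT)")

# Made-up URLs; deriving domain from the URL is an assumption of this sketch
urls = [
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
    "https://example.org/c.jpg",
]
conn.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [(u, urlparse(u).netloc) for u in urls],
)

# Step 1: extract distinct domains into a lookup table and link images to it.
# (The original text column is left in place here to keep the sketch short.)
conn.execute(
    "CREATE TABLE domain (id INTEGER PRIMARY KEY, domain TEXT UNIQUE, image_count INTEGER)"
)
conn.execute("INSERT INTO domain (domain) SELECT DISTINCT domain FROM images")
conn.execute("ALTER TABLE images ADD COLUMN domain_id INTEGER")
conn.execute(
    "UPDATE images SET domain_id = "
    "(SELECT id FROM domain WHERE domain.domain = images.domain)"
)

# Step 2: store the number of images per domain on the lookup table
conn.execute(
    "UPDATE domain SET image_count = "
    "(SELECT count(*) FROM images WHERE images.domain_id = domain.id)"
)

counts = conn.execute(
    "SELECT domain, image_count FROM domain ORDER BY image_count DESC, domain"
).fetchall()
print(counts)
```

Note that the counts update depends on the lookup table and the `domain_id` column already existing, so the extract step has to come first.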

simonw commented 2 years ago

I broke the server by deleting files on it, and now I can't scp them back because the server won't start, so there is nothing to SSH into.

simonw commented 2 years ago

Need to deploy it fresh. Previously I used:

datasette publish fly \
  --app laion-aesthetic \
  --volume-name datasette \
  --extra-options "-i /data/data.db \
  --inspect-file /data/inspect.json \
  --setting sql_time_limit_ms 10000 \
  --setting suggest_facets 0 \
  --setting allow_download 0" -m metadata.yml

To get it running again I'll do this:

datasette publish fly \
  --app laion-aesthetic \
  --volume-name datasette

simonw commented 2 years ago

Then I'll scp up the laion-aesthetic-6pls.db file, then ssh in and run datasette inspect laion-aesthetic-6pls.db > /data/inspect.json, then run:

datasette publish fly \
  --app laion-aesthetic \
  --volume-name datasette \
  --install datasette-json-html \
  --extra-options "-i /data/laion-aesthetic-6pls.db --inspect-file /data/inspect.json --setting sql_time_limit_ms 10000 --setting suggest_facets 0 --setting allow_download 0" \
  -m metadata.yml

Where metadata.yml looks like this:

databases:
  laion-aesthetic-6pls:
    tables:
      domain:
        label_column: domain

simonw commented 2 years ago

It's on the server now:

sqlite-utils add-column laion-aesthetic-6pls.db domain image_counts integer
sqlite-utils laion-aesthetic-6pls.db '
with counts as (
  select domain_id, count(*) as c from images group by domain_id
)
update domain
  set image_counts = counts.c
  from counts
  where id = counts.domain_id
'
datasette inspect laion-aesthetic-6pls.db > inspect.json

simonw commented 2 years ago

Now running that deploy:

datasette publish fly \
  --app laion-aesthetic \
  --volume-name datasette \
  --install datasette-json-html \
  --extra-options "-i /data/laion-aesthetic-6pls.db --inspect-file /data/inspect.json --setting sql_time_limit_ms 10000 --setting suggest_facets 0 --setting allow_download 0" \
  -m metadata.yml

simonw commented 2 years ago

Actually needs this metadata:

databases:
  laion-aesthetic-6pls:
    tables:
      domain:
        label_column: domain
  laion-aesthetic-6pls-2:
    allow: false
  data:
    allow: false

simonw commented 2 years ago

It's up and running:

simonw commented 2 years ago

Next step: celebrity counts from: