Open kj-9 opened 1 month ago
separate sqlite file?
use attach
to other db file
or just simply gzip file for quick, temporal fix.
one run adds about 20MB for 1 min_temp, 2 max_temp upserts.
reduce file size:
time
granularity. drop starttime from pk. (add record only if valtime is different)separate files
attach
and output to different db files on each time
insert.for now, i go with compress db file
it is just temporal, but no data loss.
separate files
seems permanent solution, but requires poc.
it already exceeds limits: https://github.com/kj-9/jma-data/actions/runs/10867585038/job/30156515653
go with this:
- reduce time granularity. drop starttime from pk. (add record only if valtime is different)
run migrate.sh:
#!/bin/bash
set -eu -o pipefail
# constants
FILE_DB="data/jma.db"
splite() {
sqlite-utils --load-extension=spatialite "$FILE_DB" "$@"
}
splite '
create table if not exists dates (
date_id INTEGER primary key not null,
valid_date TEXT not null unique
);'
splite '
create table if not exists temperature (
date_id INTEGER not null,
point_id INTEGER not null,
min_temp INTEGER,
max_temp INTEGER,
primary key (date_id, point_id),
foreign key (date_id) references dates(date_id)
foreign key (point_id) references points(point_id)
);'
splite '
insert into dates (valid_date)
select distinct
-- first 8 characters of valid_time
substr(valid_time, 1, 8) as valid_date
from times
'
# for min_temp
splite '
insert into temperature (date_id, point_id, min_temp)
with valid_dates as (
select
max(time_id) as time_id, -- bigger is newer
substr(valid_time, 1, 8) as valid_date
from times
where exists (
select 1
from min_temp
where min_temp.time_id = times.time_id
)
group by 2
)
select
dates.date_id,
min_temp.point_id,
min_temp.min_temp
from min_temp
inner join valid_dates using (time_id)
left join dates using (valid_date)
'
# for max_temp
splite '
insert into temperature (date_id, point_id, max_temp)
with valid_dates as (
select
max(time_id) as time_id, -- bigger is newer
substr(valid_time, 1, 8) as valid_date
from times
where exists (
select 1
from max_temp
where max_temp.time_id = times.time_id
)
group by 2
)
select
dates.date_id,
max_temp.point_id,
max_temp.max_temp
from max_temp
inner join valid_dates using (time_id)
left join dates using (valid_date)
where true
ON CONFLICT(date_id, point_id) DO UPDATE SET max_temp=excluded.max_temp;
'
splite 'drop table times;'
splite "select DropTable('main', 'min_temp');"
splite "select DropTable('main', 'max_temp');"
sqlite-utils vacuum $FILE_DB
$ bash scripts/migrate.sh
[{"rows_affected": -1}]
[{"rows_affected": -1}]
[{"rows_affected": 9}]
[{"rows_affected": 2047248}]
[{"rows_affected": 2534688}]
[{"rows_affected": -1}]
[{"DropTable('main', 'min_temp')": 1}]
[{"DropTable('main', 'max_temp')": 1}]
before:
$ du -hs data/jma.db
322M data/jma.db
after:
$ du -hs data/jma.db
130M data/jma.db
but I will need to do this someday:
separate files
- use attach and output to different db files on each time insert.
trick to git diff db.gz file:
.git/gitconfig:
[diff "sqlite3.gz"]
binary = true
textconv = sh -c 'gunzip -c "$1" > /tmp/sqlite3_temp.db && sqlite3 /tmp/sqlite3_temp.db .dump' --
.gitattribute:
*.db.gz diff=sqlite3.gz
in nushull:
# 9/23 morning
$ git show c16e51088de8089f9715cd313cd70f2349164fda:data/jma.db.gz | bytes length | into filesize
6.8 MiB
# 9/24 morning
$ git show 5bb45899f43a1b4c74a42351e1f4963f608ecae7:data/jma.db.gz | bytes length | into filesize
7.1 MiB
+0.3MiB for one day.
meaning +1MiB for three day, +100MiB for 300 day
Need to deel with this after a year.
gha errors: