kj-9 / jma-data

Git-scraped Data from the Japanese Meteorological Agency (JMA)

sqlite file exceeds github limit #6

Open kj-9 opened 1 month ago

kj-9 commented 1 month ago

GHA errors:

[main ef768d9] Thu Sep 12 20:29:03 UTC 2024
 1 file changed, 0 insertions(+), 0 deletions(-)
Current branch main is up to date.
remote: error: Trace: 821919348c6e37c0ad0f3a5dfccf8cb7c6b69e1df04b35d2510ad4bd7793ae53
remote: error: See https://gh.io/lfs for more information.        
remote: error: File data/jma.db is 132.16 MB; this exceeds GitHub's file size limit of 100.00 MB        
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.        
To https://github.com/kj-9/jma-data
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/kj-9/jma-data'
Error: Process completed with exit code 1.
kj-9 commented 1 month ago

Separate the sqlite file? Use ATTACH to write to another db file.
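A minimal sketch of the ATTACH idea, assuming the `sqlite3` CLI; the file names (`jma_main.db`, `jma_2024.db`) and inserted values are hypothetical:

```shell
#!/bin/bash
set -eu

# work in a throwaway directory so this sketch is self-contained
cd "$(mktemp -d)"

# keep per-period temperature rows in a separate db file,
# so the main db stays under the size limit
sqlite3 jma_main.db <<'SQL'
ATTACH DATABASE 'jma_2024.db' AS y2024;

CREATE TABLE IF NOT EXISTS y2024.temperature (
    date_id  INTEGER NOT NULL,
    point_id INTEGER NOT NULL,
    min_temp INTEGER,
    max_temp INTEGER,
    PRIMARY KEY (date_id, point_id)
);

-- this insert lands in jma_2024.db, not jma_main.db
INSERT INTO y2024.temperature (date_id, point_id, min_temp)
VALUES (20240912, 100, -2);

DETACH DATABASE y2024;
SQL

sqlite3 jma_2024.db 'SELECT COUNT(*) FROM temperature;'   # prints 1
```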

kj-9 commented 1 month ago

Or just gzip the file as a quick, temporary fix.
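The gzip route is just a shell step before commit (sketch only; the table `t` here is a stand-in so the example is self-contained, and the CI wiring is not shown):

```shell
#!/bin/bash
set -eu

# throwaway directory with a dummy db standing in for data/jma.db
cd "$(mktemp -d)"
mkdir -p data
sqlite3 data/jma.db 'CREATE TABLE t(x); INSERT INTO t VALUES (1);'

# compress before committing; --keep leaves the original for local queries
gzip -9 --keep --force data/jma.db

# consumers decompress before querying
gunzip --keep --force data/jma.db.gz
```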

kj-9 commented 1 month ago

One run adds about 20 MB for 1 min_temp and 2 max_temp upserts.

kj-9 commented 1 month ago

Reduce file size:

  1. compress the db file
  2. reduce time granularity: drop starttime from the PK (add a record only if valid_time differs)

Separate files:

  1. use ATTACH and output to a different db file on each time insert.
kj-9 commented 1 month ago

For now, I'll go with compressing the db file: it's just temporary, but there's no data loss.

Separating files seems like the permanent solution, but it requires a PoC.

kj-9 commented 1 month ago

It already exceeds the limit: https://github.com/kj-9/jma-data/actions/runs/10867585038/job/30156515653

kj-9 commented 1 month ago

Go with this:

  1. reduce time granularity: drop starttime from the PK (add a record only if valid_time differs)
kj-9 commented 1 month ago

run migrate.sh:

#!/bin/bash
set -eu -o pipefail

# constants
FILE_DB="data/jma.db"

splite() {
    sqlite-utils --load-extension=spatialite "$FILE_DB" "$@"
}

splite '
create table if not exists dates (
    date_id INTEGER primary key not null,
    valid_date TEXT not null unique
);'

splite '
create table if not exists temperature (
    date_id INTEGER not null,
    point_id INTEGER not null,
    min_temp INTEGER,
    max_temp INTEGER,
    primary key (date_id, point_id),
    foreign key (date_id) references dates(date_id),
    foreign key (point_id) references points(point_id)
);'

splite '
insert into dates (valid_date)
select distinct
    -- first 8 characters of valid_time
    substr(valid_time, 1, 8) as valid_date
from times
'

# for min_temp
splite '
insert into temperature (date_id, point_id, min_temp)
with valid_dates as (
select
  max(time_id) as time_id, -- bigger is newer
  substr(valid_time, 1, 8) as valid_date
from times
where exists (
    select 1
    from min_temp
    where min_temp.time_id = times.time_id
)
group by 2
)

select
    dates.date_id,
    min_temp.point_id,
    min_temp.min_temp
from min_temp
  inner join valid_dates using (time_id)
  left join dates using (valid_date)
'

# for max_temp
splite '
insert into temperature (date_id, point_id, max_temp)
with valid_dates as (
select
  max(time_id) as time_id, -- bigger is newer
  substr(valid_time, 1, 8) as valid_date
from times
where exists (
    select 1
    from max_temp
    where max_temp.time_id = times.time_id
)
group by 2
)

select
    dates.date_id,
    max_temp.point_id,
    max_temp.max_temp
from max_temp
  inner join valid_dates using (time_id)
  left join dates using (valid_date)
where true
ON CONFLICT(date_id, point_id) DO UPDATE SET max_temp=excluded.max_temp;
'

splite 'drop table times;'
splite "select DropTable('main', 'min_temp');"
splite "select DropTable('main', 'max_temp');"

sqlite-utils vacuum "$FILE_DB"
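After a migration like this I'd run a quick orphan check (sketch against a throwaway db with the same `dates`/`temperature` schema; in practice you would point it at `data/jma.db`):

```shell
#!/bin/bash
set -eu

# throwaway db standing in for data/jma.db
FILE_DB="$(mktemp -d)/check.db"

sqlite3 "$FILE_DB" <<'SQL'
CREATE TABLE dates (
    date_id INTEGER PRIMARY KEY NOT NULL,
    valid_date TEXT NOT NULL UNIQUE
);
CREATE TABLE temperature (
    date_id INTEGER NOT NULL,
    point_id INTEGER NOT NULL,
    min_temp INTEGER,
    max_temp INTEGER,
    PRIMARY KEY (date_id, point_id)
);
INSERT INTO dates (valid_date) VALUES ('20240912');
INSERT INTO temperature (date_id, point_id, min_temp) VALUES (1, 100, 15);
SQL

# every temperature row should resolve to a date; non-zero means a bad join
sqlite3 "$FILE_DB" '
SELECT COUNT(*) FROM temperature
LEFT JOIN dates USING (date_id)
WHERE dates.date_id IS NULL;'   # prints 0
```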
kj-9 commented 1 month ago

$ bash scripts/migrate.sh
[{"rows_affected": -1}]
[{"rows_affected": -1}]
[{"rows_affected": 9}]
[{"rows_affected": 2047248}]
[{"rows_affected": 2534688}]
[{"rows_affected": -1}]
[{"DropTable('main', 'min_temp')": 1}]
[{"DropTable('main', 'max_temp')": 1}]

before:

$ du -hs data/jma.db
322M    data/jma.db

after:

$ du -hs data/jma.db      
130M    data/jma.db
kj-9 commented 1 month ago

Worked.

But I will need to do this someday:

Separate files:

  1. use ATTACH and output to a different db file on each time insert.
kj-9 commented 1 month ago

Trick to `git diff` the db.gz file:

.git/config:

[diff "sqlite3.gz"]
    binary = true
    textconv = sh -c 'gunzip -c "$1" > /tmp/sqlite3_temp.db && sqlite3 /tmp/sqlite3_temp.db .dump' --

.gitattributes:

*.db.gz diff=sqlite3.gz
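An end-to-end check of the trick in a throwaway repo (sketch; here the driver is set via `git config` instead of editing `.git/config` by hand, and the db contents are dummies):

```shell
#!/bin/bash
set -eu

cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name you

# same driver as above: decompress, then dump the db as text
git config diff.sqlite3.gz.binary true
git config diff.sqlite3.gz.textconv \
  'sh -c '\''gunzip -c "$1" > /tmp/sqlite3_temp.db && sqlite3 /tmp/sqlite3_temp.db .dump'\'' --'
echo '*.db.gz diff=sqlite3.gz' > .gitattributes

sqlite3 jma.db 'CREATE TABLE t(x); INSERT INTO t VALUES (1);'
gzip --keep --force jma.db
git add -A && git commit -qm 'first snapshot'

sqlite3 jma.db 'INSERT INTO t VALUES (2);'
gzip --keep --force jma.db
git add -A && git commit -qm 'second snapshot'

# shows "+INSERT INTO t VALUES(2);" as text, not "Binary files differ"
git diff HEAD~1 -- jma.db.gz
```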
kj-9 commented 1 month ago

In nushell:

# 9/23 morning
$ git show c16e51088de8089f9715cd313cd70f2349164fda:data/jma.db.gz | bytes length | into filesize
6.8 MiB

# 9/24 morning
$ git show 5bb45899f43a1b4c74a42351e1f4963f608ecae7:data/jma.db.gz | bytes length | into filesize
7.1 MiB

+0.3 MiB for one day,

meaning +1 MiB every three days, and +100 MiB in about 300 days.

Need to deal with this within a year.
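Back-of-envelope for when the compressed file hits the limit, assuming the ~0.3 MiB/day growth measured above (and treating the 100 MB limit as ~100 MiB):

```shell
# ~7.1 MiB now, +0.3 MiB/day, ~100 MiB limit => roughly 310 days of headroom
awk 'BEGIN { printf "%.0f\n", (100 - 7.1) / 0.3 }'   # prints 310
```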