kj-9 / jma-data

Git-scraped Data from the Japanese Meteorological Agency (JMA)
0 stars 0 forks source link

geojson is too large (migrate to spatialite) #3

Closed kj-9 closed 2 months ago

kj-9 commented 2 months ago

not compressed:

$ ll data/20240908
total 27296
drwxr-xr-x  4 kh03  staff   128B Sep  8 10:52 .
drwxr-xr-x  4 kh03  staff   128B Sep  8 10:52 ..
-rw-r--r--  1 kh03  staff   6.7M Sep  8 10:52 max_temp_point.geojson
-rw-r--r--  1 kh03  staff   6.7M Sep  8 10:52 min_temp_point.geojson

gziped:

 $ ll data/20240907200000
total 696
drwxr-xr-x  5 kh03  staff   160B Sep  8 10:47 .
drwxr-xr-x  4 kh03  staff   128B Sep  8 10:52 ..
-rw-r--r--  1 kh03  staff   114K Sep  8 10:47 max_temp_point-20240908090000.geojson.gz
-rw-r--r--  1 kh03  staff   115K Sep  8 10:47 max_temp_point-20240909090000.geojson.gz
-rw-r--r--  1 kh03  staff   116K Sep  8 10:47 min_temp_point-20240909000000.geojson.gz

even if gziped, each scrape gets 300K

kj-9 commented 2 months ago

there is a lot of redundancy in geojson:

jq .  min_temp_point-20240909000000.geojson | head -n 15
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          141.406,
          45.875
        ]
      },
      "properties": {
        "value": "20"
      } 

where most infomation in geojson is just a Point of XY coordinate.

relational tables using sqlite might be good option to reduce redundancy.

seems utilizing foreign key reduces storage in sqlite: https://stackoverflow.com/questions/18159166/does-the-foreign-key-in-sqlite-store-the-whole-value-or-just-the-pointer-to-the

kj-9 commented 2 months ago

better sqlite diff: https://garrit.xyz/posts/2023-11-01-tracking-sqlite-database-changes-in-git

kj-9 commented 2 months ago

update sqlite but no actual update on geojson:

$ bash scripts/get-geojson.sh
$ find data/tmp  -type f -name "*.geojson"  |  xargs -I {} bash scripts/load-geojson.sh {}

then take diff:

$ git diff                   
diff --git a/data/jma.db b/data/jma.db
index 358068e..1b6df13 100644
--- a/data/jma.db
+++ b/data/jma.db
@@ -6602,6 +6602,8 @@ INSERT INTO spatialite_history VALUES(25,'max_temp_ingest','geometry','Geometry
 INSERT INTO spatialite_history VALUES(26,'max_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T15:48:39.815Z','3.46.1','5.1.0');
 INSERT INTO spatialite_history VALUES(27,'min_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T15:48:42.337Z','3.46.1','5.1.0');
 INSERT INTO spatialite_history VALUES(28,'max_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T15:48:44.846Z','3.46.1','5.1.0');
+INSERT INTO spatialite_history VALUES(29,'min_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T16:16:10.632Z','3.46.1','5.1.0');
+INSERT INTO spatialite_history VALUES(30,'max_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T16:16:13.185Z','3.46.1','5.1.0');
 CREATE TABLE geometry_columns (
 f_table_name TEXT NOT NULL,
 f_geometry_column TEXT NOT NULL,
@@ -565770,7 +565772,7 @@ INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','Spatia
 INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','ElementaryGeometries','ElementaryGeometries',0,'CREATE VIRTUAL TABLE ElementaryGeometries USING VirtualElementary()');
 INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','KNN2','KNN2',0,'CREATE VIRTUAL TABLE KNN2 USING VirtualKNN2()');
 DELETE FROM sqlite_sequence;
-INSERT INTO sqlite_sequence VALUES('spatialite_history',28);
+INSERT INTO sqlite_sequence VALUES('spatialite_history',30);
 INSERT INTO sqlite_sequence VALUES('data_licenses',9);
 CREATE UNIQUE INDEX idx_spatial_ref_sys 
 ON spatial_ref_sys (auth_srid, auth_name);

still shows spatilite_hisotry diff. seems like that will updated anyway. can i turn off this feature to not apear diff in this case?

kj-9 commented 2 months ago

can i know paticular change is worth commit or not by e.g. counting diff of lines

kj-9 commented 2 months ago

came up idea to accumulate rows_affected returned from sqlite-utils call then check if there is any rows upserted when git commit.

if no rows affected, exclude db file to commit.

kj-9 commented 2 months ago

now i need to fix gha

kj-9 commented 2 months ago

it takes 1.5 min to run in gha:

eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
brew install spatialite-tools
pip install sqlite-utils

may be it is okay to just run:

apt-get install -y --no-install-recommends libsqlite3-mod-spatialite

refs: https://github.com/simonw/datasette/blob/832f76ce26ffb2f3e27a006ff90254374bd90e61/Dockerfile#L8