Closed kj-9 closed 2 months ago
there is a lot of redundancy in geojson:
jq . min_temp_point-20240909000000.geojson | head -n 15
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [
141.406,
45.875
]
},
"properties": {
"value": "20"
}
where most infomation in geojson is just a Point of XY coordinate.
relational tables using sqlite might be good option to reduce redundancy.
seems utilizing foreign key reduces storage in sqlite: https://stackoverflow.com/questions/18159166/does-the-foreign-key-in-sqlite-store-the-whole-value-or-just-the-pointer-to-the
better sqlite diff: https://garrit.xyz/posts/2023-11-01-tracking-sqlite-database-changes-in-git
update sqlite but no actual update on geojson:
$ bash scripts/get-geojson.sh
$ find data/tmp -type f -name "*.geojson" | xargs -I {} bash scripts/load-geojson.sh {}
then take diff:
$ git diff
diff --git a/data/jma.db b/data/jma.db
index 358068e..1b6df13 100644
--- a/data/jma.db
+++ b/data/jma.db
@@ -6602,6 +6602,8 @@ INSERT INTO spatialite_history VALUES(25,'max_temp_ingest','geometry','Geometry
INSERT INTO spatialite_history VALUES(26,'max_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T15:48:39.815Z','3.46.1','5.1.0');
INSERT INTO spatialite_history VALUES(27,'min_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T15:48:42.337Z','3.46.1','5.1.0');
INSERT INTO spatialite_history VALUES(28,'max_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T15:48:44.846Z','3.46.1','5.1.0');
+INSERT INTO spatialite_history VALUES(29,'min_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T16:16:10.632Z','3.46.1','5.1.0');
+INSERT INTO spatialite_history VALUES(30,'max_temp_ingest','geometry','Geometry [POINT,XY,SRID=4326] successfully created','2024-09-09T16:16:13.185Z','3.46.1','5.1.0');
CREATE TABLE geometry_columns (
f_table_name TEXT NOT NULL,
f_geometry_column TEXT NOT NULL,
@@ -565770,7 +565772,7 @@ INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','Spatia
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','ElementaryGeometries','ElementaryGeometries',0,'CREATE VIRTUAL TABLE ElementaryGeometries USING VirtualElementary()');
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','KNN2','KNN2',0,'CREATE VIRTUAL TABLE KNN2 USING VirtualKNN2()');
DELETE FROM sqlite_sequence;
-INSERT INTO sqlite_sequence VALUES('spatialite_history',28);
+INSERT INTO sqlite_sequence VALUES('spatialite_history',30);
INSERT INTO sqlite_sequence VALUES('data_licenses',9);
CREATE UNIQUE INDEX idx_spatial_ref_sys
ON spatial_ref_sys (auth_srid, auth_name);
still shows spatilite_hisotry
diff. seems like that will updated anyway.
can i turn off this feature to not apear diff in this case?
can i know paticular change is worth commit or not by e.g. counting diff of lines
came up idea to accumulate rows_affected
returned from sqlite-utils call then check if there is any rows upserted when git commit.
if no rows affected, exclude db file to commit.
now i need to fix gha
it takes 1.5 min to run in gha:
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
brew install spatialite-tools
pip install sqlite-utils
may be it is okay to just run:
apt-get install -y --no-install-recommends libsqlite3-mod-spatialite
refs: https://github.com/simonw/datasette/blob/832f76ce26ffb2f3e27a006ff90254374bd90e61/Dockerfile#L8
not compressed:
gziped:
even if gziped, each scrape gets 300K