Closed tschmetzer closed 3 years ago
Can you try running vacuum on that database? https://www.sqlite.org/lang_vacuum.html
Thanks for pointing at that interesting tool coming along with the information that the DB can get fragmented. Didn't help much though as the vacuumed file with integer values (SMOD_POP_KG_DNI500_melilla-points_int_vacuumed.gpkg) is sill larger than original one with double values (SMOD_POP_KG_DNI500_melilla-points.gpkg) even when vacuumed (SMOD_POP_KG_DNI500_melilla-points_vacuumed.gpkg):
Can you do a bit more analysis? Basically the first step is to make sure it's QGIS (and not SQLITE) being the problem here.
Interesting would be
I cannot replicate the issue using QGIS 3.16.11 old (v1/EXE) OSGeo4W (SQLite 3.29.0) on Windows 7 64 bit. The following steps leads to the same exact file size (as expected):
I cannot replicate the issue also using QGIS 3.20.3 new (v2/MSI) OSGeo4W (SQLite 3.35.2) on Windows 10 64 bit.
Referring to my previous statement:
Thank you @agiudiceandrea for your replication efforts! I don't know why but today I cannot replicate this either. I had used the sqlite3 binary to vacuum the file SMOD_POP_KG_DNI500_melilla-points_int.gpkg but today it results in 760KB as well as when using the GeoPackage connection in QGIS Browser to vacuum the file.
What I did achieve on two independant machines (QGIS 3.20.1 on both) was to blow the file to 776KB (794.624 Bytes) when manually reversing DNI_2 to DNI_1. Meaning:
Though my intention was actually not to proof an error in reversability but this is probably something that should be looked at as well. I tried to do some investigation as proposed by @m-kuhn using https://www.sqlite.org/dbstat.html
The following tables marked manually with (*) have had changes in data:
sqlite> .table
gpkg_contents
gpkg_extensions
gpkg_geometry_columns
gpkg_ogr_contents
gpkg_spatial_ref_sys
gpkg_tile_matrix
gpkg_tile_matrix_set
rtree_smod-pop-dni500_geom
*rtree_smod-pop-dni500_geom_node
*rtree_smod-pop-dni500_geom_parent
*rtree_smod-pop-dni500_geom_rowid
smod-pop-dni500
SMOD_POP_KG_DNI500_melilla-points_int.gpkg after reversing DNI_2 to DNI_1:
sqlite> .databases
main: J:\SBC-GIS\melilla-int-bug\SMOD_POP_KG_DNI500_melilla-points_int.gpkg r/w
sqlite> select pgsize,payload,unused,sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
4096|0|3904|93.5595703125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
81920|30534|5276|93.5595703125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_parent";
4096|940|2142|47.705078125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_node";
299008|264880|31768|89.3755351027397
original SMOD_POP_KG_DNI500_melilla-points.gpkg:
sqlite> .databases
main: J:\SBC-GIS\melilla-int-bug\SMOD_POP_KG_DNI500_melilla-points.gpkg r/w
sqlite> select pgsize,payload,unused,sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
4096|0|3904|91.2548828125
sqlite> select sum(pgsize),payload,unused,sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
81920|0|3904|91.2548828125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
81920|28646|7164|91.2548828125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_parent";
4096|889|2315|43.4814453125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_node";
282624|250096|30387|89.2482591711957
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="smod-pop-dni500";
327680|274555|5935|98.1887817382813
Unfortunately I am not proficient enough with databases and even less with sqlite to dig down and compare the datasets that cause the trouble. Any hints how to dig deeper?
Expanding and returning to my initial topic: Even if file sizes don't grow after converting double->int
and subsequent vacuuming I still don't get why using smaller field types like int (4 bytes) result in the same file sizes as doubles (8 bytes).
Even if file sizes don't grow after converting
double->int
and subsequent vacuuming I still don't get why using smaller field types like int (4 bytes) result in the same file sizes as doubles (8 bytes).
Datatypes in SQLite are handled dynamically and they differ from common C++ static types: "In SQLite, the datatype of a value is associated with the value itself, not with its container." (See https://www.sqlite.org/datatype3.html)
AFAIK, this means that the numeric integer value 6500 will be stored as a 2-bytes integer ("a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value") in both an INTEGER type and a REAL type column. But the numeric floating point value 6500.123 will be stored as a 8-byte IEEE floating point ("a floating point value, stored as an 8-byte IEEE floating point number") in both a REAL type and a INTEGER type column. Consider also that "As an internal optimization, small floating point values with no fractional component and stored in columns with REAL affinity are written to disk as integers in order to take up less space and are automatically converted back into floating point as the value is read out. This optimization is completely invisible at the SQL level and can only be detected by examining the raw bits of the database file."
Thank you so much @agiudiceandrea for pointing to https://www.sqlite.org/datatype3.html! That explains my main question.
What a pity I couldn't replicate the original issue of a double->int converted and vacuumed DB file being larger than the original one so far.
The file size increase after a conversion double->int and then reverting int->double is still unclear and could reveal some bad code.
The file size increase after a conversion double->int and then reverting int->double is still unclear and could reveal some bad code.
yes, I can also reproduce this. The compacted size goes from 778240 to 794624. Looking at the output of "echo .dump | sqlite3 the.db > the.db.txt" before original and the one after the 2 conversions, one can see that the order into which tables are stored in the file is not the same, and some differences in the spatial index. However at high level, the spatial index seems to be OK. I guess one must accept this as side effects of the compaction mechanism in SQLite3.
Closing as not a bug
What is the bug or the crash?
Common C++ types use 4 bytes for integer and 8 bytes for double. Though expecting the file sizes to decrease after converting a double field to integer and removing the double field the converted file grows even larger. The used file type is .gpkg. Haven't tried with any other file type.
File sizes in File manager:
Steps to reproduce the issue
When trying to reproduce this a second time I got a different but still larger file size of 860 kB. Yet the original resulting "corrupted" file SMOD_POP_KG_DNI500_melilla-points_int.gpkg (1.208kB) is contained in the archive. SMOD_POP_KG_DNI500_melilla_points.zip
Feature count and contents look exactly the same
Versions
QGIS version 3.20.2-Odense QGIS code revision 9f59a156 Qt version 5.15.2 Python version 3.9.5 GDAL/OGR version 3.3.1 PROJ version 8.1.0 EPSG Registry database version v10.027 (2021-06-17) GEOS version 3.9.1-CAPI-1.14.2 SQLite version 3.35.2 PDAL version 2.3.0 PostgreSQL client version 13.0 SpatiaLite version 5.0.1 QWT version 6.1.3 QScintilla2 version 2.11.5 OS version Windows 10 Version 2009
Active Python plugins GroupStats QuickOSM db_manager MetaSearch processing
Supported QGIS version
New profile
Additional context
No response