kartoza / geocris-inasafe-fba

Next generation of InaSAFE that includes support for forecast based action
MIT License
1 stars 1 forks source link

Table size investigation #33

Closed lucernae closed 4 years ago

lucernae commented 4 years ago

Question

Why central-america.osm.pbf data ends up around 25 GB after running vulnerability calculations?

lucernae commented 4 years ago

Finding table size

Query used:

SELECT
    relname AS "relation",
    pg_size_pretty(
        pg_relation_size(C .oid)
        ) as "rel_size",
    pg_size_pretty (
        pg_total_relation_size (C .oid)
    ) AS "total_size"
FROM
    pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C .relnamespace)
WHERE
    nspname NOT IN (
        'pg_catalog',
        'information_schema'
    )
AND C .relkind <> 'i'
AND nspname !~ '^pg_toast'
ORDER BY
    pg_total_relation_size (C .oid) DESC
LIMIT 5;

It should return top 5 tables with the most space. osm_buildings have 22GB

Size breakdown in osm_buildings

Query used:

SELECT l.metric, l.nr AS "bytes/ct"
     , CASE WHEN is_size THEN pg_size_pretty(nr) END AS bytes_pretty
     , CASE WHEN is_size THEN nr / NULLIF(x.ct, 0) END AS bytes_per_row
FROM  (
   SELECT min(tableoid)        AS tbl      -- = 'public.tbl'::regclass::oid
        , count(*)             AS ct
        , sum(length(t::text)) AS txt_len  -- length in characters
   FROM   public.osm_buildings t                     -- provide table name *once*
   ) x
 , LATERAL (
   VALUES
      (true , 'core_relation_size'               , pg_relation_size(tbl))
    , (true , 'visibility_map'                   , pg_relation_size(tbl, 'vm'))
    , (true , 'free_space_map'                   , pg_relation_size(tbl, 'fsm'))
    , (true , 'table_size_incl_toast'            , pg_table_size(tbl))
    , (true , 'indexes_size'                     , pg_indexes_size(tbl))
    , (true , 'total_size_incl_toast_and_indexes', pg_total_relation_size(tbl))
    , (true , 'live_rows_in_text_representation' , txt_len)
    , (false, '------------------------------'   , NULL)
    , (false, 'row_count'                        , ct)
    , (false, 'live_tuples'                      , pg_stat_get_live_tuples(tbl))
    , (false, 'dead_tuples'                      , pg_stat_get_dead_tuples(tbl))
   ) l(is_size, metric, nr);

Result:

metric bytes/ct bytes_pretty bytes_per_row
core_relation_size 17544536064 16 GB 3094
visibility_map 540672 528 kB 0
free_space_map 4333568 4232 kB 0
table_size_incl_toast 17549819904 16 GB 3095
indexes_size 5588377600 5329 MB 985
total_size_incl_toast_and_indexes 23138197504 22 GB 4080
live_rows_in_text_representation 2193423403 2092 MB 386
------------------------------ NULL NULL NULL
row_count 5669884 NULL NULL
live_tuples 56898224 NULL NULL
dead_tuples 0 NULL NULL

in csv:

metric,bytes/ct,bytes_pretty,bytes_per_row
core_relation_size,17544536064,16 GB,3094
visibility_map,540672,528 kB,0
free_space_map,4333568,4232 kB,0
table_size_incl_toast,17549819904,16 GB,3095
indexes_size,5588377600,5329 MB,985
total_size_incl_toast_and_indexes,23138197504,22 GB,4080
live_rows_in_text_representation,2193423403,2092 MB,386
------------------------------,,,
row_count,5669884,,
live_tuples,56898224,,
dead_tuples,0,,

Size breakdown in a (randomly selected) row

Query used for total size in row (using certain row id):

SELECT sum(pg_column_size(t.*)) as filesize FROM (select * from osm_buildings where id = 4352) as t;
Result: filesize
8416

Query used to get all columns name:

select column_name from information_schema.columns where table_name = 'osm_buildings'
Result: column_name
id
osm_id
name
leisure
height
building:levels
building:height
building:min_level
roof:height
roof:material
building:material
use
religion
type
amenity
landuse
geometry
building_type
building_type_score
building_area
building_area_score
building_material_score
building_road_length
building_road_density_score
total_vulnerability
village_id
sub_district_id
district_id
building_road_density
building_id

Query used to get all column size (for a certain row id):

select
pg_column_size(id) as id,
pg_column_size(osm_id) as osm_id,
pg_column_size(name) as name,
pg_column_size(leisure) as leisure,
pg_column_size(height) as height,
pg_column_size("building:levels") as "building:levels",
pg_column_size("building:height") as "building:height",
pg_column_size("building:min_level") as "building:min_level",
pg_column_size("roof:height") as "roof:height",
pg_column_size("roof:material") as "roof:material",
pg_column_size("building:material") as "building:material",
pg_column_size(use) as use,
pg_column_size(religion) as religion,
pg_column_size(type) as type,
pg_column_size(amenity) as amenity,
pg_column_size(landuse) as landuse,
pg_column_size(geometry) as geometry,
pg_column_size(building_type) as building_type,
pg_column_size(building_type_score) as building_type_score,
pg_column_size(building_area) as building_area,
pg_column_size(building_area_score) as building_area_score,
pg_column_size(building_material_score) as building_material_score,
pg_column_size(building_road_length) as building_road_length,
pg_column_size(building_road_density_score) as building_road_density_score,
pg_column_size(total_vulnerability) as total_vulnerability,
pg_column_size(village_id) as village_id,
pg_column_size(sub_district_id) as sub_district_id,
pg_column_size(district_id) as district_id,
pg_column_size(building_road_density) as building_road_density,
pg_column_size(building_id) as building_id
from osm_buildings where id = 4352;
Result: id osm_id name leisure height building:levels building:height building:min_level roof:height roof:material building:material use religion type amenity landuse geometry building_type building_type_score building_area building_area_score building_material_score building_road_length building_road_density_score total_vulnerability village_id sub_district_id district_id building_road_density building_id
4 8 17 1 NULL 1 NULL NULL NULL 1 1 1 1 12 12 12 8244 12 8 8 8 8 NULL NULL 8 NULL NULL NULL NULL NULL

In this particular case, we know that the bulk size comes from geometry. How much of it in total?

Query total geometry size:

select
       total_geometry_size, total_row_size - total_geometry_size as total_non_geometry_size, total_row_size,
       pg_size_pretty(total_geometry_size) as pretty_total_geometry_size,
       pg_size_pretty(total_row_size - total_geometry_size) as pretty_total_non_geometry_size,
       pg_size_pretty(total_row_size) as pretty_total_row_size
from
(select sum(pg_column_size(t.*)) as total_row_size, sum(pg_column_size(geometry)) as total_geometry_size from osm_buildings t) as a;

Result:

total_geometry_size total_non_geometry_size total_row_size pretty_total_geometry_size pretty_total_non_geometry_size pretty_total_row_size
773983117 708148139 1482131256 738 MB 675 MB 1413 MB
lucernae commented 4 years ago

The total of only column sizes does not match with total row size if we use pg_relation_size. There must be some columns or indices that I missed. I don't know where else to look

NyakudyaA commented 4 years ago

@lucernae Where is the sql for the vulnerability calculations. I can try to run on my local machie and see if the problem persist

lucernae commented 4 years ago

@NyakudyaA we can run it, it's just that the size blew up after vulnerability calculations

Fresh dump then restore the same db will output this:

Query used:

SELECT
    relname AS "relation",
    pg_size_pretty(
        pg_relation_size(C .oid)
        ) as "rel_size",
    pg_size_pretty (
        pg_total_relation_size (C .oid)
    ) AS "total_size"
FROM
    pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C .relnamespace)
WHERE
    nspname NOT IN (
        'pg_catalog',
        'information_schema'
    )
AND C .relkind <> 'i'
AND nspname !~ '^pg_toast'
ORDER BY
    pg_total_relation_size (C .oid) DESC
LIMIT 5;

Result:

    relation     | rel_size | total_size 
-----------------+----------+------------
 osm_buildings   | 1491 MB  | 1995 MB
 osm_roads       | 448 MB   | 591 MB
 osm_admin       | 10 MB    | 72 MB
 osm_waterways   | 54 MB    | 69 MB
 spatial_ref_sys | 4584 kB  | 4808 kB
(5 rows)

After vulnerability calculations, same query, same data (the vulnerability will have the same output as before): osm_buildings beocmes 22GB.

So, total_row_size is accurate after all, I just don't know the extra size is for what?

lucernae commented 4 years ago

I need to write some procedure on how to reproduce the problem first @NyakudyaA

lucernae commented 4 years ago

Concluded that the cause is mostly pg_wal directory. When we backup and restore the same data, the overall size becomes much smaller.

NyakudyaA commented 4 years ago

Concluded that the cause is mostly pg_wal directory. When we backup and restore the same data, the overall size becomes much smaller.

This is caused by the env variable --wal-segsize=${WAL_SEGSIZE} It determines how much detail each wal log contains. You also need to tune PG_WAL_KEEP_SEGMENTS to a low value for the wal directory to not store much data