minus34 / gnaf-loader

A quick way to get started with Geoscape's open GNAF & Admin Boundaries
Apache License 2.0
182 stars 66 forks source link

Lots of Shapefile SQL Errors #58

Open esb opened 3 years ago

esb commented 3 years ago

I've been trying to load this datbase today, but I'm getting odd errors that I'm not sure about.

This is running on Centos 7 with Python 3.8.6 with Psycopg2 2.8.4 and Postgres 13.4 on x86_64-pc-linux-gnu and PostGIS 3.1.3 (with GEOS 3.9.1-CAPI-1.14.2).

I've downloaded the Geoscape Admin Boundaries - ESRI Shapefile - GDA2020(ZIP) file, which I'm hoping is the correct one.

The first error is

Importing ACT_TOWN_POINT_shp.shp - Couldn't run Shapefile SQL
shp2pgsql result was: column "geom" of relation "aus_town_point" does not exist

This is true. The "geom" column does not exist in the table.

Then we launch into an awful lot of these errors.

root        : WARNING   Importing WA_TOWN_POINT_shp.dbf - Couldn't run Shapefile SQL
shp2pgsql result was: current transaction is aborted, commands ignored until end of transaction block

Is this because of that first error?

minus34 commented 3 years ago

The (now updated) readme was missing a note to download the GDA94 Shapefile version of the Admin Bdys (not the GDA2020 ones)!

Please try again with the GDA94 files

esb commented 3 years ago

I downloaded the GDA94 files, but the same errors are coming up.

minus34 commented 3 years ago

Odd error! A couple of things to check:

  1. The PostGIS extension is enabled on your database (using the following SQL - CREATE EXTENSION postgis); and
  2. The Postgres bin folder is in your $PATH (run shp2pgsql from the command line to confirm)
esb commented 3 years ago

Yep. Both of those things have been done.

minus34 commented 3 years ago

One more thing to double check - have you got the latest code? The August 2021 release PR was this morning, Sydney time, and there have been a couple of significant changes to the data (thanks Geoscape!)

Other than that - can you please share the full log with me and also the command line you're using to run it

esb commented 3 years ago

Ok, my original hunch was right. The mass of errors is due to a locked transaction from the first failure.

I added the "-e" flag to the shp2pgsql command to treat each command individually, and not as part of a transaction.

After this was done most of the imports went smoothly, except for the one error for each state because of a missing column.

09/07/2021 07:04:49 PM Part 3 of 6 : Start raw admin boundary load : 2021-09-07 19:04:49.802582
09/07/2021 07:04:58 PM  Importing ACT_TOWN_POINT_shp.shp - Couldn't run Shapefile SQL
shp2pgsql result was: column "geom" of relation "aus_town_point" does not exist
LINE 3: ..."town_point","date_creat","date_retir","town_pid",geom) VALU...                                                             ^
09/07/2021 07:08:32 PM  Importing NSW_TOWN_POINT_shp.shp - Couldn't run Shapefile SQL
shp2pgsql result was: column "geom" of relation "aus_town_point" does not exist
LINE 3: ..."town_point","date_creat","date_retir","town_pid",geom) VALU...
09/07/2021 07:08:56 PM  Importing NT_TOWN_POINT_shp.shp - Couldn't run Shapefile SQL
shp2pgsql result was: column "geom" of relation "aus_town_point" does not exist
LINE 3: ..."town_point","date_creat","date_retir","town_pid",geom) VALU...
09/07/2021 07:10:46 PM  Importing QLD_TOWN_POINT_shp.shp - Couldn't run Shapefile SQL
shp2pgsql result was: column "geom" of relation "aus_town_point" does not exist
LINE 3: ..."town_point","date_creat","date_retir","town_pid",geom) VALU...
09/07/2021 07:11:44 PM  Importing SA_TOWN_POINT_shp.shp - Couldn't run Shapefile SQL
shp2pgsql result was: column "geom" of relation "aus_town_point" does not exist
LINE 3: ..."town_point","date_creat","date_retir","town_pid",geom) VALU...
09/07/2021 07:13:09 PM  Importing TAS_TOWN_POINT_shp.shp - Couldn't run Shapefile SQL
shp2pgsql result was: column "geom" of relation "aus_town_point" does not exist
LINE 3: ..."town_point","date_creat","date_retir","town_pid",geom) VALU...
09/07/2021 07:15:10 PM  Importing VIC_TOWN_POINT_shp.shp - Couldn't run Shapefile SQL
shp2pgsql result was: column "geom" of relation "aus_town_point" does not exist
LINE 3: ..."town_point","date_creat","date_retir","town_pid",geom) VALU...
09/07/2021 07:16:32 PM  Importing WA_TOWN_POINT_shp.shp - Couldn't run Shapefile SQL
shp2pgsql result was: column "geom" of relation "aus_town_point" does not exist
LINE 3: ..."town_point","date_creat","date_retir","town_pid",geom) VALU...
minus34 commented 3 years ago

Interesting - the code should import SHP files of the same layer (e.g town points) sequentially, so PG shouldn't be locking the table. I'll go with it being a knock-on effect from the 1st failure... but could be the code isn't doing that...

A couple of more avenues to investigate re: the initial error:

  1. When a shapefile load fails at the psql stage (like the above error appears to have) - it should output a .sql file in the same folder as the code - can you attach it to this issue
  2. Does this table exist in Postgres: admin_bdys_202108.aus_town_point and does it have the geom column?
esb commented 3 years ago

The error file for error_debug_ACT_TOWN_POINT_shp.shp.sql is

SET CLIENT_ENCODING TO UTF8;
SET STANDARD_CONFORMING_STRINGS TO ON;
INSERT INTO "raw_admin_bdys_202108"."aus_town_point" ("town_point","date_creat","date_retir","town_pid",geom) VALUES ('TWP8373','20130731',NULL,'TWN8373','0101000020BB1000005DDB641F27A2624072956B2D33C141C0');
INSERT INTO "raw_admin_bdys_202108"."aus_town_point" ("town_point","date_creat","date_retir","town_pid",geom) VALUES ('TWP8374','20130731',NULL,'TWN8374','0101000020BB100000E707B0F32FA262407145A1E6BD9541C0');
INSERT INTO "raw_admin_bdys_202108"."aus_town_point" ("town_point","date_creat","date_retir","town_pid",geom) VALUES ('TWP8375','20130731',NULL,'TWN8375','0101000020BB1000009879612129A4624083647620CFA341C0');
ANALYZE "raw_admin_bdys_202108"."aus_town_point";

The table exists, but is missing the geom column.

minus34 commented 3 years ago

Ok, so shp2pgsql worked, but the table missing the geom field is very strange...

I'm short of ideas right now as I've run this multiple times over the last few days on 2 Macs without that issue; and it hasn't come up in the last 3-4 years. I can't see why CentOS would be any different. Linux != Unix, but close enough...

A corrupt PostGIS install maybe?! Can you double check this SQL works in your DB:

select st_makepoint(151.0, -33.0, 4283);

esb commented 3 years ago

Yep. That SQL works fine.

Here's the dump of the table schema.

CREATE TABLE IF NOT EXISTS raw_admin_bdys_202108.aus_town_point
(
    gid integer NOT NULL DEFAULT nextval('raw_admin_bdys_202108.aus_town_point_gid_seq'::regclass),
    town_point character varying(15) COLLATE pg_catalog."default",
    date_creat date,
    date_retir date,
    town_pid character varying(15) COLLATE pg_catalog."default",
    CONSTRAINT aus_town_point_pkey PRIMARY KEY (gid)
)

I'm not sure I understand where the table schemas come from.

minus34 commented 3 years ago

The table is created by the first state to be loaded for each layer. So ACT_TOWN_POINT_shp.shp should create the table aus_town_point with that file's structure. The -dflag in shp2pgsql forces a create table statement to be output with the SQL.

Another test! - uncomment line 379 in load-gnaf.py ([print(table) for table in create_list]) and send through the full list of dicts. A working thought is that the code is behaving differently on CentOS and is putting ACT_TOWN_POINTS_shp.dbf as the first aus_town_point input...

There should be no .dbf files for layers that have a .shp file.

BTW - I had thought about refactoring the Shapefile load to use SHP > GeoPandas > PG, now I'm wishing I had :-)

esb commented 3 years ago

I can see the problem....

This file is processed first -

shp2pgsql -e -d -I -G -n -i "/vagrant_data/geoscape_202108/AUG21_Admin_Boundaries_ESRIShapefileorDBFfile/TownPoints_AUG13_GDA94_SHP/Town Points/Town Points AUGUST 2013/Standard/TAS_TOWN_POINT_shp.dbf" raw_admin_bdys_202108.aus_town_point

Then this file comes afterwards -

shp2pgsql -e -a -s 4283 -i "/vagrant_data/geoscape_202108/AUG21_Admin_Boundaries_ESRIShapefileorDBFfile/TownPoints_AUG13_GDA94_SHP/Town Points/Town Points AUGUST 2013/Standard/TAS_TOWN_POINT_shp.shp" raw_admin_bdys_202108.aus_town_point

The dbf file creates the table, but this does not create the geom column. When the shp file is appended, it will crash.

So the problem is that a .dbf file exists for the aus_town_point table.

esb commented 3 years ago

Yes. The order seems to be wrong.

TAS_GCCSA_2011_POLYGON_shp.dbf TAS_GCCSA_2011_POLYGON_shp.shp TAS_GCCSA_2011_shp.dbf TAS_MB_2011_POLYGON_shp.dbf TAS_MB_2011_POLYGON_shp.shp TAS_MB_2011_shp.dbf TAS_SA1_2011_POLYGON_shp.dbf TAS_SA1_2011_POLYGON_shp.shp TAS_SA1_2011_shp.dbf TAS_SA2_2011_POLYGON_shp.dbf TAS_SA2_2011_POLYGON_shp.shp TAS_SA2_2011_shp.dbf TAS_SA3_2011_POLYGON_shp.dbf TAS_SA3_2011_POLYGON_shp.shp TAS_SA3_2011_shp.dbf TAS_SA4_2011_POLYGON_shp.dbf TAS_SA4_2011_POLYGON_shp.shp TAS_SA4_2011_shp.dbf TAS_IARE_2011_POLYGON_shp.dbf TAS_IARE_2011_POLYGON_shp.shp TAS_IARE_2011_shp.dbf TAS_ILOC_2011_POLYGON_shp.dbf TAS_ILOC_2011_POLYGON_shp.shp TAS_ILOC_2011_shp.dbf TAS_IREG_2011_POLYGON_shp.dbf TAS_IREG_2011_POLYGON_shp.shp TAS_IREG_2011_shp.dbf TAS_REMOTENESS_2011_POLYGON_shp.dbf TAS_REMOTENESS_2011_POLYGON_shp.shp TAS_REMOTENESS_2011_shp.dbf TAS_SEIFA_2011_shp.dbf TAS_SUA_2011_POLYGON_shp.dbf TAS_SUA_2011_POLYGON_shp.shp TAS_SUA_2011_shp.dbf TAS_SOSR_2011_POLYGON_shp.dbf TAS_SOSR_2011_POLYGON_shp.shp TAS_SOSR_2011_shp.dbf TAS_SOS_2011_POLYGON_shp.dbf TAS_SOS_2011_POLYGON_shp.shp TAS_SOS_2011_shp.dbf TAS_UCL_2011_POLYGON_shp.dbf TAS_UCL_2011_POLYGON_shp.shp TAS_UCL_2011_shp.dbf TAS_IARE_2016_POLYGON_shp.dbf TAS_IARE_2016_POLYGON_shp.shp TAS_IARE_2016_shp.dbf TAS_ILOC_2016_POLYGON_shp.dbf TAS_ILOC_2016_POLYGON_shp.shp TAS_ILOC_2016_shp.dbf TAS_IREG_2016_POLYGON_shp.dbf TAS_IREG_2016_POLYGON_shp.shp TAS_IREG_2016_shp.dbf TAS_GCCSA_2016_POLYGON_shp.dbf TAS_GCCSA_2016_POLYGON_shp.shp TAS_GCCSA_2016_shp.dbf TAS_MB_2016_POLYGON_shp.dbf TAS_MB_2016_POLYGON_shp.shp TAS_MB_2016_shp.dbf TAS_SA1_2016_POLYGON_shp.dbf TAS_SA1_2016_POLYGON_shp.shp TAS_SA1_2016_shp.dbf TAS_SA2_2016_POLYGON_shp.dbf TAS_SA2_2016_POLYGON_shp.shp TAS_SA2_2016_shp.dbf TAS_SA3_2016_POLYGON_shp.dbf TAS_SA3_2016_POLYGON_shp.shp TAS_SA3_2016_shp.dbf TAS_SA4_2016_POLYGON_shp.dbf TAS_SA4_2016_POLYGON_shp.shp TAS_SA4_2016_shp.dbf TAS_REMOTENESS_2016_POLYGON_shp.dbf TAS_REMOTENESS_2016_POLYGON_shp.shp TAS_REMOTENESS_2016_shp.dbf TAS_SEIFA_2016_shp.dbf TAS_SOSR_2016_POLYGON_shp.dbf TAS_SOSR_2016_POLYGON_shp.shp TAS_SOSR_2016_shp.dbf TAS_SOS_2016_POLYGON_shp.dbf TAS_SOS_2016_POLYGON_shp.shp TAS_SOS_2016_shp.dbf TAS_SUA_2016_POLYGON_shp.dbf TAS_SUA_2016_POLYGON_shp.shp TAS_SUA_2016_shp.dbf TAS_UCL_2016_POLYGON_shp.dbf TAS_UCL_2016_POLYGON_shp.shp TAS_UCL_2016_shp.dbf tas_gccsa_2021.shp tas_mb_2021.shp tas_sa1_2021.shp tas_sa2_2021.shp tas_sa3_2021.shp tas_sa4_2021.shp TAS_COMM_ELECTORAL_POLYGON_shp.dbf TAS_COMM_ELECTORAL_POLYGON_shp.shp TAS_COMM_ELECTORAL_shp.dbf TAS_STATE_ELECTORAL_POLYGON_shp.dbf TAS_STATE_ELECTORAL_POLYGON_shp.shp TAS_STATE_ELECTORAL_shp.dbf tas_lga.shp tas_localities.shp TAS_STATE_POLYGON_shp.dbf TAS_STATE_POLYGON_shp.shp TAS_STATE_shp.dbf TAS_LOCALITY_shp.dbf TAS_LOCALITY_TOWN_shp.dbf TAS_TOWN_POINT_shp.dbf TAS_TOWN_POINT_shp.shp TAS_TOWN_shp.dbf Authority_Code_MB_CATEGORY_CLASS_AUT_shp.dbf Authority_Code_REMOTENESS_CATEGORY_AUT_shp.dbf Authority_Code_MB_CATEGORY_CLASS_AUT_shp.dbf Authority_Code_REMOTENESS_CATEGORY_AUT_shp.dbf Authority_Code_STATE_ELECTORAL_CLASS_AUT_shp.dbf Authority_Code_TOWN_CLASS_AUT_shp.dbf

minus34 commented 3 years ago

Cool, we've got the problem nailed down.

Curious that the behaviour is different between operating systems, but not as important as finding a universal fix...

I've just updated the code to filter out *_POINT_shp.dbf files (line 340). However, in the list above the elif filter should exclude all *_POLYGON_shp.dbf files as well; so this may not be a fix on CentOS...