simonw / shapefile-to-sqlite

Load shapefiles into a SQLite (optionally SpatiaLite) database
Apache License 2.0
28 stars 4 forks source link

Figure out primary keys and update/alter #4

Open simonw opened 4 years ago

simonw commented 4 years ago

geojson-to-sqlite supports both --pk= and --alter options.

I'm not sure if --pk makes sense for Shapefiles - they appear to have a unijque ID property that I could use, but I'm not certain if it is guaranteed to be present. It's been present in the files I've looked at so far.

Adding tests that exercise --alter plus upserting data into existing tables would be worthwhile.

simonw commented 4 years ago

I'm pretty confident that the "id" property Fiona loads from the shapefile (which starts at 0 and increments from there) can be considered unique. It appears to be the value that is used to attach shapes in the .shp file to property records in the .dbf DBase file.

I used the dbfread Python module to read the DBase file directly to see if it had a concept of an ID, and it doesn't appear to - suggesting that the index into the file is the only way to associate it with a shape:

In [1]: from dbfread import DBF                                                                                                                                          

In [2]: db = DBF("shape/scc_parkland_boundaries.dbf")                                                                                                                    

In [3]: next(iter(db))                                                                                                                                                   
Out[3]: 
OrderedDict([('PARK_NAME', 'Sanborn'),
             ('ADDRESS', '16055 Sanborn Rd.'),
             ('CITY_ZIP', 'Saratoga, CA 95070'),
             ('PRD_OPER', 'yes'),
             ('MNT_REG', 'Region 1'),
             ('RNGR_REG', 'Region 1'),
             ('SHAPE_Leng', 128308.882691),
             ('SHAPE_Area', 149564781.558),
             ('RespCostCn', '5809'),
             ('STATUS', 'open'),
             ('RuleID', 3),
             ('SHF_BEATS', 'P1'),
             ('PARK_ID', 'Sanborn'),
             ('ACRES', 3432.2485994),
             ('PARK_SUFFI', 'County Park')])

Finally, the ESRI shapefile specification at https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf says the following:

shapefile_pdf__page_29_of_34_
simonw commented 4 years ago

So... when importing a single shapefile into a single table it's safe for me to consider the id unique for the purpose of defining a primary key.

But how about loading multiple shapefiles into the same table? The tool supports this at the moment:

$ shapefile-to-sqlite features.db *.shp --table=features

But this won't work because of the duplicate primary keys.

This could be solved with a mode where the primary key becomes a string with a unique prefix for each incoming shapefile. Maybe it does that automatically if you attempt to load multiple shapefiles into the same table? Or maybe there's an option for that - which would be a bit more obvious, but people would have to remember to use it.

$ shapefile-to-sqlite features.db *.shp --table=features --prefix-pk
simonw commented 4 years ago

Now using .insert_all(..., replace=True) as of 22a5611

simonw commented 3 years ago

I just found a shapefile where the OBJECTID would make a better primary key:

OrderedDict([('OBJECTID', 1),
             ('CSAFP', '122'),
             ('CBSAFP', '12020'),
             ('GEOID', '12020'),
             ('NAME', 'Athens-Clarke County, GA'),
             ('NAMELSAD', 'Athens-Clarke County, GA Metro Area'),
             ('LSAD', 'M1'),
             ('MEMI', '1'),
             ('MTFCC', 'G3110'),
             ('ALAND', 2654601832),
             ('AWATER', 26140309),
             ('INTPTLAT', '+33.9439840'),
             ('INTPTLON', '-083.2138965'),
             ('Shape_Leng', 363485.4157899709),
             ('Shape_Area', 3905629939.0061007)])

https://data-usdot.opendata.arcgis.com/datasets/b0d0e777e2ad4b53803dbc0527c73d88_0

saulshanabrook commented 2 years ago

Hey @simonw, I am taking a look at this issue. I have resolved it locally using the --prefix-pk idea you mentioned, optionally prefixing the PK with the file path for each file you pass in.

Should I submit a PR for that?

simonw commented 2 years ago

Yes please! That diff looks good - just needs a test and a sentence of documentation.