Open simonw opened 4 years ago
I'm pretty confident that the "id" property Fiona loads from the shapefile (which starts at 0 and increments from there) can be considered unique. It appears to be the value that is used to attach shapes in the .shp
file to property records in the .dbf
DBase file.
I used the dbfread Python module to read the DBase file directly to see if it had a concept of an ID, and it doesn't appear to - suggesting that the index into the file is the only way to associate it with a shape:
In [1]: from dbfread import DBF
In [2]: db = DBF("shape/scc_parkland_boundaries.dbf")
In [3]: next(iter(db))
Out[3]:
OrderedDict([('PARK_NAME', 'Sanborn'),
('ADDRESS', '16055 Sanborn Rd.'),
('CITY_ZIP', 'Saratoga, CA 95070'),
('PRD_OPER', 'yes'),
('MNT_REG', 'Region 1'),
('RNGR_REG', 'Region 1'),
('SHAPE_Leng', 128308.882691),
('SHAPE_Area', 149564781.558),
('RespCostCn', '5809'),
('STATUS', 'open'),
('RuleID', 3),
('SHF_BEATS', 'P1'),
('PARK_ID', 'Sanborn'),
('ACRES', 3432.2485994),
('PARK_SUFFI', 'County Park')])
Finally, the ESRI shapefile specification at https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf says the following:
So... when importing a single shapefile into a single table it's safe for me to consider the id
unique for the purpose of defining a primary key.
But how about loading multiple shapefiles into the same table? The tool supports this at the moment:
$ shapefile-to-sqlite features.db *.shp --table=features
But this won't work because of the duplicate primary keys.
This could be solved with a mode where the primary key becomes a string with a unique prefix for each incoming shapefile. Maybe it does that automatically if you attempt to load multiple shapefiles into the same table? Or maybe there's an option for that - which would be a bit more obvious, but people would have to remember to use it.
$ shapefile-to-sqlite features.db *.shp --table=features --prefix-pk
Now using .insert_all(..., replace=True)
as of 22a5611
I just found a shapefile where the OBJECTID
would make a better primary key:
OrderedDict([('OBJECTID', 1),
('CSAFP', '122'),
('CBSAFP', '12020'),
('GEOID', '12020'),
('NAME', 'Athens-Clarke County, GA'),
('NAMELSAD', 'Athens-Clarke County, GA Metro Area'),
('LSAD', 'M1'),
('MEMI', '1'),
('MTFCC', 'G3110'),
('ALAND', 2654601832),
('AWATER', 26140309),
('INTPTLAT', '+33.9439840'),
('INTPTLON', '-083.2138965'),
('Shape_Leng', 363485.4157899709),
('Shape_Area', 3905629939.0061007)])
https://data-usdot.opendata.arcgis.com/datasets/b0d0e777e2ad4b53803dbc0527c73d88_0
Hey @simonw, I am taking a look at this issue. I have resolved it locally using the --prefix-pk
idea you mentioned, optionally prefixing the PK with the file path for each file you pass in.
Should I submit a PR for that?
Yes please! That diff looks good - just needs a test and a sentence of documentation.
geojson-to-sqlite
supports both--pk=
and--alter
options.I'm not sure if
--pk
makes sense for Shapefiles - they appear to have a unijque ID property that I could use, but I'm not certain if it is guaranteed to be present. It's been present in the files I've looked at so far.Adding tests that exercise
--alter
plus upserting data into existing tables would be worthwhile.