opencivicdata / docs.opencivicdata.org

Open Civic Data project documentation
https://open-civic-data.readthedocs.io

Best practice for generating OCD IDs #103

Open todrobbins opened 6 years ago

todrobbins commented 6 years ago

I've seen UUIDs within California Civic Data datasets (e.g. https://calaccess.californiacivicdata.org/documentation/processed-files/ballot-measures/) and wondered if there are best practices for ID generation. Thanks!

Examples:

gordonje commented 6 years ago

@todrobbins I can only really speak to how the OCD ids are implemented, if that's helpful.

OCDIDField is a custom Django field used for the id fields on Election, BallotMeasureContest and other models. It takes an ocd_type kwarg that sets the prefix before the UUID.

The UUID itself is randomly generated via uuid.uuid4() from Python's standard library.
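To make that concrete, here is a minimal sketch of generating an id in the same spirit outside of Django. It assumes the `ocd-<type>/<uuid>` layout; `make_ocd_id` and `OCD_ID_PATTERN` are hypothetical names for illustration, not part of python-opencivicdata.

```python
import re
import uuid


def make_ocd_id(ocd_type: str) -> str:
    """Return an id shaped like 'ocd-<type>/<random uuid4>'.

    Hypothetical helper: the layout is an assumption about how the
    prefix and the UUID are joined.
    """
    return f"ocd-{ocd_type}/{uuid.uuid4()}"


# Loose validation pattern for ids produced by the helper above.
OCD_ID_PATTERN = re.compile(
    r"^ocd-[a-z]+/[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}$"
)

if __name__ == "__main__":
    new_id = make_ocd_id("election")
    print(new_id)
    print(bool(OCD_ID_PATTERN.match(new_id)))
```

Because uuid4() is random, two calls with the same type give different ids, so the id carries no information about the record beyond its type.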

python-opencivicdata had all this set up for us before we came along and implemented the election module. The bigger challenge for us was ensuring that our daily ETL process preserves the previously generated ids without inserting duplicate records.
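One way to think about the "preserve ids across ETL runs" problem, as a rough sketch only: keep a mapping from some stable natural key in the source data to the id you already assigned, and only mint a new id when the key is unseen. This is not how django-calaccess-processed-data does it; `assign_ids` and the example keys are hypothetical, and in practice the mapping would live in the database rather than a dict.

```python
import uuid
from typing import Dict, Iterable


def assign_ids(natural_keys: Iterable[str], known: Dict[str, str], ocd_type: str) -> int:
    """Reuse the stored id for keys seen before, mint one for new keys.

    `known` maps natural key -> previously generated id and is updated
    in place. Returns how many new ids were created on this run.
    """
    created = 0
    for key in natural_keys:
        if key not in known:
            known[key] = f"ocd-{ocd_type}/{uuid.uuid4()}"
            created += 1
    return created


if __name__ == "__main__":
    known_ids: Dict[str, str] = {}
    # First run: both records are new.
    print(assign_ids(["filing-1", "filing-2"], known_ids, "filing"))
    first = known_ids["filing-1"]
    # Second run: one old record, one new one.
    print(assign_ids(["filing-1", "filing-3"], known_ids, "filing"))
    # The id assigned on the first run is unchanged.
    print(known_ids["filing-1"] == first)
```

The catch is that random ids can't be re-derived from the data, so the mapping has to be persisted somewhere between runs.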

If you're working on something outside the OCD ecosystem, but still in Django, you might consider just using the UUIDField.

Also, if you're storing your data in Postgres, both pgcrypto and uuid-ossp are useful extensions.

Over in django-calaccess-processed-data, we're using pgcrypto's gen_random_uuid() function to create the OCD ids in bulk, for example, when creating hundreds of thousands of filings at once.

Hope that's helpful. If you're looking for more general guidance about assigning ids for data intended for public consumption, I think this is something @fgregg has been researching recently.