Deleting objects does not delete the identifiers associated with them. When orphaned identifiers linger, any subsequent object with the same identifier creates a duplicate. This becomes a problem when data is removed by accident, or when it's desirable to remove data before a scrape can correct it, e.g., to prevent the spread of erroneous information.
A practical example: We scrape events from the Legistar API and use the unique event ID as an identifier for events. This week, we needed to remove a batch of test events, some with errors, and rely on the scrape to repopulate the events that did not contain errors. Instead, each scrape we ran created a duplicate of every correct event that had been removed.
One option might be to hook into the delete signals for the top-level models in python-opencivicdata and remove any associated identifiers on removal. That wouldn't cover data removed at the database level, though, since signals wouldn't fire. A database trigger implemented in a migration could cover removal at both the ORM and database level, though it would be less obvious to the end user.
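To make the trigger idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Everything in it is an assumption for illustration: the `event` and `event_identifier` tables, their columns, and the trigger name are hypothetical and are not the actual python-opencivicdata schema. A real deployment would most likely be on PostgreSQL, where the trigger syntax differs, and the SQL would live in a migration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical events schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE event (id INTEGER PRIMARY KEY, name TEXT);

        -- No foreign key constraint here, so identifier rows could
        -- outlive their event if nothing cleans them up.
        CREATE TABLE event_identifier (
            id INTEGER PRIMARY KEY,
            event_id INTEGER,
            identifier TEXT
        );

        -- The trigger fires for any DELETE on event, whether it comes
        -- from an ORM or from raw SQL.
        CREATE TRIGGER remove_event_identifiers
        AFTER DELETE ON event
        FOR EACH ROW
        BEGIN
            DELETE FROM event_identifier WHERE event_id = OLD.id;
        END;
        """
    )
    conn.execute("INSERT INTO event (id, name) VALUES (1, 'Test event')")
    conn.execute(
        "INSERT INTO event_identifier (event_id, identifier) "
        "VALUES (1, 'legistar-1234')"
    )
    return conn


def count_identifiers(conn):
    """Return how many identifier rows remain."""
    return conn.execute("SELECT COUNT(*) FROM event_identifier").fetchone()[0]


conn = build_demo_db()
conn.execute("DELETE FROM event WHERE id = 1")  # raw SQL delete, no ORM involved
remaining = count_identifiers(conn)
```

Because the cleanup lives in the database, it runs even when the delete bypasses the ORM, which is the case that signals alone would miss.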
In the meantime, this issue can be mitigated by ensuring identifiers are sufficiently unique and by deleting data carefully, but I think it would be worth considering for a future release.
Related to #295.
As ever, thanks for your work!