Add migration to delete duplicate resources due to ODS harvesting

opendatateam / udata

Customizable and skinnable social platform dedicated to open data.

GNU Affero General Public License v3.0

Fix https://github.com/datagouv/data.gouv.fr/issues/1511

Remove duplicate OpenDataSoft resources. The duplicates are due to ODS modifying the URLs of their CSV and XLSX exports. Since the URL is the identifier of a resource in our DCAT harvesting, each URL change created a new resource instead of updating the existing one. Timeline of duplicates:

The default /export/csv URL has existed since we started harvesting ODS with DCAT.
2024-08-07 : use_labels=false appended at the end of the URL -> first duplicates
2024-08-09 : replaced by use_labels=true -> second set of duplicates
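To illustrate why the three URL variants in the timeline end up as distinct resources, here is a minimal sketch that normalizes an export URL by dropping the `use_labels` query parameter, so that all variants compare equal. The function name `normalize_export_url` and the `example.org` URLs are hypothetical and only for illustration. This is not necessarily how the migration matches duplicates.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_export_url(url: str) -> str:
    """Return the URL without its use_labels query parameter.

    Hypothetical helper: it only shows that the three variants from the
    timeline point to the same export once use_labels is ignored.
    """
    parts = urlsplit(url)
    # Keep every query parameter except use_labels, preserving order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "use_labels"
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Assumed example URLs, not real ODS endpoints.
base = "https://example.org/dataset/foo/export/csv"
variants = [base, base + "?use_labels=false", base + "?use_labels=true"]
normalized = {normalize_export_url(v) for v in variants}
```

With these inputs, `normalized` contains a single URL, whereas a harvester keyed on the raw URL sees three different identifiers.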

Running the migration on a prod dump takes about 6 minutes and updates around 9.4K datasets, removing at least 2 duplicate resources per dataset. Around 10 datasets are skipped because they have an unexpected number or pattern of resources. :arrow_right: I think we should probably clean these erroneous datasets manually.
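The skip rule can be sketched roughly as follows. The expected counts (3 or 4) come from the migration logs. Everything else is an assumption for illustration: the `Resource` dataclass, the `split_duplicates` name, and in particular the choice to keep the most recently created resource, which the description does not specify.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Counts taken from the log message "We're expecting 3 or 4".
EXPECTED_DUPLICATE_COUNTS = (3, 4)


@dataclass
class Resource:
    """Hypothetical stand-in for a udata resource."""
    id: str
    url: str
    created_at: datetime


def split_duplicates(
    resources: list[Resource],
) -> Optional[tuple[Resource, list[Resource]]]:
    """Split one dataset's duplicate CSV resources into (keep, remove).

    Returns None when the count is unexpected, meaning the dataset is
    skipped and left for manual cleanup.
    """
    if len(resources) not in EXPECTED_DUPLICATE_COUNTS:
        return None
    ordered = sorted(resources, key=lambda r: r.created_at)
    # ASSUMPTION: keep the newest resource; the real migration may differ.
    return ordered[-1], ordered[:-1]


# Example with three variants, mirroring the timeline (URLs are made up).
csv_resources = [
    Resource("a", "https://example.org/export/csv", datetime(2024, 1, 1)),
    Resource("b", "https://example.org/export/csv?use_labels=false", datetime(2024, 8, 7)),
    Resource("c", "https://example.org/export/csv?use_labels=true", datetime(2024, 8, 9)),
]
result = split_duplicates(csv_resources)
```

Under these assumptions a dataset with three CSV variants loses two resources, which is consistent with "at least 2" above, while a dataset with only one or two matching resources is skipped.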

I would recommend running this migration in our preprod env and making sure that daily harvesting doesn't create new duplicates because of a wrong URL modification.

Deployed in dev with the following migration logs:

            "udata:2024-10-01-remove-ods-duplicates ............................... [Apply]",
            "  │",
            "  │ Starting",
            "  │ 10007 datasets to process...",
            "  │ Skipping, 2 csv duplicate resources found for 66b44391aa492985b0a06f06. We're expecting 3 or 4",
            "  │ Skipping, 2 csv duplicate resources found for 66b40c68c73c0b6bcfa06f1f. We're expecting 3 or 4",
            "  │ Skipping, 1 csv duplicate resources found for 66b2e3eff25fb2d26ca06f03. We're expecting 3 or 4",
            "  │ Done !",
            "  │ Updated 10004 datasets. Failed on 0 objects.",
            "  │",
            "  │",
            "  └──[OK]"

Example: this ODS dataset on data.gouv.fr (3 CSV resources) vs. on dev.data.gouv.fr (1 CSV resource left)

2 days later

>>> Dataset.objects(created_at_internal__lte='2024-10-08', resources__created_at_internal__gte='2024-10-09', harvest__backend='DCAT').count()
69

No major increase in the number of resources :ok_hand: Most of these creations come from the infogreffe harvester finally succeeding after months of failure.

Add migration to delete duplicate resources due to ODS harvesting #3158