opendatateam / udata

Customizable and skinnable social platform dedicated to open data.
http://udata.readthedocs.org
GNU Affero General Public License v3.0

Add migration to delete duplicate resources due to ODS harvesting #3158

Closed. maudetes closed this 2 weeks ago

maudetes commented 3 weeks ago

Fix https://github.com/datagouv/data.gouv.fr/issues/1511

Remove duplicate OpenDataSoft (ODS) resources. The duplicates are due to ODS modifying the URLs of their CSV and XLSX exports, and the URL is the identifier of a resource in our DCAT harvesting. Timeline of duplicates:

Running the migration on a prod dump takes 6 min and updates around 9.4K datasets, removing at least 2 duplicate resources per dataset. Around 10 datasets were skipped due to an unexpected number or pattern of resources. :arrow_right: I think we should probably clean these erroneous datasets manually.
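To illustrate the skip-or-dedupe behaviour described above, here is a minimal sketch on plain dictionaries. It is not the migration code from this PR: the function name `dedupe_resources`, the dictionary keys, and the choice to keep the most recently created resource are all assumptions made for illustration. The expected counts of 3 or 4 come from the "Skipping" lines in the migration logs.

```python
# Hypothetical sketch, not the actual udata migration.
# Expected duplicate counts, taken from the "We're expecting 3 or 4" log message.
EXPECTED_DUPLICATE_COUNTS = {3, 4}


def dedupe_resources(resources, fmt):
    """Keep a single resource of the given format, or skip the dataset.

    Returns (kept, removed), or None when the number of resources of this
    format is not one of the expected counts (the dataset is then left
    untouched for manual cleaning).
    """
    candidates = [r for r in resources if r["format"] == fmt]
    if len(candidates) not in EXPECTED_DUPLICATE_COUNTS:
        return None

    # Assumption: the most recently created resource is the one to keep.
    newest = max(candidates, key=lambda r: r["created_at"])
    kept = [r for r in resources if r["format"] != fmt or r is newest]
    removed = [r for r in candidates if r is not newest]
    return kept, removed


# Made-up example data: three CSV resources with different export URLs.
resources = [
    {"url": "https://example.org/export-v1.csv", "format": "csv", "created_at": "2024-06-01"},
    {"url": "https://example.org/export-v2.csv", "format": "csv", "created_at": "2024-08-01"},
    {"url": "https://example.org/export-v3.csv", "format": "csv", "created_at": "2024-09-01"},
    {"url": "https://example.org/export.json", "format": "json", "created_at": "2024-06-01"},
]
result = dedupe_resources(resources, "csv")
```

With this sample, two CSV resources are removed and the JSON resource is untouched. A dataset with only 1 or 2 CSV resources returns `None`, which corresponds to the skipped datasets mentioned above.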

I would recommend running this migration in our preprod env and making sure that daily harvesting doesn't create new duplicates due to a wrong URL modification.

maudetes commented 2 weeks ago

Deployed in dev with the following migration logs:

            "udata:2024-10-01-remove-ods-duplicates ............................... [Apply]",
            "  │",
            "  │ Starting",
            "  │ 10007 datasets to process...",
            "  │ Skipping, 2 csv duplicate resources found for 66b44391aa492985b0a06f06. We're expecting 3 or 4",
            "  │ Skipping, 2 csv duplicate resources found for 66b40c68c73c0b6bcfa06f1f. We're expecting 3 or 4",
            "  │ Skipping, 1 csv duplicate resources found for 66b2e3eff25fb2d26ca06f03. We're expecting 3 or 4",
            "  │ Done !",
            "  │ Updated 10004 datasets. Failed on 0 objects.",
            "  │",
            "  │",
            "  └──[OK]"

Example: this ODS dataset on data.gouv.fr (3 CSV resources) vs. on dev.data.gouv.fr (1 CSV resource left)


2 days later

>>> Dataset.objects(created_at_internal__lte='2024-10-08', resources__created_at_internal__gte='2024-10-09', harvest__backend='DCAT').count()
69

No major increase in the number of resources :ok_hand: Most of these creations come from the infogreffe harvester succeeding just now after months of failure.
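For readers unfamiliar with this kind of query, the check above can be restated on plain dictionaries. This is only a sketch: the function name `got_new_resources` and the dictionary keys are hypothetical, and the sample data is made up. It counts datasets created on or before one date that have at least one resource created on or after a later date.

```python
from datetime import date


def got_new_resources(dataset, dataset_cutoff, resource_cutoff, backend="DCAT"):
    """True if an older harvested dataset has at least one recent resource."""
    return (
        dataset["backend"] == backend
        and dataset["created_at"] <= dataset_cutoff
        # As with a query on an embedded list, one matching resource is enough.
        and any(r["created_at"] >= resource_cutoff for r in dataset["resources"])
    )


# Made-up sample: only the first dataset is old AND has a recent resource.
datasets = [
    {"backend": "DCAT", "created_at": date(2024, 8, 1),
     "resources": [{"created_at": date(2024, 8, 1)}, {"created_at": date(2024, 10, 9)}]},
    {"backend": "DCAT", "created_at": date(2024, 8, 1),
     "resources": [{"created_at": date(2024, 8, 1)}]},
    {"backend": "DCAT", "created_at": date(2024, 10, 9),
     "resources": [{"created_at": date(2024, 10, 9)}]},
]
flagged = [
    d for d in datasets
    if got_new_resources(d, date(2024, 10, 8), date(2024, 10, 9))
]
```

A dataset created after the cutoff is excluded even if all its resources are recent, so newly harvested datasets do not inflate the count.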