maudetes closed this 2 weeks ago
Deployed in dev with the following migration logs:
```
udata:2024-10-01-remove-ods-duplicates ............................... [Apply]
 │
 │ Starting
 │ 10007 datasets to process...
 │ Skipping, 2 csv duplicate resources found for 66b44391aa492985b0a06f06. We're expecting 3 or 4
 │ Skipping, 2 csv duplicate resources found for 66b40c68c73c0b6bcfa06f1f. We're expecting 3 or 4
 │ Skipping, 1 csv duplicate resources found for 66b2e3eff25fb2d26ca06f03. We're expecting 3 or 4
 │ Done !
 │ Updated 10004 datasets. Failed on 0 objects.
 │
 │
 └──[OK]
```
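The "Skipping" lines come from a guard in the migration: a dataset is only deduplicated when the number of duplicate exports matches one of the expected values (3 or 4, i.e. one per known URL variant), and is skipped otherwise. A minimal sketch of that guard, assuming the names `EXPECTED_DUPLICATE_COUNTS` and `should_process` (both hypothetical, inferred from the log message):

```python
# Hypothetical reconstruction of the guard visible in the logs above.
# The exact set {3, 4} is taken from the "We're expecting 3 or 4" message.
EXPECTED_DUPLICATE_COUNTS = {3, 4}

def should_process(duplicates: list, dataset_id: str) -> bool:
    """Return True only if the duplicate count matches a known pattern."""
    if len(duplicates) not in EXPECTED_DUPLICATE_COUNTS:
        print(f"Skipping, {len(duplicates)} csv duplicate resources found "
              f"for {dataset_id}. We're expecting 3 or 4")
        return False
    return True
```

Datasets rejected by this guard are the ones the description below suggests cleaning manually.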
Example: this ODS dataset on data.gouv.fr (3 CSVs) vs on dev.data.gouv.fr (1 CSV left)
2 days later
```python
>>> Dataset.objects(created_at_internal__lte='2024-10-08', resources__created_at_internal__gte='2024-10-09', harvest__backend='DCAT').count()
69
```
No major increase in resource count :ok_hand: Most of these creations come from the Infogreffe harvester, which just succeeded after months of failures.
Fix https://github.com/datagouv/data.gouv.fr/issues/1511
Remove duplicate OpenDataSoft resources. The duplicates are due to ODS modifying the URL of their CSV and XLSX exports, the URL being the identifier of the resources in our DCAT harvesting. Timeline of the duplicates:

- `/export/csv`: the original URL, in use since we started harvesting ODS with DCAT
- `use_labels=false` appended at the end of the URL -> first set of duplicates
- `use_labels=true` -> second set of duplicates

Running the migration on a prod dump takes 6 min and updates around 9.4K datasets, removing at least 2 duplicate resources per dataset. Around 10 datasets were skipped due to an unexpected number or pattern of resources. :arrow_right: I think we should clean these erroneous datasets manually.
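Since the URL variants above differ only by the `use_labels` query parameter, the core of the deduplication can be sketched as normalizing each export URL and keeping a single resource per normalized key. This is an illustrative sketch, not the actual migration code; `normalize_export_url` and `deduplicate` are hypothetical names, and the real migration works on mongoengine documents rather than plain dicts:

```python
# Hypothetical sketch of the dedup idea described above: URL variants that
# differ only by the `use_labels` parameter are treated as the same export.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_export_url(url: str) -> str:
    """Strip the `use_labels` parameter so ODS URL variants compare equal."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "use_labels"]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))

def deduplicate(resources: list[dict]) -> list[dict]:
    """Keep the first resource seen for each normalized export URL."""
    seen: dict[str, dict] = {}
    for resource in resources:
        seen.setdefault(normalize_export_url(resource["url"]), resource)
    return list(seen.values())
```

For example, the three variants of a CSV export collapse to a single resource:

```python
resources = [
    {"url": "https://example.opendatasoft.com/export/csv"},
    {"url": "https://example.opendatasoft.com/export/csv?use_labels=false"},
    {"url": "https://example.opendatasoft.com/export/csv?use_labels=true"},
]
deduplicate(resources)  # a single resource remains
```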
I would recommend running this migration in our preprod env and making sure that daily harvesting doesn't create new duplicates due to wrong URL modifications.