opendatateam / udata

Customizable and skinnable social platform dedicated to open data.
http://udata.readthedocs.org
GNU Affero General Public License v3.0

resource.fs_filename and resource.url are sometimes unsynced #2544

Open abulte opened 4 years ago

abulte commented 4 years ago

Sometimes, especially on community resources from transport.data.gouv.fr, our fs_filename is not synced with url. After our cleanup, this led to purge-datasets failing because it tries to remove a file that does not exist.
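One way to keep a purge from failing on an already-missing file is to treat "file not found" as a non-error. The snippet below is only a minimal sketch of that idea using the standard library; `remove_if_present` and the `root` directory are hypothetical names and this is not udata's actual storage code.

```python
from pathlib import Path


def remove_if_present(root: Path, fs_filename: str) -> bool:
    """Delete root/fs_filename.

    Returns True if a file was deleted, False if it was already gone.
    (Hypothetical helper, not part of udata.)
    """
    target = root / fs_filename
    try:
        target.unlink()
    except FileNotFoundError:
        # The file was already removed (e.g. by an earlier cleanup).
        return False
    return True
```

The boolean return value lets the caller log dangling fs_filename entries instead of crashing on them.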

Tests done so far

On demo.data.gouv.fr:

CommunityResources

300+ occurrences; see the complete list at https://gist.github.com/abulte/f283a2c2e3dc9102d8f767f0c908637e.

It happened again this morning, cf

offre-de-transport-du-reseau-lio-arc-en-ciel-gtfs,08422838-434a-4a7d-907a-7d43b57b8639,https://static.data.gouv.fr/resources/offre-de-transport-du-reseau-lio-arc-en-ciel-gtfs/20201001-081210/reseau-lr-gtfs-20200924.zip.netex.zip,offre-de-transport-du-reseau-lio-arc-en-ciel-gtfs/20201001-081632/reseau-lr-gtfs-20200924.zip.netex.zip

Resources

>>> for d in Dataset.objects:
...     for r in d.resources:
...         if r.fs_filename and not r.url.endswith(r.fs_filename):
...             print(d.id, r.id, r.url, r.fs_filename)
...
5d13a8b6634f41070a43dff3 1ac234c7-1da4-49cf-a122-646b21d64b43 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074922/export-tag-20200926-074922.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074919/export-tag-20200822-074919.csv
5d13a8b6634f41070a43dff3 970aafa0-3778-4d8b-b9d1-de937525e379 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074920/export-reuse-20200926-074920.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074917/export-reuse-20200822-074917.csv
5d13a8b6634f41070a43dff3 b7bbfedc-2448-4135-a6c7-104548d396e7 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074909/export-organization-20200926-074909.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074906/export-organization-20200822-074906.csv
5d13a8b6634f41070a43dff3 d77705e1-4ecd-461c-8c24-662d47c4c2f9 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074906/export-discussion-20200926-074906.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074903/export-discussion-20200822-074903.csv
5d13a8b6634f41070a43dff3 4babf5f2-6a9c-45b5-9144-ca5eae6a7a6d https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074811/export-resource-20200926-074811.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074809/export-resource-20200822-074809.csv
5d13a8b6634f41070a43dff3 f868cca6-8da1-4369-a78d-47463f19a9a3 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074505/export-dataset-20200926-074505.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074502/export-dataset-20200822-074502.csv
5448d3e0c751df01f85d0572 50625621-18bd-43cb-8fde-6b8c24bdabb3 https://static.data.gouv.fr/resources/fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/20200920-224338/bornes-irve-20200920.csv fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/20200820-224444/bornes-irve-20200820.csv

See #2542 for catalogue-des-donnees-de-data-gouv-fr. No idea why it failed for fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques; this is a pretty standard API upload.
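For reference, the condition used in the shell loop can be isolated as a small pure function, which makes it easy to run against exported rows as well. This is just a sketch; `is_unsynced` is a hypothetical name, and the sample values are the bornes-irve row from the output above.

```python
from typing import Optional


def is_unsynced(url: str, fs_filename: Optional[str]) -> bool:
    """True when fs_filename is set but the URL does not end with it."""
    return bool(fs_filename) and not url.endswith(fs_filename)


# Sample row taken from the output above.
slug = "fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques"
url = f"https://static.data.gouv.fr/resources/{slug}/20200920-224338/bornes-irve-20200920.csv"
fs_filename = f"{slug}/20200820-224444/bornes-irve-20200820.csv"

print(is_unsynced(url, fs_filename))  # True: the two paths point at different uploads
```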

abulte commented 4 years ago

About community resources, this was caused by transport's script, which forced the URL to a previous value when doing the PUT to update metadata. It's been fixed, but there are still 87 dangling resources: https://gist.github.com/abulte/f283a2c2e3dc9102d8f767f0c908637e#file-cr-unsynced-fs-filename-v2-csv. They can be removed.

Going further: we probably should not allow setting the URL from the API when a resource is of type file (i.e. not remote). This would have prevented this whole mess.
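A rough sketch of what such a guard could look like, independent of any framework. The `filetype` values `"file"` and `"remote"`, the function name, and the exception class are assumptions made for illustration; this is not the existing udata API code.

```python
from typing import Optional


class URLChangeNotAllowed(ValueError):
    """Raised when a client tries to change the URL of a hosted file (hypothetical)."""


def check_url_update(filetype: str, current_url: str, submitted_url: Optional[str]) -> str:
    """Return the URL to store after a metadata update.

    Assumes filetype is "file" for uploaded resources and "remote" otherwise.
    """
    if submitted_url is None or submitted_url == current_url:
        # Nothing to change: keep the URL that matches the stored file.
        return current_url
    if filetype == "file":
        # The URL of an uploaded file is derived from its storage path,
        # so it must not be overridden by the client.
        raise URLChangeNotAllowed("url is read-only for resources of type 'file'")
    return submitted_url
```

With a check like this, a PUT that resends an outdated URL for an uploaded file would be rejected instead of silently desynchronizing url and fs_filename.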

abulte commented 4 years ago

Keeping this open since this one is still unexplained:

5448d3e0c751df01f85d0572 50625621-18bd-43cb-8fde-6b8c24bdabb3 https://static.data.gouv.fr/resources/fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/20200920-224338/bornes-irve-20200920.csv fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/20200820-224444/bornes-irve-20200820.csv

and #2542 must be fixed too.