This seems like a useful thing to do. While figuring out how to store and provide views on temporal change from non-temporal datasets in the edifice database is our own problem, for those users who merely want an up-to-date dataset, it would be nice to not have to re-download everything every night.
Any strategies for this? wget --spider will return the ultimately resolved URL and the file length without downloading the file. It seems plausible that a changed file on the data portal might also resolve to a URL with a new string — i.e. when I do:
That gets resolved to https://data.cityofchicago.org/api/file_data/9OVgki_a-MytpymEU2LRxpx0fsvbAE6MmYS8iDWm4xs?filename=City%2520Boundary.zip .
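If that long string really does change with each upload, one way to check for a new version would be to extract it from the resolved URL and compare it against the value saved from the previous run. A minimal sketch using plain shell parameter expansion (this assumes the URL always follows the `/api/file_data/<id>?filename=...` shape shown above):

```shell
#!/bin/sh
# Resolved URL as reported by `wget --spider` (the example from above).
resolved_url='https://data.cityofchicago.org/api/file_data/9OVgki_a-MytpymEU2LRxpx0fsvbAE6MmYS8iDWm4xs?filename=City%2520Boundary.zip'

# Strip everything up to the last '/' ...
file_id=${resolved_url##*/}
# ... then strip the query string, leaving just the long identifier.
file_id=${file_id%%\?*}

echo "$file_id"
# Compare this against the id saved from the last run; re-download on mismatch.
```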
I'm guessing that maybe when a new zip file gets put up there, that long string "9OVgki_a-MytpymEU2LRxpx0fsvbAE6MmYS8iDWm4xs" will be changed. Can anyone confirm this?
(The file length — 120943 bytes — is also displayed when you use wget --spider. But file length alone is obviously an insufficient criterion for detecting data modification.)
We verified last night that for CSV files the resolved URL does not contain the long string mentioned above, so we can't use that as a way to detect new versions.
One alternative approach may be to use the API to find the date/time when the file was last updated, e.g. the 'updated_at' field in the Socrata SODA API. Has anyone used this successfully before?
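If the metadata route works, the check reduces to comparing the dataset's update timestamp against the time of our last pull. A sketch of just that comparison (the 'updated_at' field name and ISO-8601 format are assumptions about what Socrata returns; the actual metadata fetch is omitted):

```python
from datetime import datetime, timezone

def needs_refresh(metadata: dict, last_pulled: datetime) -> bool:
    """Return True if the dataset was updated after our last pull.

    Assumes `metadata` carries an ISO-8601 'updated_at' string,
    which is a guess at the SODA response shape.
    """
    updated_at = datetime.fromisoformat(metadata["updated_at"])
    return updated_at > last_pulled

# Example: dataset touched after our last pull, so we should re-download.
meta = {"updated_at": "2013-05-02T09:00:00+00:00"}
last = datetime(2013, 5, 1, tzinfo=timezone.utc)
print(needs_refresh(meta, last))  # True
```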
The question then is how best to locally store the dates/times when the client last pulled down a given dataset. I could see an argument for making this a special table, but it may be better as a local flat file, since the main script is also proficient at completely dropping your database.
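For the flat-file option, something as small as a JSON map from dataset id to last-pull timestamp would survive a database drop. A sketch of what that could look like (the file name and layout are my own invention):

```python
import json
import os

STATE_FILE = "last_pulled.json"  # hypothetical name, lives next to the script

def load_state(path=STATE_FILE):
    """Return {dataset_id: iso_timestamp}, or {} on first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def record_pull(dataset_id, pulled_at, path=STATE_FILE):
    """Record that `dataset_id` was pulled at `pulled_at` (an ISO string)."""
    state = load_state(path)
    state[dataset_id] = pulled_at
    with open(path, "w") as f:
        json.dump(state, f, indent=2)
```

The point of keeping this outside the database is that dropping and recreating the schema — which the main script does freely — never erases the download bookkeeping.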