tplagge / edifice


Detecting modified files on the data portal (for setup_edifice.py) #3

Open derekeder opened 11 years ago

derekeder commented 11 years ago

This seems like a useful thing to do. While figuring out how to store and provide views on temporal change from non-temporal datasets in the edifice database is our own problem, users who merely want an up-to-date dataset shouldn't have to re-download everything every night.

Any strategies for this? wget --spider will return the ultimately resolved URL and the file length without downloading the file. It seems plausible that a changed file on the data portal might also resolve to a URL with a new string. For example, when I do:

wget --spider --no-check-certificate -O 'City Boundary.zip' http://data.cityofchicago.org/download/q38j-zgre/application/zip

That gets resolved to https://data.cityofchicago.org/api/file_data/9OVgki_a-MytpymEU2LRxpx0fsvbAE6MmYS8iDWm4xs?filename=City%2520Boundary.zip .

I'm guessing that maybe when a new zip file gets put up there, that long string "9OVgki_a-MytpymEU2LRxpx0fsvbAE6MmYS8iDWm4xs" will be changed. Can anyone confirm this?

(The file length, 120943 bytes in this case, is also displayed when you use wget --spider. But file length alone is obviously an insufficient criterion for detecting data modification.)
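For anyone who wants to test the hunch above without wget: a minimal sketch using Python's requests library that does the same probe as `wget --spider`. The comparison logic is only an illustration; the download URL is the one from this thread, and `verify=False` mirrors `--no-check-certificate`.

```python
import requests

# The download URL from this thread.
DOWNLOAD_URL = "http://data.cityofchicago.org/download/q38j-zgre/application/zip"

def probe(url):
    """Follow redirects with a HEAD request, like `wget --spider`.

    Returns the fully resolved URL and the reported Content-Length
    without downloading the file body.
    """
    # verify=False mirrors wget's --no-check-certificate.
    resp = requests.head(url, allow_redirects=True, verify=False)
    resp.raise_for_status()
    return resp.url, resp.headers.get("Content-Length")

resolved_url, content_length = probe(DOWNLOAD_URL)
print(resolved_url, content_length)
# Re-download only when either value differs from what was recorded
# on the previous night's run.
```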

We verified last night that for CSV files the resolved URL does not contain the long string mentioned above, so we can't use it to detect new versions.


One alternative approach would be to use the API to find the date/time when a file was last updated, e.g. 'updated_at' in the Socrata SODA API. Has anyone used this successfully before?
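For what it's worth, a rough sketch of that check in Python. This assumes the portal exposes the Socrata views metadata endpoint (/api/views/<dataset-id>.json) with a rowsUpdatedAt epoch-seconds field; field names vary across portal versions, so verify against a live response before relying on it:

```python
import requests

def last_updated(dataset_id):
    """Return the dataset's last-update time as epoch seconds.

    Assumes the Socrata views metadata endpoint and its
    'rowsUpdatedAt' field; both should be checked against a
    live response from the portal.
    """
    url = "https://data.cityofchicago.org/api/views/%s.json" % dataset_id
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.json().get("rowsUpdatedAt")

# e.g. the 4x4 id from the download URL above:
print(last_updated("q38j-zgre"))
```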

The question then is how best to store locally the dates/times when the client last pulled down a given dataset. I could see an argument for making this a special table, but it may be better as a local flat file, since the main script is also proficient at completely dropping your database.
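If the flat file wins out, something as small as a JSON map from dataset id to last-pull timestamp would survive a database drop. A hypothetical sketch (the file name and layout here are made up, not anything setup_edifice.py does today):

```python
import json
import os
import time

# Hypothetical state file, living alongside setup_edifice.py.
STATE_FILE = "last_pulled.json"

def load_state():
    """Read the dataset-id -> last-pull-timestamp map, if present."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

def needs_refresh(dataset_id, portal_updated_at):
    """True if the portal reports an update newer than our last pull."""
    return portal_updated_at > load_state().get(dataset_id, 0)

def record_pull(dataset_id):
    """Remember when we last pulled this dataset."""
    state = load_state()
    state[dataset_id] = int(time.time())
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)
```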