maschinenmensch / edifice

A database of the built environment in Chicago

Detecting modified files on the data portal (for setup_edifice.py) #15

Open mccc opened 11 years ago

mccc commented 11 years ago

This seems like a useful thing to do. Figuring out how to store and provide views on temporal change from non-temporal datasets in the edifice database is our own problem, but for users who merely want an up-to-date dataset, it would be nice not to have to re-download everything every night.

Any strategies for this? wget --spider will return the ultimately resolved URL and the file length without downloading the file. It seems plausible that a changed file on the data portal might also resolve to a URL containing a new string. For example, when I run:

wget --spider --no-check-certificate -O 'City Boundary.zip' http://data.cityofchicago.org/download/q38j-zgre/application/zip

That gets resolved to https://data.cityofchicago.org/api/file_data/9OVgki_a-MytpymEU2LRxpx0fsvbAE6MmYS8iDWm4xs?filename=City%2520Boundary.zip .

I'm guessing that maybe when a new zip file gets put up there, that long string "9OVgki_a-MytpymEU2LRxpx0fsvbAE6MmYS8iDWm4xs" will be changed. Can anyone confirm this?

(The file length — 120943 bytes — is also displayed when you use wget --spider. But file length alone is obviously an insufficient criterion for detecting data modification.)
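For reference, here is a minimal sketch of the same probe from Python, assuming the `requests` library is available and that the portal answers HEAD requests the same way it answers the wget --spider probe; whether the long token in the resolved URL actually changes on re-upload is still the open question.

```python
# Minimal sketch: follow redirects without downloading the payload and record
# the resolved URL plus the reported Content-Length.
import requests

def probe(url):
    # verify=False mirrors wget's --no-check-certificate
    resp = requests.head(url, allow_redirects=True, verify=False)
    resp.raise_for_status()
    return resp.url, resp.headers.get("Content-Length")

resolved_url, length = probe(
    "http://data.cityofchicago.org/download/q38j-zgre/application/zip")
print(resolved_url, length)

# If the long token in resolved_url really does change on every re-upload
# (unconfirmed), comparing it against the value saved from the previous run
# would be enough to decide whether to re-download.
```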

mccc commented 11 years ago

We verified last night that for CSV files the re-resolved long string mentioned above is not present, so we can't use that as a way to detect new versions.

One alternative approach may be to use the API to find the date/time when the file was last updated, e.g. the 'updated_at' field in the Socrata SODA API. Has anyone used this successfully before?

The question then is how best to locally store the dates/times when the client last pulled down a given dataset. I could see an argument for making this a special table, but it may be better as a local flat file, since the main script is also quite capable of completely dropping your database.
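To make that concrete, a rough sketch of what it could look like, assuming the Socrata views metadata endpoint (/api/views/&lt;dataset-id&gt;.json) exposes a rowsUpdatedAt-style epoch timestamp (the exact field name needs to be confirmed against the portal) and keeping last-seen values in a last_pulled.json flat file next to the script:

```python
# Sketch of the "check a timestamp, keep a local flat file" idea.
import json
import os
import requests

STATE_FILE = "last_pulled.json"

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)

def needs_refresh(dataset_id, state):
    meta = requests.get(
        "https://data.cityofchicago.org/api/views/%s.json" % dataset_id).json()
    remote = meta.get("rowsUpdatedAt")  # assumed field name; verify on a real dataset
    return (remote is not None and remote != state.get(dataset_id)), remote

state = load_state()
stale, remote = needs_refresh("q38j-zgre", state)
if stale:
    # ... re-download and re-import the dataset here ...
    state["q38j-zgre"] = remote
    save_state(state)
```

A flat file also survives the script dropping and recreating the database, which a special table would not.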

derekeder commented 11 years ago

related to #2

mccc commented 11 years ago

On the topic of actually doing something once changed files have been detected (may need to be moved into a separate issue):

This is really something we're going to have to do very carefully. There may be an easy way to compare dataportal.building_footprints_old (or whatever) to a newly imported dataportal.building_footprints, but if there is a difference, its meaning needs to be worked out: were rows, and therefore buildings, added (the most likely scenario)? Did a field change for a single row (maybe someone added a porch to their house)? Were columns added (oops, should have used MongoDB)? And so on.
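As a starting point, here is a rough sketch of the row-level comparison using psycopg2 and SQL EXCEPT. The key column name ("gid") and the connection string are placeholders, and column-for-column EXCEPT assumes every column type, including the geometry, compares cleanly.

```python
# Rough sketch of a table-level diff between the old copy and a fresh import.
import psycopg2

conn = psycopg2.connect("dbname=edifice")  # connection string is illustrative
cur = conn.cursor()

# Rows present (or changed) in the fresh import but absent from the old copy.
cur.execute("""
    SELECT * FROM dataportal.building_footprints
    EXCEPT
    SELECT * FROM dataportal.building_footprints_old
""")
changed_or_new = cur.fetchall()

# Keys that are genuinely new mean buildings were added; keys present in both
# tables that still show up above mean a field changed for an existing row.
cur.execute("""
    SELECT gid FROM dataportal.building_footprints
    EXCEPT
    SELECT gid FROM dataportal.building_footprints_old
""")
new_keys = {row[0] for row in cur.fetchall()}
```

EXCEPT also errors out if the two tables' column lists differ, which at least makes the "columns were added" case fail loudly rather than silently.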

My pessimistic prediction is that it will be challenging to implement this at any higher level of abstraction than dataset by dataset, especially because the changes we care about are specific to the edifice.* tables (e.g. there are many values in Building_Footprints.shp whose changes we don't care about).

In terms of columns being added, this is why I am definitely a supporter of implementing a parallel, optional Mongo schema as well, maybe after we get a stable Postgres build working. I can see too many ways the Postgres schema will break or be touchy with any slight modification to the city's tables. It might even be worth chatting with the city to find out which datasets are most or least likely to change in structure in the near future. (It is also clear that internally, e.g. in WindyGrid, they are choosing to use document-oriented schemas themselves.)