project-open-data / csv-to-api

Proof of concept to dynamically generate RESTful APIs from static CSVs
http://labs.data.gov/csv-to-api/

Update of CSV file #25

Closed philippejuhel closed 9 years ago

philippejuhel commented 10 years ago

When I first upload a new file such as data.csv, everything is OK. If I then replace data.csv with a new version (with more rows, for example) and load the same URL in a browser, http://timmy.icam-toulouse.fr/i48/?source=http://timmy.icam-toulouse.fr/eleve/data.csv&format=json, I still get the content of the PREVIOUS version of the file, not the new content. Why? I even tried REMOVING data.csv, and it still works. Amazing.

waldoj commented 10 years ago

This is an interesting problem. In the ways this tool has been used so far, updates haven't been a problem, because it's been used for datasets that rarely change. The previous version is still being served because this tool caches data aggressively, using APC. APC stores the contents of data.csv in memory, so they don't have to be loaded off the hard drive on every request, which is much faster. Every hour, data.csv is reloaded from the original file. So even if you delete data.csv, as you discovered, the API will keep serving it for up to one hour.
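
To make that behavior concrete, here is a minimal sketch of a cache with a one-hour time-to-live (TTL). It is a plain in-memory stand-in, not the tool's actual code and not APC itself: the TtlCache class, the $loader callback, and the explicit $now argument are all hypothetical names chosen for illustration, and the sketch assumes PHP 7.4 or later.

```php
<?php
// Minimal in-memory stand-in for a user cache with a time-to-live (TTL).
// Hypothetical class for illustration; the real tool relies on APC instead.
final class TtlCache
{
    private int $ttlSeconds;
    private array $entries = []; // key => ['value' => string, 'expires' => int]

    public function __construct(int $ttlSeconds = 3600)
    {
        $this->ttlSeconds = $ttlSeconds;
    }

    // Return the cached value, or call $loader and cache its result
    // when the entry is missing or its TTL has elapsed.
    public function get(string $key, callable $loader, int $now): string
    {
        $entry = $this->entries[$key] ?? null;
        if ($entry !== null && $now < $entry['expires']) {
            return $entry['value']; // served from memory; the source is not read
        }
        $value = $loader();
        $this->entries[$key] = [
            'value'   => $value,
            'expires' => $now + $this->ttlSeconds,
        ];
        return $value;
    }
}

// Simulated source file whose contents change after the first load.
$source = "id,name\n1,Ann\n";
$loader = function () use (&$source): string {
    return $source;
};

$cache = new TtlCache(3600);
$first = $cache->get('data.csv', $loader, 0);    // loads and caches version 1
$source = "id,name\n1,Ann\n2,Bob\n";              // the file is replaced
$stale = $cache->get('data.csv', $loader, 1800); // 30 minutes later: still version 1
$fresh = $cache->get('data.csv', $loader, 3600); // TTL elapsed: version 2 is loaded
```

In this sketch, $stale still holds the old rows even though the source has changed, which mirrors what happens when data.csv is replaced or deleted while a cached copy is still live.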

There are several existing solutions to this problem:

  1. Wait. When the cache expires (anywhere from 0–60 minutes later), the file will be updated.
  2. Clear APC's cache. The PHP command for this is apc_clear_cache('user'). Just create a file that consists of <?php apc_clear_cache('user'); ?>, save it to your server, and load it in your browser.
  3. Restart your web server (e.g., Apache, Nginx). That will clear APC's cache.
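
As a sketch of option 2, the script below wraps the apc_clear_cache('user') call so that it only runs for requests coming from the server itself. The loopback check, the is_loopback() helper, and the clear-cache.php filename are assumptions added for illustration and were not proposed in this thread; whether a loopback check is sufficient depends on how your server is set up (for example, behind a reverse proxy every request may appear local).

```php
<?php
// clear-cache.php (hypothetical filename): clears APC's user cache,
// but only for requests that originate from the server itself.

// True when the address is the IPv4 or IPv6 loopback address.
function is_loopback(string $remoteAddr): bool
{
    return in_array($remoteAddr, ['127.0.0.1', '::1'], true);
}

// REMOTE_ADDR is unset when the script is run from the command line.
$remoteAddr = $_SERVER['REMOTE_ADDR'] ?? '';

if (!is_loopback($remoteAddr)) {
    http_response_code(403);
    echo "Forbidden\n";
} elseif (!function_exists('apc_clear_cache')) {
    // The APC extension is not loaded in this PHP installation.
    echo "APC is not available\n";
} else {
    apc_clear_cache('user');
    echo "User cache cleared\n";
}
```

With this in place, the cache could be cleared by requesting the script from the server itself, for instance over an SSH session, rather than from any browser.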

I cannot think of a way to force a refresh of the cache from within this program that wouldn't also allow users to abuse the service. For example, imagine a URL like http://example.com/api/?source=http://example.net/data.csv&refresh=true that reloads the file from disk. Anybody making such a request would use significantly more system resources than a regular query, and they might even have an incentive to do so, believing that they will get fresher data by avoiding the built-in caching. Technologically, making it possible to force a refresh of the cache is quite easy, but I can't think of a way to make it available only to the sysadmin, since our only interface is the URL.
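
The cost concern can be illustrated with a small simulation. This is only a sketch: handle_request(), the $loads counter, and the placeholder cache contents are hypothetical, and the real tool does not implement a refresh parameter.

```php
<?php
// Hypothetical request handler: a cache lookup with an optional refresh flag.
// $loads counts how often the expensive source fetch would run.
function handle_request(array $query, array &$cache, int &$loads): string
{
    $key = $query['source'];
    $refresh = ($query['refresh'] ?? '') === 'true';
    if ($refresh || !isset($cache[$key])) {
        $loads++; // stands in for fetching and parsing the CSV
        $cache[$key] = "contents of {$key}";
    }
    return $cache[$key];
}

$cache = [];
$loads = 0;
$url = 'http://example.net/data.csv';

// Ten ordinary requests: the source is loaded once, then served from cache.
for ($i = 0; $i < 10; $i++) {
    handle_request(['source' => $url], $cache, $loads);
}
$normalLoads = $loads;

// Ten requests with refresh=true: every one reloads the source.
for ($i = 0; $i < 10; $i++) {
    handle_request(['source' => $url, 'refresh' => 'true'], $cache, $loads);
}
$refreshLoads = $loads - $normalLoads;

echo "normal: {$normalLoads} load(s), refresh: {$refreshLoads} load(s)\n";
```

Ten cached requests cost a single load here, while ten refresh requests cost ten, which is why a publicly reachable refresh flag would undo most of the benefit of caching.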

gbinal commented 9 years ago

Thanks for the thorough follow-up, @waldoj. I'm going to go ahead and close this issue, but let us know if you have any other questions or needs, @philippejuhel.