opendata / CKAN-Multisite-Plans

Simplifying the process of launching an open data repository. [RETIRED]
Creative Commons Zero v1.0 Universal

1.5 Tracking API usage and other usage like disk space [Optional] #12

Open rossjones opened 9 years ago

rossjones commented 9 years ago
This is useful to the Cloud Admin, as measuring usage across different instances may become important for performance management.

Track API usage and other usage such as disk space, and display this in the admin interface.

rossjones commented 9 years ago

Tracking of API usage would be important, but I think the monitoring of other aspects is already well handled by other tools built specifically for the purpose.

Perhaps extending CKAN in a way that some stats can be more easily captured by one of these systems would be a better general approach?

jqnatividad commented 9 years ago

Maybe the project can make https://github.com/ckan/ckanext-googleanalytics more robust. Again, this is consistent with the Unix toolchain approach of leveraging best-of-breed tools like Google Analytics.

Though GA is "free as in beer" and not "free as in speech" open source software, it has become the de facto standard.

With that said, the team should consider exposing the webserver logs of a CKAN instance as a dataset. In an SBIR study we did earlier this year on open data, we found that there is no simple way to measure the downstream usage of a dataset, even though that usage is a big signal that both data publishers and advocates need in order to prioritize data.

The existing reports are simply too coarse: they show only total aggregate views and downloads, with no way to filter by place or time. Of course, there should be some mechanism to control who has access to the webserver logs dataset. Better aggregated reports, much better than the existing ones, can then be created and exposed to the general public.

From the webserver log dataset, you can even track whether businesses, citizens, apps, or other agencies are using the data. It can even be used to find downstream data users and automagically catalog them in CKAN's related items tab (e.g. visualizations/PDFs/sites using the https://github.com/BetaNYC/getDataButton, etc.)
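
To make the downstream-user idea concrete, here is a minimal sketch of how external referrers could be pulled out of an access log. It assumes the log is in Apache's "combined" format (where the referrer is the first quoted field after the status and size); the function name `external_referrers` and the `own_host` parameter are made up for illustration and are not part of CKAN.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumes Apache "combined" log lines:
#   host ident user [time] "METHOD path proto" status bytes "referer" "agent"
COMBINED = re.compile(
    r'"(?:[A-Z]+) (?P<path>\S+)[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)


def external_referrers(lines, own_host):
    """Count (requested path, referring host) pairs, skipping our own site.

    `own_host` is a hypothetical parameter: the hostname of the CKAN
    instance itself, so that internal navigation is not counted.
    """
    counts = Counter()
    for line in lines:
        match = COMBINED.search(line)
        if match is None:
            continue  # not a combined-format line; ignore it
        ref_host = urlsplit(match.group("referer")).hostname
        # A referer of "-" (none sent) yields no hostname and is skipped.
        if ref_host and ref_host != own_host:
            counts[(match.group("path"), ref_host)] += 1
    return counts
```

Referrer headers are optional and easy to spoof, so counts like these would only be a rough hint about who is linking to a dataset, not a reliable catalog.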

waldoj commented 9 years ago

> the team should consider exposing the webserver logs of a CKAN instance as a dataset

Huh. That's both clever and simple, my favorite combination of traits in an idea. :) Adding a new log config line to Apache could output a properly anonymized access log directly in the webroot. I like it!
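
As a rough illustration of what "anonymized" might mean here, the sketch below post-processes log lines in Python rather than in Apache config, which is a swap from what the comment proposes. It truncates the client address (last octet for IPv4, everything past the first 48 bits for IPv6). The prefix lengths and the function names are assumptions, and truncation alone may not be enough to count as properly anonymized: the ident/user fields, query strings, and user agents can also carry identifying data.

```python
import ipaddress
import re

# The client address is the first whitespace-delimited field in
# Common/Combined Log Format lines.
_FIRST_FIELD = re.compile(r"^(\S+)")


def anonymize_ip(address: str) -> str:
    """Zero the host part of an address (assumed prefixes: /24 and /48)."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return "-"  # not an IP literal (e.g. a resolved hostname): drop it
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def anonymize_line(line: str) -> str:
    """Return the log line with only its leading client address masked."""
    return _FIRST_FIELD.sub(lambda m: anonymize_ip(m.group(1)), line, count=1)
```

A filter like this could run before the log is published, so the raw addresses never reach the public copy.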

wardi commented 9 years ago

ckan-multisite has all requests going through a single HTTP router, so the access logs for all the sites can be aggregated or reported on really easily. I've opened a ticket to revisit this when we have some code to show: https://github.com/boxkite/ckan-multisite/issues/4
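
For a sense of how simple that aggregation could be, here is a sketch that counts total and API requests per site from one shared log. It is not taken from ckan-multisite: it assumes the router writes the site's hostname as the first field of each line (similar to Apache's `vhost_combined` format), and `usage_per_site` is a hypothetical name. The only CKAN-specific detail used is that API calls live under `/api/`.

```python
import re
from collections import defaultdict

# Assumed line shape: site hostname first, then a quoted request line, e.g.
#   demo.example.org 203.0.113.0 - - [time] "GET /api/3/action/... HTTP/1.1" 200 512
LINE = re.compile(r'^(?P<host>\S+) .*?"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*"')


def usage_per_site(lines):
    """Return {site: {"requests": n, "api_requests": m}} for a shared log."""
    usage = defaultdict(lambda: {"requests": 0, "api_requests": 0})
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like request records
        site = match.group("host").lower()
        usage[site]["requests"] += 1
        if match.group("path").startswith("/api/"):
            usage[site]["api_requests"] += 1
    return dict(usage)
```

The resulting per-site counts are the kind of figure that could be shown in the admin interface.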

jqnatividad commented 9 years ago

Great! We may also want to look at 18F's http://analytics.usa.gov for inspiration. Since we have full access logs, it doesn't directly apply, but once some instances "graduate" to their own dedicated open data installations, it may still be a way to aggregate high-level analytics.