Measure is a set of scripts and conventions for building KPI dashboards for projects.
We need to be more proactive in collecting useful data on the projects we run, and using this data to measure success and failure.
We also have an internal proof of concept of Measure collecting data from several different data sources for several different projects.
The data currently gets written to Google Sheets directly, and visualisation is provided by the visualisation features in Google Sheets.
We have demonstrated the value of this data collection as part of the project lifecycle for a range of internal and external stakeholders.
The main change here is having a clean, openly available codebase, and using a more suitable database and dashboard builder, as well as adding additional collectors.
Potentially, we'd love to see interest from other non-profits who receive funds to execute on projects, and would like a simple yet systematic way to collect data on what they do.
Each project has a `measure.source-spec.yaml` configuration file within a project directory in `/projects`, e.g. for the Frictionless Data project:
```
/projects/
├── frictionlessdata/
│   └── measure.source-spec.yaml
└── anotherproject/
    └── measure.source-spec.yaml
```
The YAML file defines the project name, and configuration settings for each data source we want to measure. Data sources are grouped by theme, e.g. `code-hosting`, `social-media`, and `code-packaging`. Under each theme is the specific configuration for each data source. Here is an example of the basic structure for a project configuration file:
```yaml
# measure.source-spec.yaml
project: frictionlessdata
config:
  code-hosting: # <------- theme
    github: # <---------- data source
      repositories: # <-- data source settings
        - "frictionlessdata/jsontableschema-models-js"
        - "frictionlessdata/datapackage-pipelines"
        [...]
  social-media:
    twitter:
      entities:
        - "#frictionlessdata"
        - "#datapackages"
        [...]
```
Below are the specific configuration settings for each type of data source.
The Github processor collects data about each repository listed in the `repositories` section. For each repository, the processor collects the number of:
```yaml
config:
  code-hosting:
    github:
      repositories:
        - "frictionlessdata/jsontableschema-models-js"
        - "frictionlessdata/datapackage-pipelines"
```
Requesting the number of issues and pull requests uses the Github search API (four requests per repository). Search API requests are rate-limited to 30 per minute for authenticated requests. By default, Measure waits 3 seconds before each search request to ensure it stays within the rate limit. If you know your configuration will stay within the rate limit (fewer than 8 repositories are defined), you can set the wait interval to `0` using the `MEASURE_GITHUB_REQUEST_WAIT_INTERVAL` env var (see below). This is not recommended.
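As a rough sketch of that arithmetic (illustrative only; this helper is not part of Measure's codebase): each repository costs 4 search requests, so a burst of up to 7 repositories (28 requests) fits within one minute, while larger configurations need spacing between requests.

```python
SEARCH_REQUESTS_PER_REPO = 4   # issue and pull request counts per repository
RATE_LIMIT_PER_MINUTE = 30     # authenticated Github search API limit

def min_safe_wait_seconds(num_repositories):
    """Smallest per-request wait (in seconds) that keeps one collection
    run within the search API rate limit."""
    total_requests = num_repositories * SEARCH_REQUESTS_PER_REPO
    if total_requests <= RATE_LIMIT_PER_MINUTE:
        return 0  # a single burst fits within one minute
    return 60 / RATE_LIMIT_PER_MINUTE  # 2.0s spacing caps us at 30/minute

print(min_safe_wait_seconds(7))   # 0   -> 28 requests, under the limit
print(min_safe_wait_seconds(10))  # 2.0 -> 40 requests must be spaced
```

Measure's default of 3 seconds is deliberately more conservative than the 2-second floor this calculation gives.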
The NPM processor collects data from the Node Package Manager (NPM) service where our Node and JavaScript projects are hosted for distribution. The processor collects the number of daily downloads for each package listed in the `packages` section of the project configuration. What is meant by 'downloads' is discussed in this blog post.
```yaml
config:
  code-packaging:
    npm:
      packages:
        - 'jsontableschema'
        - 'goodtables'
        - 'tableschema'
```
If no data has previously been collected for a particular package, the NPM processor will request daily data for all days since the beginning of the project.
The PyPI processor collects data from the Python Package Index (PyPI) where our Python projects are hosted for distribution. The processor collects the number of daily downloads for each package listed in the `packages` section of the project configuration.
```yaml
config:
  code-packaging:
    pypi:
      packages:
        - 'jsontableschema'
        - 'goodtables'
        - 'tableschema'
```
If no data has previously been collected for a particular package, the processor will request daily data from the start date of PyPI's BigQuery database (2016-01-22).

The PyPI processor requires a Google API account with generated credentials to make BigQuery queries.
The RubyGems processor collects ruby gem download data from the rubygems.org API. For each gem, the processor collects the `total_downloads` value, if present.

```yaml
config:
  code-packaging:
    rubygems:
      gems:
        - "tableschema"
        - "datapackage"
```
No historical download data is collected for RubyGems.
The Packagist processor collects PHP package daily download data from the packagist.org API.
```yaml
config:
  code-packaging:
    packagist:
      packages:
        - "frictionlessdata/tableschema"
        - "frictionlessdata/datapackage"
```
Note: `packages` defined in the config must include their owner organization, in the form `organization_name/package_name`.
Results from the Packagist.org API appear to be a couple of days behind.
The Twitter processor collects data about each entity listed in the `entities` section. Entities can be one of the following:

- `#hashtag`: a twitter hash tag
- `@account`: an account name
- `url:search-term`: a search term as part of a url

For each entity, the processor collects:
And additionally, for account entities:
Url search terms are used to find urls mentioned in tweets. It is best to leave off `http://` prefixes. Url searches for just the domain will be less specific (will return more results) than url searches that include a path, e.g. `url:blog.okfn.org` will return more results than the more specific search `url:blog.okfn.org/2017/`, which in turn will return more results than `url:blog.okfn.org/2017/06/15/the-final-global-open-data-index-is-now-live/`.
```yaml
config:
  social-media:
    twitter:
      entities:
        - "#frictionlessdata"
        - "#datapackages"
        - "@okfnlabs"
        - "url:frictionlessdata.io"
```
The Facebook processor collects data about each page listed in the `pages` section. For each page, the processor collects:
```yaml
config:
  social-media:
    facebook:
      pages:
        - "OKFNetwork"
```
Each page listed in the project config file will require a Facebook Page Access Token to be generated and added to the app's Environmental Variables.
1. Get Admin permissions for the Page you wish to track
2. Create a Facebook App
3. Create a Page Access Token
4. Extend the Access Token
5. Get a Permanent Page Access Token

Add this token to the Environmental Variables for the Measure application:

```
MEASURE_FACEBOOK_API_ACCESS_TOKEN_OKFNETWORK='{the OKFNetwork page token obtained above}'
```
The Google Analytics processor collects visitor data for each specified domain. For each domain, the following is collected:

Domains are specified in the project configuration and require the domain `url` and Google Analytics `viewid`.
```yaml
config:
  website-analytics:
    ga:
      domains:
        - url: 'frictionlessdata.io'
          viewid: '120195554'
        - url: 'specs.frictionlessdata.io'
          viewid: '57520245'
```
Each `viewid` can be found within your Google Analytics account. See this short video for guidance.
The Google Analytics processor requires a Google API account with the Google Analytics Reporting API enabled.
'Outputs' refers to secondary events and products related to a project, e.g. blog posts, talks given, or tangible uses of our products. These can be either internally produced, or external.
We capture these outputs manually using Google Forms, which writes the results to a Google Spreadsheet.
The Outputs processor requires a Google API account with generated credentials to read private Google Spreadsheet URLs. This is an explanation of what is collected by the processor:
Each outputs source is identified by `{sheetid}/{gid}`. The `sheetid` and `gid` are part of the spreadsheet URL:

```
https://docs.google.com/spreadsheets/d/{sheetid}/edit#gid={gid}
```
```yaml
# sheetid and gid correspond with the parts of the spreadsheet url:
# https://docs.google.com/spreadsheets/d/{sheetid}/edit#gid={gid}
config:
  outputs:
    - sheetid: "{sheetid from above}"
      gid: "{gid from above}"
      type: "external" # the type of outputs captured here
    - sheetid: "{another sheetid}"
      gid: "{another gid}"
      type: "internal"
```
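Pulling the two identifiers out of a spreadsheet URL can be sketched as below (a hypothetical helper, not part of Measure; the URL shown is made up for illustration):

```python
import re

def parse_sheet_url(url):
    """Extract (sheetid, gid) from a Google Spreadsheet URL of the form
    https://docs.google.com/spreadsheets/d/{sheetid}/edit#gid={gid}"""
    match = re.search(r"/spreadsheets/d/([^/]+)/edit#gid=(\d+)", url)
    if match is None:
        raise ValueError("not a recognised Google Spreadsheet URL: %s" % url)
    return match.groups()

sheetid, gid = parse_sheet_url(
    "https://docs.google.com/spreadsheets/d/1AbC-xyz/edit#gid=123456")
print(sheetid)  # 1AbC-xyz
print(gid)      # 123456
```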
The MailChimp processor collects email list data each day. For each list the following is collected:
The processor will attempt to collect historic data up to the creation date of the list. Complete historic data is collected for `subs`, `unsubs`, and `campaigns_sent`. Partial historic data is collected for `subscribers`: once for the last day of each month when collecting historic data.
List ids are added to the project config file:
```yaml
config:
  email:
    mailchimp:
      lists:
        - 'my-mailchimp-list-id'
        - 'another-mailchimp-list-id'
```
A MailChimp API key must be defined as an environmental variable. See below for details.
The Discourse processor collects daily forum data from an instance of a Discourse forum. For each domain listed in the config the following is collected:
The processor will collect historic data for all properties up to 2014-01-01 (except `active_users`).
```yaml
config:
  forums:
    discourse:
      domains:
        - 'discourse.example.com'
```
Closely related to the Forums pipeline above, this processor provides a table to collect data about specific categories within a forum.
This processor collects daily data for each specified category within a Discourse forum domain. Each category listed for a domain will have the following data collected:
```yaml
config:
  forum-categories:
    discourse-categories:
      - domain: 'discourse.example.com'
        categories:
          - name: 'my-toplevel-category' # will collect subcategory data
            children: 'expand'
          - name: 'another-toplevel-category' # will aggregate subcategory data
            children: 'aggregate'
          - name: 'a-category' # no `children` defined, will ignore subcategory data
```
Each category needs to specify a `name`, which is the slugified version of the category name (as it appears in the Discourse forum url), and an optional `children` parameter. The `children` parameter determines how subcategories of the specified category are treated. It can accept one of:

- `none`: Do not collect any data about subcategories of `name`. This is the default behaviour.
- `aggregate`: Collect data for each subcategory of `name`, and add it to the appropriate value on `name`.
- `expand`: Collect data for each subcategory of `name`, and add them as separate rows, as if they had been explicitly defined.

Each installation of Measure requires certain environmental variables to be set.
- `MEASURE_DB_ENGINE`: Location of SQL database as a URL schema
- `MEASURE_TIMESTAMP_DEFAULT_FORMAT`: datetime format used for the `timestamp` value. Currently must be `%Y-%m-%dT%H:%M:%SZ`.
- `MEASURE_GITHUB_API_BASE_URL`: Github API base url (`https://api.github.com`)
- `MEASURE_GITHUB_API_TOKEN`: Github API token used for making requests
- `MEASURE_GITHUB_REQUEST_WAIT_INTERVAL`: Wait interval in seconds between Github search requests (optional, default is 3)
- `MEASURE_TWITTER_API_CONSUMER_KEY`: Twitter app API consumer key
- `MEASURE_TWITTER_API_CONSUMER_SECRET`: Twitter app API consumer secret
- `MEASURE_FACEBOOK_API_ACCESS_TOKEN_{PAGE NAME IN UPPERCASE}`: The page access token obtained from How to get a Facebook Page Access Token

See the PyPI Big Query API instructions above to get the values for these env vars:

- `MEASURE_GOOGLE_API_PROJECT_ID`: {project_id}
- `MEASURE_GOOGLE_API_JWT_AUTH_PROVIDER_X509_CERT_URL`: {auth_provider_x509_cert_url}
- `MEASURE_GOOGLE_API_JWT_AUTH_URI`: {auth_uri}
- `MEASURE_GOOGLE_API_JWT_CLIENT_EMAIL`: {client_email}
- `MEASURE_GOOGLE_API_JWT_CLIENT_ID`: {client_id}
- `MEASURE_GOOGLE_API_JWT_CLIENT_X509_CERT_URL`: {client_x509_cert_url}
- `MEASURE_GOOGLE_API_JWT_PRIVATE_KEY`: {private_key}
- `MEASURE_GOOGLE_API_JWT_PRIVATE_KEY_ID`: {private_key_id}
- `MEASURE_GOOGLE_API_JWT_TOKEN_URI`: {token_uri}
- `MEASURE_GOOGLE_API_JWT_TYPE`: {type}
- `MEASURE_MAILCHIMP_API_TOKEN`: {mailchimp_api_key} (note: must include the data center code, e.g. `123abc456def-dc1`, where `dc1` is the data center code)
- `MEASURE_DISCOURSE_API_TOKEN`: {discourse_api_token} used to access `/admin` endpoints
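One way to catch a misconfigured installation early is to check for required variables at startup. This sketch is illustrative only (not part of Measure); the subset of variables checked is an assumption, and a real deployment would list every variable its configured processors need:

```python
import os

# A few of the always-required variables from the list above.
REQUIRED_ENV_VARS = [
    "MEASURE_DB_ENGINE",
    "MEASURE_TIMESTAMP_DEFAULT_FORMAT",
    "MEASURE_GITHUB_API_BASE_URL",
    "MEASURE_GITHUB_API_TOKEN",
]

def missing_env_vars(environ=os.environ):
    """Return the names of required variables absent from `environ`."""
    return [name for name in REQUIRED_ENV_VARS if name not in environ]

# Example with a deliberately incomplete environment:
fake_environ = {"MEASURE_DB_ENGINE": "postgresql://localhost/measure"}
print(missing_env_vars(fake_environ))
```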