MIT License

Measure


What

Measure is a set of scripts and conventions for building KPI dashboards for projects.

Why

We need to be more proactive in collecting useful data on the projects we run, and using this data to measure success and failure.

Context

We also have an internal proof of concept of Measure collecting data from several different data sources for several different projects.

The data currently gets written to Google Sheets directly, and visualisation is provided by the visualisation features in Google Sheets.

We have demonstrated the value of this data collection as part of the project lifecycle for a range of internal and external stakeholders.

The main change here is having a clean, openly available codebase, and using a more suitable database and dashboard builder, as well as adding additional collectors.

We'd love to see interest from other non-profits that receive funds to execute projects and would like a simple yet systematic way to collect data on what they do.

Project Configuration

Each project has a measure.source-spec.yaml configuration file within a project directory in /projects, e.g. for the Frictionless Data project:

/projects/
├── frictionlessdata
│   └── measure.source-spec.yaml
└── anotherproject
    └── measure.source-spec.yaml

The YAML file defines the project name, and configuration settings for each data source we want to measure. Data sources are grouped by theme, e.g. code-hosting, social-media, and code-packaging. Under each theme is the specific configuration for each data source. Here is an example of the basic structure for a project configuration file:

# measure.source-spec.yaml

project: frictionlessdata

config:
  code-hosting: # <------- theme
    github:  # <---------- data source
      repositories:  # <-- data source settings
        - "frictionlessdata/jsontableschema-models-js"
        - "frictionlessdata/datapackage-pipelines"
        [...]

  social-media:
    twitter:
      entities:
        - "#frictionlessdata"
        - "#datapackages"
        [...]

Below are the specific configuration settings for each type of data source.

Code Hosting

Github

The Github processor collects data about each repository listed in the repositories section. For each repository, the processor collects repository metrics, including the number of issues and pull requests.

config:
  code-hosting:
    github:
      repositories:
        - "frictionlessdata/jsontableschema-models-js"
        - "frictionlessdata/datapackage-pipelines"

Requesting the number of issues and pull requests uses the GitHub Search API (four requests per repository). Search API requests are rate-limited to 30 per minute for authenticated requests. By default, Measure waits 3 seconds before each search request to stay within the rate limit. If you know your configuration will stay within the rate limit (fewer than 8 repositories defined), you can set the wait interval to 0 with the MEASURE_GITHUB_REQUEST_WAIT_INTERVAL env var (see below). This is not recommended.
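A quick sanity check on the arithmetic above (the figures come from the text; the function names are illustrative, not part of Measure):

```python
# Figures from the text above: the GitHub Search API allows 30 authenticated
# requests per minute, and Measure issues 4 search requests per repository.
RATE_LIMIT = 30          # search requests per minute
REQUESTS_PER_REPO = 4

def fits_without_wait(num_repos):
    """True if all search requests for num_repos fit in one rate-limit window."""
    return num_repos * REQUESTS_PER_REPO <= RATE_LIMIT

def requests_per_minute(wait_seconds):
    """Request rate when waiting wait_seconds before each request."""
    return 60 / wait_seconds if wait_seconds else float("inf")
```

With 7 repositories (28 requests) the limit holds with no wait; with 8 (32 requests) it does not, and the default 3-second wait caps the rate at 20 requests per minute.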

Code Packaging

NPM

The NPM processor collects data from the Node Package Manager (NPM) service, where our Node and JavaScript projects are hosted for distribution. The processor collects the number of daily downloads for each package listed in the packages section of the project configuration. What is meant by 'downloads' is discussed in this blog post.

config:
  code-packaging:
    npm:
      packages:
        - 'jsontableschema'
        - 'goodtables'
        - 'tableschema'

If no data has previously been collected for a particular package, the NPM processor will request daily data for all days since the beginning of the project.
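A backfill over the whole project history usually has to be split into smaller per-request date ranges. A minimal sketch of that chunking, assuming a per-request cap (the 365-day default is an assumption, not a documented npm limit):

```python
from datetime import date, timedelta

def backfill_ranges(start, end, max_days=365):
    """Split [start, end] into consecutive (from, to) date ranges of at most
    max_days days each, suitable for issuing one downloads query per range."""
    ranges = []
    cur = start
    while cur <= end:
        stop = min(cur + timedelta(days=max_days - 1), end)
        ranges.append((cur, stop))
        cur = stop + timedelta(days=1)
    return ranges
```

The ranges are contiguous and non-overlapping, so daily counts can be concatenated directly.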

PyPI

The PyPI processor collects data from the Python Package Index (PyPI) where our Python projects are hosted for distribution. The processor collects the number of daily downloads for each package listed in the packages section of the project configuration.

config:
  code-packaging:
    pypi:
      packages:
        - 'jsontableschema'
        - 'goodtables'
        - 'tableschema'        

If no data has previously been collected for a particular package, the processor will request daily data from the start date of PyPI's BigQuery database (2016-01-22).
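As a sketch of what such a BigQuery backfill query could look like, assuming the public PyPI downloads dataset (the table and column names here are assumptions about that dataset, not taken from Measure's code):

```python
PYPI_BIGQUERY_START = "2016-01-22"  # earliest date in PyPI's BigQuery data, per the text

def build_downloads_query(package, start_date, end_date,
                          table="bigquery-public-data.pypi.file_downloads"):
    """Build a daily-downloads SQL query for one package.

    The default table is the public PyPI downloads dataset; adjust to your setup.
    """
    return f"""
        SELECT DATE(timestamp) AS day, COUNT(*) AS downloads
        FROM `{table}`
        WHERE file.project = '{package}'
          AND DATE(timestamp) BETWEEN '{start_date}' AND '{end_date}'
        GROUP BY day
        ORDER BY day
    """
```

The resulting string can be passed to a BigQuery client authorised with the service-account credentials described below.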

PyPI Configuration

The PyPI processor requires a Google API account with generated credentials to make BigQuery queries.

  1. Go to your Google Cloud Platform Console
  2. Pick or create the project you want
  3. Go to Google APIs, click Enable API, search for BigQuery API, and click ENABLE
  4. Click Go To Credentials, and choose Service Account Credentials
  5. Click the options symbol for the App Engine default service account, and click Create Key
  6. Choose JSON as the key type
  7. The downloaded file contains all the credentials you need. Keep them safe, and use them to populate the environmental variables below.

RubyGems

The RubyGems processor collects ruby gem download data from the rubygems.org API.

config:
  code-packaging:
    rubygems:
      gems:
        - "tableschema"
        - "datapackage"

No historical download data is collected for RubyGems.

Packagist

The Packagist processor collects PHP package daily download data from the packagist.org API.

config:
  code-packaging:
    packagist:
      packages:
        - "frictionlessdata/tableschema"
        - "frictionlessdata/datapackage"

Note: packages defined in the config must include their owner organization in the form organization_name/package_name.
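A small config check can catch entries that are missing the owner organization before any API calls are made. A minimal sketch (the allowed character set is an assumption; Packagist's exact naming rules may differ):

```python
import re

# Per the note above, Packagist entries must be "organization_name/package_name".
_PACKAGE_RE = re.compile(r"^[a-z0-9_.-]+/[a-z0-9_.-]+$")

def invalid_packagist_packages(packages):
    """Return the config entries that do not match organization_name/package_name."""
    return [p for p in packages if not _PACKAGE_RE.match(p)]
```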

Results from the Packagist.org API appear to be a couple of days behind.

Social Media

Twitter

The Twitter processor collects data about each entity listed in the entities section. An entity can be a hashtag (e.g. #frictionlessdata), an account (e.g. @okfnlabs), or a url: search term (e.g. url:frictionlessdata.io).

For each entity, the processor collects daily metrics; additional metrics are collected for account entities.

URL search terms are used to find URLs mentioned in tweets. It is best to leave off the http:// prefix. A URL search for just a domain will be less specific (will return more results) than a search that includes a path, e.g. url:blog.okfn.org will return more results than the more specific url:blog.okfn.org/2017/, which in turn will return more results than url:blog.okfn.org/2017/06/15/the-final-global-open-data-index-is-now-live/.

config:
  social-media:
    twitter:
      entities:
        - "#frictionlessdata"
        - "#datapackages"
        - "@okfnlabs"
        - "url:frictionlessdata.io"

Facebook

The Facebook processor collects data about each page listed in the pages section. For each page, the processor collects daily page metrics.

config:
  social-media:
    facebook:
      pages:
        - "OKFNetwork"

Each page listed in the project config file will require a Facebook Page Access Token to be generated and added to the app's Environmental Variables.

How to get a Facebook Page Access Token
  1. Get Admin permissions for the Page you wish to track:

    • Go to the Page's settings page
    • Choose the pane Page Roles
    • Add the User that sets the token as an Analyst or above
  2. Create a Facebook App:

    • Note: If you already have a Facebook App for an existing Measure project, you can reuse it and skip this step
    • Go to Facebook Developers
    • On the upper right menu, select My Apps select Add a New App
    • Fill in the details (app name and email address), and click Create App ID
  3. Create a Page Access Token:

    • Go to Facebook API Explorer
    • On the top-right, choose the application you created previously
    • Below it, open the Get Token dropdown, and choose Get User Token
    • In the opened window, check the read_insights and manage_pages permissions
    • Click on Get Access Token, and approve
    • Now open again the Get Token dropdown, and choose Get Page Access Token
    • In the dialog, give the app the permissions it requires (particularly, manage pages)
    • Choose the page you wish to track from the dropdown
    • You now have a short-lived access token, which needs to be extended
  4. Extend Access Token:

    • Still in the same view of the API Explorer, next to the Access Token that appeared, click on the blue exclamation mark
    • You'll see the Token's info. Click on Open in Access Token Tool
    • In the window, Click on Extend Access Token
    • You can use this token, but note its expiration date: by then you'll need to either extend it or replace it. Or proceed to the next step to obtain a permanent page access token.
  5. Get Permanent Page Access Token

    • Go to Graph API Explorer
    • Select your app in Application
    • Paste the long-lived access token into Access Token
    • Next to Access Token, choose the page you want an access token for. The access token appears as a new string.
    • Click i to see the properties of this access token
    • Click “Open in Access Token Tool” button again to open the “Access Token Debugger” tool to check the properties
  6. Add this token to the Environmental Variables for the Measure application

    • Each page must have its own env var to store its token. e.g. for the OKFNetwork page: MEASURE_FACEBOOK_API_ACCESS_TOKEN_OKFNETWORK='{the OKFNetwork page token obtained above}'
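Deriving the env var name from the page name can be sketched as follows. Only the OKFNetwork example appears in the text, so the handling of non-alphanumeric characters here is an assumption:

```python
import re

def facebook_token_env_var(page_name):
    """Build the env var name holding a page's access token, following the
    MEASURE_FACEBOOK_API_ACCESS_TOKEN_<PAGE> pattern shown above.

    Non-alphanumeric characters are replaced with underscores (an assumption).
    """
    suffix = re.sub(r"[^A-Za-z0-9]", "_", page_name).upper()
    return f"MEASURE_FACEBOOK_API_ACCESS_TOKEN_{suffix}"
```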

Website Analytics

Google Analytics

The Google Analytics processor collects daily visitor data for each specified domain.

Domains are specified in the project configuration and require the domain url and Google Analytics viewid.

config:
  website-analytics:
    ga:
      domains:
        - url: 'frictionlessdata.io'
          viewid: '120195554'
        - url: 'specs.frictionlessdata.io'
          viewid: '57520245'

Each viewid can be found within your Google Analytics account. See this short video for guidance.

Google Analytics Configuration

The Google Analytics processor requires a Google API account with the Google Analytics Reporting API enabled.

  1. Enable Google Analytics Reporting API:
    • Go to your Google Cloud Platform Console
    • Pick the project you are using
    • Go to API Manager/Dashboard
    • Click on Enable API, search for Google Analytics Reporting API, click ENABLE
  2. Give Measure credentials to the websites' analytics you'd like to track:
    • Add the service account email to the list of users that has read permissions in the given analytics' accounts

Outputs

'Outputs' refers to secondary events and products related to a project, e.g. blog posts, talks given, or tangible uses of our products. These can be either internally produced, or external.

We capture these outputs manually using Google Forms, which writes the results to a Google Spreadsheet.

Outputs Captured by Google Forms

The Outputs processor requires a Google API account with generated credentials to read private Google Spreadsheets. To configure the processor:

  1. Make a copy of the Outputs Form template for your project (https://docs.google.com/a/okfn.org/forms/d/e/1FAIpQLSfQuBlwZMnWhGjCv4teAMdsKQ3pgbAi08ZwKBtZLAQFw7LqDg/viewform)
  2. Configure the associated spreadsheet destination where captured data will be written to. This can be found within the 'Responses' tab for the form, within the settings dropdown > 'Select response destination'
  3. Go to the form's spreadsheet and make a note of the sheetid and gid, which are part of the spreadsheet URL: https://docs.google.com/spreadsheets/d/{sheetid}/edit#gid={gid}
  4. Ensure the spreadsheet can be read by the Google API service account that is being used to authorise requests, either by making the spreadsheet public, or by sharing it with the email associated with the service account (defined in the generated credentials)
  5. Configure the Measure project with an entry for the Outputs processor:
# sheetid and gid correspond with the parts of the spreadsheet url:
# https://docs.google.com/spreadsheets/d/{sheetid}/edit#gid={gid}

config:
  outputs:
    - sheetid: "{sheetid from above}"
      gid: "{gid from above}"
      type: "external"  # the type of outputs captured here
    - sheetid: "{another sheetid}"
      gid: "{another gid}"
      type: "internal"

Email Campaigns

MailChimp

The MailChimp processor collects email list data each day. For each list, the processor collects counts of subscribers, subs, unsubs, and campaigns sent.

The processor will attempt to collect historic data up to the creation date of the list. Complete historic data is collected for subs, unsubs, and campaigns_sent. Partial historic data is collected for subscribers: when backfilling, only the value on the last day of each month is recorded.
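The month-end backfill dates can be generated as in this sketch (illustrative helper, not part of Measure):

```python
import calendar
from datetime import date

def month_end_dates(start, end):
    """Last day of each month between start and end inclusive: the dates on
    which historic subscriber counts are recorded, per the text above."""
    dates = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last = date(year, month, calendar.monthrange(year, month)[1])
        if start <= last <= end:
            dates.append(last)
        month += 1
        if month == 13:
            year, month = year + 1, 1
    return dates
```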

List ids are added to the project config file:

config:
  email:
    mailchimp:
      lists:
        - 'my-mailchimp-list-id'
        - 'another-mailchimp-list-id'

A MailChimp API key must be defined as an environmental variable. See below for details.

Forums

Discourse (Instance)

The Discourse processor collects daily forum data from an instance of a Discourse forum, for each domain listed in the config.

The processor will collect historic data for all properties (except active_users) back to 2014-01-01.

config:
  forums:
    discourse:
      domains:
        - 'discourse.example.com'

Forum Categories

Closely related to the Forums pipeline above, this pipeline provides a table to collect data about specific categories within a forum.

Discourse Categories

This processor collects daily data for each specified category within a Discourse forum domain.

config:
  forum-categories:
    discourse-categories:
      - domain: 'discourse.example.com'
        categories:
          - name: 'my-toplevel-category' # will collect subcategory data
            children: 'expand'
          - name: 'another-toplevel-category' # will aggregate subcategory data
            children: 'aggregate'
          - name: 'a-category' # no `children` defined, will ignore subcategory data

Each category needs to specify a name, which is the slugified version of the category name (as it appears in the Discourse forum url), and an optional children parameter. The children parameter determines how subcategories of the specified category are treated: 'expand' collects data for each subcategory separately, while 'aggregate' aggregates subcategory data into the parent category. If children is not defined, subcategory data is ignored.
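Resolving the children parameter for a category entry from the config can be sketched as follows (illustrative helper, not part of Measure):

```python
def subcategory_mode(category):
    """Resolve how subcategories are handled for one category config entry:
    'expand', 'aggregate', or 'ignore' when children is not defined."""
    mode = category.get("children")
    if mode is None:
        return "ignore"
    if mode in ("expand", "aggregate"):
        return mode
    raise ValueError(f"invalid children value: {mode!r}")
```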

Environmental Variables

Each installation of Measure requires certain environmental variables to be set.

General

Github

Twitter

Facebook

Google credentials for PyPI, Google Analytics, and Outputs

See the PyPI BigQuery API instructions above to get the values for these env vars:

MailChimp

Discourse