Spec

kamicut commented 7 years ago

A dataset is added to the catalog as a TOML file with metadata fields. The spec is adapted from Data Packages for interoperability. This is a good example of a data package, and OKFN has built a lot of tools in the ecosystem. TOML is chosen for readability.

Metadata

name: alphanumeric string, can contain -, _, .
resources: array of data objects that follow the resource information spec below
title: short sentence that describes the dataset
description: a longer description of the dataset
keywords: an array of keywords
license: string that indicates the license
author: object that can contain the fields name, email, web
homepage: URL to the dataset's web site

Resource Information

url: URL to the source
name: alphanumeric string, can contain -, _, .
title: short sentence that describes the resource
description: a longer description of the resource
format: the file extension

If the format is tabular (CSV), these are additional fields:

sample_url: a sample of the resource
schema: an array of objects that have the keys "name" and "type" (we can make this more complete if we follow the JSON table schema)

Example

name = "gdp"
title = "Country, Regional and World GDP (Gross Domestic Product)"
description = "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
license = "PDDL-1.0"
keywords = [ "GDP", "World", "Gross Domestic Product", "Time series"]

[[resources]]
  name = "gdp"
  url = "https://raw.github.com/datasets/gdp/master/data/gdp.csv"
  format = "csv"

  [[resources.schema]]
  name = "Country Name"
  type = "string"

  [[resources.schema]]
  name = "Country Code"
  type = "string"

  [[resources.schema]]
  name = "Year"
  type = "date"

  [[resources.schema]]
  name = "Value"
  type = "number"

[[resources]]
  name = "gdp-market-prices"
  url = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=xml"
  title = "GDP at market prices (current US$)"
  format = "XML"

We should define which fields are required and which are optional. Are there any more fields that we might need?
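To make the required/optional question concrete, here is a minimal sketch of a check over parsed metadata. The split into required fields is hypothetical (the thread has not settled it yet), and the name pattern is one reading of "alphanumeric string, can contain -, _, ."; `check_metadata` is an illustrative helper, not part of any existing validator.

```python
import re

# Assumption: "alphanumeric string, can contain -, _, ." means one or more
# ASCII letters, digits, hyphens, underscores, or dots.
NAME_PATTERN = re.compile(r"[A-Za-z0-9._-]+")

# Hypothetical choice; the thread has not decided which fields are required.
REQUIRED_FIELDS = ("name", "title", "resources")


def check_metadata(metadata):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in metadata:
            problems.append(f"missing required field: {field}")
    name = metadata.get("name")
    if isinstance(name, str) and not NAME_PATTERN.fullmatch(name):
        problems.append(f"invalid name: {name!r}")
    # Each resource is assumed to need at least a name and a url.
    for index, resource in enumerate(metadata.get("resources", [])):
        for field in ("name", "url"):
            if field not in resource:
                problems.append(f"resources[{index}] is missing {field}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a maintainer fix everything in one pass.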

menayat commented 7 years ago

I suggest to add the following fields to the metadata:

release date: the known date of publication of the source dataset
update/ modification date: the last modification date (generated automatically by the validator)
period of time: the period of time covered by the dataset
update frequency: the expected/known frequency of update
maintainer: the details of the entity responsible for scraping/maintaining the dataset; it can be an object that contains the fields name, email, web
simple url: a shortened url for a dataset that can be displayed on the site

I also suggest keeping the dates in the metadata in the Gregorian calendar. The dates inside the datasets can be in the Persian calendar.

In terms of bilingual fields, all the fields will be bilingual apart from:

Metadata

resources
homepage
release date
update/ modification date
simple url

Resource Information

url
format
sample_url

kamicut commented 7 years ago

@menayat thanks for contributing, those sound like great additions. I have a few questions that I've been thinking about:

Should we indicate which keys are optional/required? For example, I can see "title" being required, but maybe the "release date" cannot always be found.
I'm not sure what you mean by simple URL? Is it just for display purpose? In that case we can always have the link be called "Download" or "View Source Data" and the link be clickable to visit the site / download the content. Unless you envision some other way to share / visualize the link?
For bilingual fields, and adding from a discussion with him, we can have fields such as title_en and title_fa. What are your thoughts on having a default key such as title, and what language should that be in? Alternatively, we can have no default key and have the validation enforce that a language suffix (such as _en) is attached to those keys.
A bilingual key must have both language keys: What happens in the validation pipeline if one of the keys doesn't have a bilingual counterpart? Should the validation fail until the maintainer translates all the keys? If not, what would the website text be for that key (e.g: "No content", "Translation needed", etc).

menayat commented 7 years ago

@kamicut Thanks for reviewing and the questions. Please see my responses below:

I agree that there might not be any input for some of the fields. My suggestion is to keep them mandatory anyway and enter "unknown" when, for instance, we don't know the 'release date' or 'update frequency'. This way we can communicate to the user that we couldn't find the value for these fields, rather than it being an omission on our side.
By 'simple url', I mean what is going to be displayed in the url for the individual dataset page. I thought it would be better to have a human readable, properly formatted url schema, rather than having a numerical ID or a chopped title. It can also be used in the API calls.
It's hard to decide on the language syntax as I don't know what the potential implications might be. However, I now have a slight preference for having the translations inside each field, like:

"titles": {
  "en": "dataset title",
  "fa": "عنوان داده"
}

I think all the fields should be mandatory for both languages. A dataset should not be validated until all the metadata fields are added in both languages. We can revise this policy later on.
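The "both languages are mandatory" policy could be checked roughly as follows. This is only a sketch: it assumes bilingual values are stored as objects keyed by language code, as in the "titles" example, and the list of bilingual fields and the helper name `missing_translations` are illustrative.

```python
LANGUAGES = ("en", "fa")

# Assumption: these fields hold {"en": ..., "fa": ...} objects; the actual
# list of bilingual fields is still under discussion in this thread.
BILINGUAL_FIELDS = ("title", "description")


def missing_translations(metadata):
    """List every "field.lang" entry that is absent or empty."""
    missing = []
    for field in BILINGUAL_FIELDS:
        translations = metadata.get(field)
        if not isinstance(translations, dict):
            translations = {}
        for lang in LANGUAGES:
            text = translations.get(lang)
            # Whitespace-only text counts as missing.
            if not isinstance(text, str) or not text.strip():
                missing.append(f"{field}.{lang}")
    return missing
```

Validation would fail whenever the returned list is non-empty, and the entries can double as the error messages shown to the maintainer.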

And one question:

Do we need to have the homepage as a field if we are capturing the web address in the author's field?

menayat commented 7 years ago

@kamicut An afterthought: we don't need simple url. The field "title" will do the job.

kamicut commented 7 years ago

@menayat Ah, I get it now; yes, we can generate the URL ourselves using the title or some other identifier. I was thinking that the author would be the name of the organization providing the dataset, distinct from the organization's website.
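A minimal sketch of generating the URL from an identifier, assuming a hypothetical `/datasets/` path prefix and a simple lowercase-and-hyphenate rule; neither is specified in this thread.

```python
import re


def dataset_path(identifier):
    """Build a URL path from a dataset name or title."""
    slug = identifier.strip().lower()
    # Collapse each run of characters outside a-z, 0-9, '.', '_' into one hyphen.
    slug = re.sub(r"[^a-z0-9._]+", "-", slug).strip("-")
    # "/datasets/" is a hypothetical prefix used only for illustration.
    return f"/datasets/{slug}"
```

A name such as "gdp" passes through unchanged, while a full title is reduced to a hyphenated form.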

menayat commented 7 years ago

@kamicut So can we say that the web in the author field is a reference to the Organization website and the homepage is the link to the original dataset?

kamicut commented 7 years ago

@menayat correct, I thought that those could potentially be two different websites; however I'm not sure if the case arises in practice. Is it your opinion that we should simplify it and keep it one URL?

menayat commented 7 years ago

@kamicut Yes, it will be highly unlikely. We can keep the author to just capture the name and have the link to the dataset in the homepage.

kamicut commented 7 years ago

Here's an updated spec example using these additions: Title will be bilingual, but 'name' will be the short title that will be used in the URL, so it doesn't need to be bilingual. I kept keywords, license and author as a single language for simplicity but we can change that too. If a resource is a CSV, I kept the header fields as a single language as well.

Human Readable TOML:

license = "PDDL-1.0"
keywords = [ "GDP", "World", "Gross Domestic Product", "Time series"]
created_at = "2016-07-27T21:36:26.161Z"
updated_at = "2016-07-27T21:36:26.161Z"
author = "Organization Name"
homepage = "http://example.com/dataset/"

name = "gdp"

[[title]]
lang = "en"
title = "Country, Regional and World GDP (Gross Domestic Product)"

[[title]]
lang = "fa"
title = "Country, Regional and World GDP (Gross Domestic Product)"

[[description]]
lang = "fa"
description = "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."

[[description]]
lang = "en"
description = "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."

[[resources]]

  url = "https://raw.github.com/datasets/gdp/master/data/gdp.csv"
  name = "gdp"

  [[resources.title]]
  lang = "en"
  title = "Country, Regional and World GDP (Gross Domestic Product)"

  [[resources.title]]
  lang = "fa"
  title = "Country, Regional and World GDP (Gross Domestic Product)"

  [resources.schema]
  format = "csv"

      [[resources.schema.fields]]
      name = "Country Name"
      type = "string"

      [[resources.schema.fields]]
      name = "Country Code"
      type = "string"

      [[resources.schema.fields]]
      name = "Year"
      type = "date"

      [[resources.schema.fields]]
      name = "Value"
      type = "number"

[[resources]]
  name = "gdp-market-prices"
  url = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=xml"

  [[resources.title]]
  lang = "fa"
  text = "GDP at market prices (current US$)"

  [[resources.title]] 
  lang = "en"
  text = "GDP at market prices (current US$)"

  [resources.schema]
  format = "XML"

This will generate a JSON like this:

{
  "license": "PDDL-1.0",
  "keywords": [
    "GDP",
    "World",
    "Gross Domestic Product",
    "Time series"
  ],
  "created_at": "2016-07-27T21:36:26.161Z",
  "updated_at": "2016-07-27T21:36:26.161Z",
  "author": "Organization Name",
  "homepage": "http://example.com/dataset/",
  "name": "gdp",
  "title": [
    {
      "lang": "en",
      "title": "Country, Regional and World GDP (Gross Domestic Product)"
    },
    {
      "lang": "fa",
      "title": "Country, Regional and World GDP (Gross Domestic Product)"
    }
  ],
  "description": [
    {
      "lang": "fa",
      "description": "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
    },
    {
      "lang": "en",
      "description": "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
    }
  ],
  "resources": [
    {
      "url": "https://raw.github.com/datasets/gdp/master/data/gdp.csv",
      "name": "gdp",
      "title": [
        {
          "lang": "en",
          "title": "Country, Regional and World GDP (Gross Domestic Product)"
        },
        {
          "lang": "fa",
          "title": "Country, Regional and World GDP (Gross Domestic Product)"
        }
      ],
      "schema": {
        "format": "csv",
        "fields": [
          {
            "name": "Country Name",
            "type": "string"
          },
          {
            "name": "Country Code",
            "type": "string"
          },
          {
            "name": "Year",
            "type": "date"
          },
          {
            "name": "Value",
            "type": "number"
          }
        ]
      }
    },
    {
      "name": "gdp-market-prices",
      "url": "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=xml",
      "title": [
        {
          "lang": "fa",
          "text": "GDP at market prices (current US$)"
        },
        {
          "lang": "en",
          "text": "GDP at market prices (current US$)"
        }
      ],
      "schema": {
        "format": "XML"
      }
    }
  ]
}

I haven't written the text fields in Farsi because I don't have a Farsi keyboard 😄 , but since the spec is just text it works. cc @menayat
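For reference, the TOML-to-JSON step can be reproduced with a short script. This is a sketch, not the catalog's actual pipeline: it assumes Python 3.11+ for the standard-library `tomllib` (falling back to the third-party `tomli` backport if that is installed), and `SAMPLE` is a trimmed version of the example above.

```python
import json

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumes the tomli backport is installed

# Trimmed version of the example above, kept short for illustration.
SAMPLE = '''
name = "gdp"
license = "PDDL-1.0"

[[title]]
lang = "en"
title = "Country, Regional and World GDP (Gross Domestic Product)"

[[resources]]
name = "gdp"
url = "https://raw.github.com/datasets/gdp/master/data/gdp.csv"

  [resources.schema]
  format = "csv"

    [[resources.schema.fields]]
    name = "Year"
    type = "date"
'''


def toml_to_json(text):
    """Parse TOML metadata and serialize it as indented JSON."""
    data = tomllib.loads(text)
    # ensure_ascii=False keeps Farsi text readable instead of \u escapes.
    return json.dumps(data, indent=2, ensure_ascii=False)


print(toml_to_json(SAMPLE))
```

Each `[[...]]` header becomes one element of a JSON array, which is why `title` and `resources` appear as lists in the output.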

menayat commented 7 years ago

@kamicut Thanks for TOML and JSON examples. It seems the following fields are missing:

release date
period of time
update frequency
maintainer

Otherwise looks great!

kamicut commented 7 years ago

@menayat so I called the release date created_at, but we can change it to release_date if you think that's clearer.

What should period of time be? Same date format but as an array? [2014-07-27T21:36Z, 2016-07-27T21:36Z]
For update frequency, do you envision a frequency shorter than 1 day? We can then have this as an integer and express it as "update frequency in days"
Maintainer: is this a name, email, or both?

menayat commented 7 years ago

@kamicut created_at is fine. Would it be possible to have text fields for the period of time and update frequency? They will be descriptive: we might not have a detailed date for the period of time; we might only have the year. And the update frequency will be monthly, quarterly, yearly, or N/A.

kamicut commented 7 years ago

That sounds fine @menayat. I was thinking more about whether we will want automatic processes that fetch or regenerate the data at regular intervals in the future, but we can revisit the question later.
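If automatic refreshes are revisited, descriptive frequencies can still be mapped to intervals. A minimal sketch, assuming the allowed labels are exactly the ones named above; the day counts are rough illustrative values and `refresh_interval_days` is a hypothetical helper.

```python
# Assumption: the labels come from the thread; the day counts are
# approximate and not part of the spec.
FREQUENCY_DAYS = {
    "monthly": 30,
    "quarterly": 91,
    "yearly": 365,
    "N/A": None,
}


def refresh_interval_days(update_frequency):
    """Return an approximate refresh interval in days, or None for "N/A"."""
    if update_frequency not in FREQUENCY_DAYS:
        raise ValueError(f"unknown update frequency: {update_frequency!r}")
    return FREQUENCY_DAYS[update_frequency]
```

Keeping the field as text in the metadata and doing the mapping in code leaves the spec readable while still allowing scheduling later.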

menayat commented 7 years ago

@kamicut While we were working on completing the metadata for the first dataset and checking the site design, we noticed a few issues related to the metadata:

We need to add the category to metadata. We have used it in the design but it is not in the metadata.
We are a bit confused about the Source field. We think that Source is different from the Resources field: Source is the link to the original data source, while Resources is an array of data objects for the cleaned data. Is this a correct assumption? If so, we need to add Source to the metadata as well.

Thanks!

kamicut commented 7 years ago

@menayat

For category, my initial thinking is that it would be in the "keywords" list, in case there are multiple "tags". However if you think this isn't sufficient we could have a predefined list of categories.
For source we can implement the "sources" keyword in the Data Package spec. It has name and web for the URL. Should this be on the dataset level or the resource level?

menayat commented 7 years ago

@kamicut we are aiming to fit each dataset into a particular category. However, there will be odd cases. I was going to suggest having the first keyword as the primary category, but what if there is more than one category? I think it might be easier to have a field dedicated to category, with the understanding that there might be more than one category.

On the sources: conceptually we have been thinking of one data package per entry, or at least starting this way and later deciding whether to merge. Initially it will be easier for us to frame the metadata around one particular dataset. So if you are happy with it, let's move the sources to the resource level.

kamicut commented 7 years ago

@menayat if we are moving the sources to the resource level, I think we can remove the "homepage" from being a required tag, I don't see any use for it.

menayat commented 7 years ago

@kamicut Thanks for putting together the spec. It looks fine. A quick question: am I correct to assume that the fields indexed_at and updated_at are automatically populated by the validator? If so, it'd be great to refer to it in the index.

kamicut commented 7 years ago

@menayat they're not added automatically right now. We could maybe look into having indexed_at being automatic, but I think updated_at refers to the ingest process and should be updated whenever there's new data.

Since the validator doesn't see the data and data updates could happen manually, I think this field should be manually updated.

menayat commented 7 years ago

@kamicut I have a question regarding updated_at: it sits at the metadata level, so shouldn't it reflect the last time there was any change to the dataset metadata?

kamicut commented 7 years ago

@menayat you're right, same as update frequency. They could be at either level, depending on whether you're describing the project or the individual resources.

menayat commented 7 years ago

@kamicut My suggestion is to have both of them timestamped automatically. But I leave the final decision to you.

kamicut commented 7 years ago

@menayat I've added this in #14. Now the timestamps are generated automatically:

indexed_at will be added when there's a new dataset
updated_at will be updated when the metadata changes
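The behaviour described here could look roughly like the sketch below. It is not the code from #14: the `stamp` helper, the `previous` argument (the last stored version of the metadata, if any), and setting updated_at on first indexing are all assumptions made for illustration.

```python
from datetime import datetime, timezone

TIMESTAMP_KEYS = ("indexed_at", "updated_at")


def _without_timestamps(metadata):
    """Copy the metadata without the automatically generated fields."""
    return {k: v for k, v in metadata.items() if k not in TIMESTAMP_KEYS}


def stamp(metadata, previous=None, now=None):
    """Return a copy of metadata with indexed_at and updated_at filled in."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.isoformat()
    content = _without_timestamps(metadata)
    stamped = dict(content)
    if previous is None:
        # New dataset: both timestamps start at the indexing time (assumption).
        stamped["indexed_at"] = timestamp
        stamped["updated_at"] = timestamp
    else:
        # Existing dataset: keep indexed_at, bump updated_at only on change.
        stamped["indexed_at"] = previous["indexed_at"]
        changed = content != _without_timestamps(previous)
        stamped["updated_at"] = timestamp if changed else previous["updated_at"]
    return stamped
```

Comparing the metadata with the timestamps stripped out avoids a loop where writing updated_at itself counts as a change.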

SoniaAmini commented 7 years ago

@kamicut I was wondering if we can make the following changes?

1. Make 'period' and 'frequency' optional in the metadata.

2. Please allow for Persian keywords to be added, because adding two languages in the current keyword field will result in Persian keywords showing up on the English site and vice versa.

3. How can we allocate a dataset to more than one category?

4. Can you please add more information on the validation process and the related error messages to the spec?

5. For the temporal coverage in the metadata, can we change the date from the Gregorian to the Iranian calendar? Please also update the spec.

Many thanks!

kamicut commented 7 years ago

@SoniaAmini

1. Period and frequency are optional now.
2. I'll look into this; it might change quite a bit of the data pipeline.
3. Per the above discussion with @menayat (see Aug 30), we can add other categories in the keywords, but there is always a "primary" category.
4. We're using automatic validation messages from JSON schema. I'll see if there's a library to improve error readability.
5. I'm not sure what you mean; the temporal coverage can be any two numbers, so it should work with the Iranian calendar.

SoniaAmini commented 7 years ago

@kamicut Thank you very much for addressing these issues. On issue 5, we meant:

a) The date range on the data page (two drop-downs). We assumed they are connected to the spec, which they are not. We will report this issue separately.
b) Can you update the spec to reflect that the temporal coverage is based on the Iranian calendar?

Thanks!

kamicut commented 7 years ago

@SoniaAmini On the Iran Open Data website, the date range is automatically created from the minimum and maximum dates it finds in the catalog. So if the datasets all fall in the Iranian calendar dates, it will automatically change the dropdowns. I'll update the spec to reflect that the temporal coverage should use the Iranian calendar.
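The min/max logic can be sketched as follows. This assumes each dataset stores its temporal coverage as a two-number `period` field (for example, Iranian calendar years); the field name and the `catalog_date_range` helper are illustrative, not taken from the site's code.

```python
def catalog_date_range(datasets):
    """Return (earliest, latest) across all datasets that declare a period."""
    # Assumption: "period" is a [start, end] pair; datasets without it are skipped.
    periods = [d["period"] for d in datasets if d.get("period")]
    if not periods:
        return None
    earliest = min(start for start, _ in periods)
    latest = max(end for _, end in periods)
    return (earliest, latest)
```

Because the range is derived purely from the numbers in the catalog, the dropdowns follow whichever calendar the datasets use, which is why mixing Gregorian and Iranian years in one catalog would widen the range misleadingly.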

iranopendata / catalog

Spec #1
