iranopendata / catalog

A catalog of public datasets

Spec #1

Closed kamicut closed 6 years ago

kamicut commented 7 years ago

A dataset is added to the catalog as a TOML file with metadata fields. The spec is adapted from Data Packages for interoperability. This is a good example of a data package, and OKFN has built a lot of tools in the ecosystem. TOML was chosen for readability.

Metadata

If the format is tabular (CSV), these are additional fields:

name = "gdp"
title = "Country, Regional and World GDP (Gross Domestic Product)"
description = "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
license = "PDDL-1.0"
keywords = [ "GDP", "World", "Gross Domestic Product", "Time series"]

[[resources]]
  name = "gdp"
  url = "https://raw.github.com/datasets/gdp/master/data/gdp.csv"
  format = "csv"

  [[resources.schema]]
  name = "Country Name"
  type = "string"

  [[resources.schema]]
  name = "Country Code"
  type = "string"

  [[resources.schema]]
  name = "Year"
  type = "date"

  [[resources.schema]]
  name = "Value"
  type = "number"

[[resources]]
  name = "gdp-market-prices"
  url = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=xml"
  title = "GDP at market prices (current US$)"
  format = "XML"

We should define which fields are required and which are optional. Are there any more fields that we might need?
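
To make the required/optional question concrete, here is a minimal sketch of a field check. The split into `REQUIRED` and `OPTIONAL` below is purely hypothetical (nothing has been decided at this point in the thread), and `check_fields` is an illustrative helper, not part of any existing tool.

```python
# Hypothetical split between required and optional fields. The thread has not
# settled this, so both sets are assumptions for illustration only.
REQUIRED = {"name", "title", "license", "resources"}
OPTIONAL = {"description", "keywords"}


def check_fields(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    present = set(metadata)
    problems = []
    for field in sorted(REQUIRED - present):
        problems.append(f"missing required field: {field}")
    for field in sorted(present - REQUIRED - OPTIONAL):
        problems.append(f"unknown field: {field}")
    return problems


example = {"name": "gdp", "title": "GDP", "license": "PDDL-1.0", "homepage": "x"}
print(check_fields(example))
```

Whatever split we settle on, keeping it in one place like this would make it easy to change later.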

menayat commented 7 years ago

I suggest adding the following fields to the metadata:

I also suggest keeping the dates in the metadata in the Gregorian calendar. The dates inside the datasets can be in the Persian calendar.

In terms of bilingual fields, all the fields will be bilingual apart from:

Metadata

Resource Information

kamicut commented 7 years ago

@menayat thanks for contributing, those sound like great additions. I have a few questions that I've been thinking about:

menayat commented 7 years ago

@kamicut Thanks for reviewing and for the questions. Please see my responses below:

"titles": {
  "en": "dataset title",
  "fa": "عنوان داده"
}
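
As a rough illustration of how a site could read such a bilingual field, here is a small sketch. The `pick_title` helper and the fallback-to-English behavior are assumptions, not something agreed in this thread.

```python
def pick_title(titles: dict, lang: str, fallback: str = "en") -> str:
    """Return the title for `lang`, or the `fallback` language if it is absent."""
    # Falling back to English is an assumed policy, chosen only for the example.
    return titles.get(lang) or titles[fallback]


titles = {"en": "dataset title", "fa": "عنوان داده"}
print(pick_title(titles, "fa"))
print(pick_title(titles, "de"))  # no German entry, so the English title is used
```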

And one question:

menayat commented 7 years ago

@kamicut An afterthought: we don't need a simple URL. The "title" field will do the job.

kamicut commented 7 years ago

@menayat Ah, I get it now; yes, we can generate the URL ourselves using the title or some other identifier. I was thinking that the author would be the name of the organization providing the dataset, distinct from the organization's website.

menayat commented 7 years ago

@kamicut So can we say that the web in the author field is a reference to the organization's website and the homepage is the link to the original dataset?

kamicut commented 7 years ago

@menayat correct, I thought that those could potentially be two different websites; however I'm not sure if the case arises in practice. Is it your opinion that we should simplify it and keep it one URL?

menayat commented 7 years ago

@kamicut Yes, that will be highly unlikely. We can keep the author field to capture just the name, and have the link to the dataset in the homepage.

kamicut commented 7 years ago

Here's an updated spec example using these additions. The title will be bilingual, but 'name' will be the short title used in the URL, so it doesn't need to be bilingual. I kept keywords, license and author in a single language for simplicity, but we can change that too. If a resource is a CSV, I kept the header fields in a single language as well.

Human Readable TOML:

license = "PDDL-1.0"
keywords = [ "GDP", "World", "Gross Domestic Product", "Time series"]
created_at = "2016-07-27T21:36:26.161Z"
updated_at = "2016-07-27T21:36:26.161Z"
author = "Organization Name"
homepage = "http://example.com/dataset/"

name = "gdp"

[[title]]
lang = "en"
title = "Country, Regional and World GDP (Gross Domestic Product)"

[[title]]
lang = "fa"
title = "Country, Regional and World GDP (Gross Domestic Product)"

[[description]]
lang = "fa"
description = "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."

[[description]]
lang = "en"
description = "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."

[[resources]]

  url = "https://raw.github.com/datasets/gdp/master/data/gdp.csv"
  name = "gdp"

  [[resources.title]]
  lang = "en"
  title = "Country, Regional and World GDP (Gross Domestic Product)"

  [[resources.title]]
  lang = "fa"
  title = "Country, Regional and World GDP (Gross Domestic Product)"

  [resources.schema]
  format = "csv"

      [[resources.schema.fields]]
      name = "Country Name"
      type = "string"

      [[resources.schema.fields]]
      name = "Country Code"
      type = "string"

      [[resources.schema.fields]]
      name = "Year"
      type = "date"

      [[resources.schema.fields]]
      name = "Value"
      type = "number"

[[resources]]
  name = "gdp-market-prices"
  url = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=xml"

  [[resources.title]]
  lang = "fa"
  text = "GDP at market prices (current US$)"

  [[resources.title]] 
  lang = "en"
  text = "GDP at market prices (current US$)"

  [resources.schema]
  format = "XML"

This will generate a JSON like this:

{
  "license": "PDDL-1.0",
  "keywords": [
    "GDP",
    "World",
    "Gross Domestic Product",
    "Time series"
  ],
  "created_at": "2016-07-27T21:36:26.161Z",
  "updated_at": "2016-07-27T21:36:26.161Z",
  "author": "Organization Name",
  "homepage": "http://example.com/dataset/",
  "name": "gdp",
  "title": [
    {
      "lang": "en",
      "title": "Country, Regional and World GDP (Gross Domestic Product)"
    },
    {
      "lang": "fa",
      "title": "Country, Regional and World GDP (Gross Domestic Product)"
    }
  ],
  "description": [
    {
      "lang": "fa",
      "description": "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
    },
    {
      "lang": "en",
      "description": "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
    }
  ],
  "resources": [
    {
      "url": "https://raw.github.com/datasets/gdp/master/data/gdp.csv",
      "name": "gdp",
      "title": [
        {
          "lang": "en",
          "title": "Country, Regional and World GDP (Gross Domestic Product)"
        },
        {
          "lang": "fa",
          "title": "Country, Regional and World GDP (Gross Domestic Product)"
        }
      ],
      "schema": {
        "format": "csv",
        "fields": [
          {
            "name": "Country Name",
            "type": "string"
          },
          {
            "name": "Country Code",
            "type": "string"
          },
          {
            "name": "Year",
            "type": "date"
          },
          {
            "name": "Value",
            "type": "number"
          }
        ]
      }
    },
    {
      "name": "gdp-market-prices",
      "url": "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=xml",
      "title": [
        {
          "lang": "fa",
          "text": "GDP at market prices (current US$)"
        },
        {
          "lang": "en",
          "text": "GDP at market prices (current US$)"
        }
      ],
      "schema": {
        "format": "XML"
      }
    }
  ]
}

I haven't written the text fields in Farsi because I don't have a Farsi keyboard 😄 , but since the spec is just text it works. cc @menayat

menayat commented 7 years ago

@kamicut Thanks for the TOML and JSON examples. It seems the following fields are missing:

Otherwise looks great!

kamicut commented 7 years ago

@menayat so I called the release date created_at, but we can change it to release_date if you think that's clearer.

menayat commented 7 years ago

@kamicut created_at is fine. Would it be possible to have text fields for the period of time and the update frequency? They will be descriptive: we might not have a detailed date for the period of time, only the year. The update frequency will be one of monthly, quarterly, yearly, or N/A.
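
A minimal sketch of how these two descriptive fields could be checked, assuming they are named `period` and `frequency` and that the frequency values are exactly the four listed above (both are assumptions for illustration):

```python
# Allowed values taken from the list above; the exact spelling is an assumption.
ALLOWED_FREQUENCIES = {"monthly", "quarterly", "yearly", "N/A"}


def check_descriptive_fields(metadata: dict) -> list:
    """Return problems with the free-text period and the update frequency."""
    problems = []
    period = metadata.get("period")
    # The period stays free text (it may be just a year), so only its type is checked.
    if period is not None and not isinstance(period, str):
        problems.append("period must be text")
    frequency = metadata.get("frequency")
    if frequency is not None and frequency not in ALLOWED_FREQUENCIES:
        problems.append(f"unknown frequency: {frequency}")
    return problems


print(check_descriptive_fields({"period": "2015", "frequency": "weekly"}))
```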

kamicut commented 7 years ago

That sounds fine @menayat. I was thinking more about automatic processes that might fetch or regenerate the data at regular intervals in the future, but we can revisit the question later.

menayat commented 7 years ago

@kamicut While we were working on completing the metadata for the first dataset and checking the site design, we noticed a few issues related to the metadata:

Thanks!

kamicut commented 7 years ago

@menayat

menayat commented 7 years ago

@kamicut we are aiming to fit each dataset into a particular category, although there will be odd cases. I was going to suggest using the first keyword as the primary category, but what if there is more than one category? I think it might be easier to have a field dedicated to category, with the understanding that there might be more than one.

On the sources: conceptually, we have been thinking of one data package per entry, or at least starting this way and deciding later whether to merge. Initially it will be easier for us to frame the metadata around one particular dataset. So if you are happy with it, let's move the sources to the resource level.

kamicut commented 7 years ago

@menayat if we are moving the sources to the resource level, I think we can remove the "homepage" from being a required tag, I don't see any use for it.

menayat commented 7 years ago

@kamicut Thanks for putting together the spec. It looks fine. A quick question: am I correct to assume that the indexed_at and updated_at fields are automatically populated by the validator? If so, it'd be great to mention that in the index.

kamicut commented 7 years ago

@menayat they're not added automatically right now. We could maybe look into having indexed_at being automatic, but I think updated_at refers to the ingest process and should be updated whenever there's new data.

Since the validator doesn't see the data and data updates could happen manually, I think this field should be manually updated.

menayat commented 7 years ago

@kamicut I have a question regarding updated_at: since it sits at the metadata level, shouldn't it reflect the last time there was any change to the dataset metadata?

kamicut commented 7 years ago

@menayat you're right, same as update frequency. They could be at either level, depending on whether you're describing the project or the individual resources.

menayat commented 7 years ago

@kamicut My suggestion is to have both of them timestamped automatically. But I leave the final decision to you.
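
To illustrate what automatic timestamping could look like, here is a rough sketch. The `stamp` helper is hypothetical, and so is the policy it encodes (set `indexed_at` once, refresh `updated_at` on every run); the timestamp format follows the `created_at` example earlier in the thread.

```python
from datetime import datetime, timezone
from typing import Optional


def stamp(metadata: dict, now: Optional[datetime] = None) -> dict:
    """Return a copy of `metadata` with indexed_at and updated_at filled in."""
    now = now or datetime.now(timezone.utc)
    # ISO 8601 in UTC with millisecond precision, e.g. 2016-07-27T21:36:26.161Z
    iso = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    stamped = dict(metadata)
    stamped.setdefault("indexed_at", iso)  # assumed: set only the first time
    stamped["updated_at"] = iso            # assumed: refreshed on every run
    return stamped


print(stamp({"name": "gdp"}))
```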

kamicut commented 7 years ago

@menayat I've added this in #14. Now the timestamps are generated automatically:

SoniaAmini commented 7 years ago

@kamicut I was wondering if we can make the following changes?

1. Make 'period' and 'frequency' optional in the metadata.

2. Please allow Persian keywords to be added, because adding two languages in the current keyword field will result in Persian keywords showing up on the English site and vice versa.

3. How can we allocate a dataset to more than one category?

4. Can you please add more information on the validation process and the related error messages to the spec?

5. For the temporal coverage in the metadata, can we change the date from the Gregorian to the Iranian calendar? Please also update the spec.

Many thanks!

kamicut commented 7 years ago

@SoniaAmini

  1. Period and frequency are optional now.
  2. I'll look into this; it might change quite a bit of the data pipeline.
  3. Per the above discussion with @menayat (see Aug 30), we can add other categories in the keywords, but there is always a "primary" category.
  4. We're using automatic validation messages from JSON schema. I'll see if there's a library to improve error readability.
  5. I'm not sure what you mean; the temporal coverage can be any two numbers, so it should work with the Iranian calendar.

SoniaAmini commented 7 years ago

@kamicut Thank you very much for addressing these issues. On the issue 5, we meant

kamicut commented 7 years ago

@SoniaAmini On the Iran Open Data website, the date range is automatically created from the minimum and maximum dates it finds in the catalog. So if the datasets all fall in the Iranian calendar dates, it will automatically change the dropdowns. I'll update the spec to reflect that the temporal coverage should use the Iranian calendar.
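
A minimal sketch of the min/max logic described above, assuming each dataset stores its temporal coverage as a two-number `period` (start, end). The field name, the list shape, and the sample years are assumptions for illustration; this is not the website's actual code.

```python
def catalog_date_range(datasets: list):
    """Return (earliest start, latest end) over datasets that declare a period."""
    # Period is optional, so datasets without one are skipped.
    periods = [d["period"] for d in datasets if d.get("period")]
    if not periods:
        return None
    return min(p[0] for p in periods), max(p[1] for p in periods)


# Illustrative catalog with Iranian-calendar years.
catalog = [
    {"name": "a", "period": [1390, 1394]},
    {"name": "b", "period": [1385, 1392]},
    {"name": "c"},
]
print(catalog_date_range(catalog))  # (1385, 1394)
```

Because the range is derived from whatever numbers are in the catalog, mixing Gregorian and Iranian years would produce a misleading range, which is a reason to keep a single calendar in the spec.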