Closed kamicut closed 6 years ago
I suggest adding the following fields to the metadata:
I also suggest keeping the dates in the metadata in the Gregorian calendar. The dates inside the datasets can be in the Persian calendar.
In terms of bilingual fields, all the fields will be bilingual apart from:
Metadata
Resource Information
@menayat thanks for contributing, those sound like great additions. I have a few questions that I've been thinking about:
title_en and title_fa. What are your thoughts on having a default key such as title, and what language should that be in? We can also not have a default key, and have the validation enforce that there should be a language suffix (_en) attached to those keys.
@kamicut Thanks for reviewing and the questions. Please see my responses below:
"titles": {
"en": "dataset title",
"fa": "عنوان داده"
}
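With the nested layout above, picking a title for a given language is a plain dictionary lookup. A minimal Python sketch, assuming a hypothetical helper `pick_title` and a fallback to English that the thread itself does not specify:

```python
def pick_title(metadata, lang, fallback="en"):
    """Return the title in `lang`, else in `fallback`, else None."""
    titles = metadata.get("titles", {})
    return titles.get(lang) or titles.get(fallback)


# Example metadata using the "titles" layout from the comment above.
metadata = {"titles": {"en": "dataset title", "fa": "عنوان داده"}}

print(pick_title(metadata, "fa"))  # Persian title
print(pick_title(metadata, "de"))  # no German entry, falls back to English
```

The same lookup with suffixed keys (title_en, title_fa) would need string concatenation to build the key, which is one argument for the nested form.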
And one question:
@kamicut An afterthought: we don't need simple url. The field "title" will do the job.
@menayat Ah, I get it now; yes, we can generate the URL ourselves using the title or some other identifier. I was thinking that the author would be the name of the organization providing the dataset, distinct from the organization's website.
@kamicut So can we say that the web in the author field is a reference to the Organization website and the homepage is the link to the original dataset?
@menayat correct, I thought that those could potentially be two different websites; however I'm not sure if the case arises in practice. Is it your opinion that we should simplify it and keep it one URL?
@kamicut Yes, it will be highly unlikely. We can keep the author to just capture the name and have the link to the dataset in the homepage.
Here's an updated spec example using these additions: Title will be bilingual, but 'name' will be the short title that will be used in the URL, so it doesn't need to be bilingual. I kept keywords, license and author as a single language for simplicity but we can change that too. If a resource is a CSV, I kept the header fields as a single language as well.
Human Readable TOML:
license = "PDDL-1.0"
keywords = [ "GDP", "World", "Gross Domestic Product", "Time series"]
created_at = "2016-07-27T21:36:26.161Z"
updated_at = "2016-07-27T21:36:26.161Z"
author = "Organization Name"
homepage = "http://example.com/dataset/"
name = "gdp"
[[title]]
lang = "en"
title = "Country, Regional and World GDP (Gross Domestic Product)"
[[title]]
lang = "fa"
title = "Country, Regional and World GDP (Gross Domestic Product)"
[[description]]
lang = "fa"
description = "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
[[description]]
lang = "en"
description = "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
[[resources]]
url = "https://raw.github.com/datasets/gdp/master/data/gdp.csv"
name = "gdp"
[[resources.title]]
lang = "en"
title = "Country, Regional and World GDP (Gross Domestic Product)"
[[resources.title]]
lang = "fa"
title = "Country, Regional and World GDP (Gross Domestic Product)"
[resources.schema]
format = "csv"
[[resources.schema.fields]]
name = "Country Name"
type = "string"
[[resources.schema.fields]]
name = "Country Code"
type = "string"
[[resources.schema.fields]]
name = "Year"
type = "date"
[[resources.schema.fields]]
name = "Value"
type = "number"
[[resources]]
name = "gdp-market-prices"
url = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=xml"
[[resources.title]]
lang = "fa"
text = "GDP at market prices (current US$)"
[[resources.title]]
lang = "en"
text = "GDP at market prices (current US$)"
[resources.schema]
format = "XML"
This will generate a JSON like this:
{
  "license": "PDDL-1.0",
  "keywords": [
    "GDP",
    "World",
    "Gross Domestic Product",
    "Time series"
  ],
  "created_at": "2016-07-27T21:36:26.161Z",
  "updated_at": "2016-07-27T21:36:26.161Z",
  "author": "Organization Name",
  "homepage": "http://example.com/dataset/",
  "name": "gdp",
  "title": [
    {
      "lang": "en",
      "title": "Country, Regional and World GDP (Gross Domestic Product)"
    },
    {
      "lang": "fa",
      "title": "Country, Regional and World GDP (Gross Domestic Product)"
    }
  ],
  "description": [
    {
      "lang": "fa",
      "description": "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
    },
    {
      "lang": "en",
      "description": "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV."
    }
  ],
  "resources": [
    {
      "url": "https://raw.github.com/datasets/gdp/master/data/gdp.csv",
      "name": "gdp",
      "title": [
        {
          "lang": "en",
          "title": "Country, Regional and World GDP (Gross Domestic Product)"
        },
        {
          "lang": "fa",
          "title": "Country, Regional and World GDP (Gross Domestic Product)"
        }
      ],
      "schema": {
        "format": "csv",
        "fields": [
          {
            "name": "Country Name",
            "type": "string"
          },
          {
            "name": "Country Code",
            "type": "string"
          },
          {
            "name": "Year",
            "type": "date"
          },
          {
            "name": "Value",
            "type": "number"
          }
        ]
      }
    },
    {
      "name": "gdp-market-prices",
      "url": "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=xml",
      "title": [
        {
          "lang": "fa",
          "text": "GDP at market prices (current US$)"
        },
        {
          "lang": "en",
          "text": "GDP at market prices (current US$)"
        }
      ],
      "schema": {
        "format": "XML"
      }
    }
  ]
}
I haven't written the text fields in Farsi because I don't have a Farsi keyboard 😄 , but since the spec is just text it works. cc @menayat
@kamicut Thanks for TOML and JSON examples. It seems the following fields are missing:
Otherwise looks great!
@menayat so I called the release date created_at, but we can change it to release_date if you think that's clearer.
@kamicut created_at is fine. Would it be possible to have text fields for the period of time and the update frequency? They will be descriptive: we might not have a detailed date for the period of time; we might only have the year. The update frequency will be one of monthly, quarterly, yearly, or N/A.
That sounds fine @menayat. I was thinking more about whether we would want automatic processes that fetch or regenerate the data at regular intervals in the future; however, we can revisit the question later.
@kamicut While we were working on completing the metadata for the first dataset and checking the site design, we noticed a few issues related to the metadata:
Thanks!
@menayat
name and web for the URL. Should this be on the dataset level or the resource level?
@kamicut we are aiming to fit each dataset into a particular category. However, there will be odd cases. I was going to suggest having the first keyword as the primary category. However, what if there is more than one category? I think it might be easier to have a field dedicated to category, with the understanding that there might be more than one category.
On the sources, conceptually we have been thinking of one data package per entry, or at least starting this way and then later deciding whether to merge. Initially it will be easier for us to frame the metadata around one particular dataset. So if you are happy with it, let's move the sources to the resource level.
@menayat if we are moving the sources to the resource level, I think we can remove "homepage" from being a required tag; I don't see any use for it.
@kamicut Thanks for putting together the spec. It looks fine. A quick question: am I correct to assume that the fields indexed_at and updated_at are automatically populated by the validator? If so, it'd be great to refer to it in the index.
@menayat they're not added automatically right now. We could maybe look into having indexed_at being automatic, but I think updated_at refers to the ingest process and should be updated whenever there's new data.
Since the validator doesn't see the data and data updates could happen manually, I think this field should be manually updated.
@kamicut I have a question regarding updated_at: it sits at the metadata level, so shouldn't it reflect the last time there was any change to the dataset metadata?
@menayat you're right, same as update frequency. They could be at either level, depending on whether you're describing the project or the individual resources.
@kamicut My suggestion is to have both of them timestamped automatically. But I leave the final decision to you.
@menayat I've added this in #14. Now the timestamps are generated automatically:
@kamicut I was wondering if we can make the following changes?
1- Make 'period' and 'frequency' optional in the metadata.
2- Please allow Persian keywords to be added separately, because adding two languages in the current keyword field will result in Persian keywords showing up on the English site and vice versa.
3- How can we allocate a dataset to more than one category?
4- Can you please add more information on the validation process and the related error messages to the spec?
5- For the temporal coverage in the metadata, can we change the date from the Gregorian to the Iranian calendar? Please also update the spec.
Many thanks!
@SoniaAmini
@kamicut Thank you very much for addressing these issues. On issue 5, we meant:
b) Can you update the spec to reflect that the temporal coverage is based on the Iranian calendar?
Thanks!
@SoniaAmini On the Iran Open Data website, the date range is automatically created from the minimum and maximum dates it finds in the catalog. So if the datasets all fall in the Iranian calendar dates, it will automatically change the dropdowns. I'll update the spec to reflect that the temporal coverage should use the Iranian calendar.
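The automatic date range described above amounts to taking the minimum and maximum over the catalog's temporal coverage. A minimal sketch, assuming each dataset carries a hypothetical `period` list of Iranian-calendar years (the actual field layout is not shown in this thread):

```python
def coverage_range(datasets):
    """Return (earliest, latest) year across all datasets, or None if none have a period."""
    years = [year for d in datasets for year in d.get("period", [])]
    if not years:
        return None
    return min(years), max(years)


# Hypothetical catalog entries; 'period' is optional, as requested in point 1 above.
catalog = [
    {"name": "gdp", "period": [1380, 1394]},
    {"name": "population", "period": [1375, 1390]},
    {"name": "no-period"},
]

print(coverage_range(catalog))
```

Because the range is derived from the data, no calendar conversion is needed as long as every dataset uses the same calendar.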
A dataset is added to the catalog as a TOML file with metadata fields. The spec is adapted from Data Packages for interoperability. This is a good example of a data package, and OKFN has built a lot of tools in the ecosystem. TOML is chosen for readability.
Metadata
name: alphanumeric string, can contain -, _, .
resources: array of data objects that follow the resource information spec below
title: short sentence that describes the dataset
description: a longer description of the dataset
keywords: an array of keywords
license: string that indicates the license
author: object that can contain the fields name, email, web
homepage: URL to the dataset's web site

Resource Information
url: URL to the source
name: alphanumeric string, can contain -, _, .
title: short sentence that describes the resource
description: a longer description of the resource
format: the file extension

If the format is tabular (CSV), these are additional fields:
sample_url: a sample of the resource
schema: an array of objects that have the keys "name" and "type" (we can make this more complete if we follow the JSON table schema)

Example
We should define which fields are required and which are optional. Are there any more fields that we might need?
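As a starting point for that discussion, a validator for the draft spec could look roughly like the sketch below. The required-field sets are placeholders chosen for illustration, since the question of what is required is still open; only the rule for name (alphanumeric plus -, _, .) comes from the spec text.

```python
import re

# 'name' rule from the draft spec: alphanumeric, may contain '-', '_', '.'
NAME_RE = re.compile(r"[A-Za-z0-9._-]+")

# Hypothetical required sets; the thread has not decided these yet.
REQUIRED_METADATA = ("name", "title", "resources")
REQUIRED_RESOURCE = ("name", "url", "format")


def validate(package):
    """Return a list of human-readable error strings (empty if valid)."""
    errors = []
    for key in REQUIRED_METADATA:
        if key not in package:
            errors.append(f"missing metadata field: {key}")
    name = package.get("name")
    if name is not None and not NAME_RE.fullmatch(str(name)):
        errors.append(f"invalid name: {name!r}")
    for i, res in enumerate(package.get("resources", [])):
        for key in REQUIRED_RESOURCE:
            if key not in res:
                errors.append(f"resources[{i}]: missing field: {key}")
        res_name = res.get("name")
        if res_name is not None and not NAME_RE.fullmatch(str(res_name)):
            errors.append(f"resources[{i}]: invalid name: {res_name!r}")
    return errors


good = {
    "name": "gdp",
    "title": "Country, Regional and World GDP",
    "resources": [{"name": "gdp", "url": "http://example.com/gdp.csv", "format": "csv"}],
}
bad = {"name": "gdp data", "resources": [{"name": "gdp"}]}

print(validate(good))
for message in validate(bad):
    print(message)
```

Returning a list of messages rather than raising on the first problem lets a contributor fix every issue in one pass.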