
Provide metadata inside each dataset #24

Closed · larjohn closed this issue 7 years ago

larjohn commented 8 years ago

Can we consistently provide metadata inside our datasets?

Such metadata would be:

  1. Author / contact point name (so that it can be displayed in the OpenSpending Viewer)
  2. Friendly dataset name (also good for the OpenSpending Viewer)
  3. Region/Organization URI (will be used later for mapping and normalizations)
  4. Anything else you suggest...

You can find sample implementations in the Thessaloniki and Athens datasets.
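
For illustration, a minimal Turtle sketch of what such in-dataset metadata could look like, attached directly to the qb:DataSet. All IRIs and literals below are hypothetical (not taken from the actual Greek datasets), and dcterms:spatial is just one possible property for the region/organization URI:

```turtle
@prefix qb:      <http://purl.org/linked-data/cube#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

# Hypothetical dataset IRI carrying the three pieces of metadata listed above.
<http://data.openbudgets.eu/resource/dataset/thessaloniki-2016>
  a qb:DataSet ;
  dcterms:title "Municipality of Thessaloniki budget 2016" ;    # friendly name
  dcterms:creator [ a foaf:Person ; foaf:name "Jane Doe" ] ;    # author / contact point
  dcterms:spatial <http://dbpedia.org/resource/Thessaloniki> .  # region/organization URI
```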

jindrichmynarz commented 8 years ago

We currently store metadata generated by the e-datasetMetadata and e-distributionMetadata DPUs in a separate named graph. @jakubklimek, can you please comment on this alternative solution?

jakubklimek commented 8 years ago

OK, I don't see what is wrong with using the existing metadata DPUs. Both the metadata in the mentioned datasets and the requirements listed in this issue are covered by those.

jindrichmynarz commented 8 years ago

I think this is mostly a matter of packaging data. While we use the metadata DPUs to produce a separate named graph, the Greek datasets mentioned by @larjohn directly contain metadata.

The proposals are also compatible only if we use the qb:DataSet's IRI as the IRI of its named graph. This may be limiting in case we want to have multiple instances of qb:DataSet in one dataset corresponding to one named graph. Given that we haven't encountered this case (unless I overlooked it), I think we can postpone this concern and adopt a guideline that the qb:DataSet's IRI is used as the IRI of its named graph.
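
A TriG sketch of that guideline, with a hypothetical IRI, where the same IRI names both the qb:DataSet and the named graph holding it:

```trig
@prefix qb:      <http://purl.org/linked-data/cube#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# The named graph IRI equals the qb:DataSet IRI (hypothetical example).
<http://data.openbudgets.eu/resource/dataset/example> {
  <http://data.openbudgets.eu/resource/dataset/example>
    a qb:DataSet ;
    dcterms:title "Example budget dataset" .
  # ... the dataset's observations live in the same graph ...
}
```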

jakubklimek commented 8 years ago

I think this does not matter much.

As far as I know, this is exactly the case: the IRI of the qb:DataSet matches the IRI of the named graph. With the ESF projects in the Czech Republic, I had one dataset split into 3, and I used this pattern there as well, as can be seen in the pipeline.

jindrichmynarz commented 8 years ago

Well, since @larjohn raised this point, I think it matters to agree on a convention on which OpenBudgets.eu visualization tools can rely.

jakubklimek commented 8 years ago

I think visualization tools should not rely on a particular split of the data into named graphs, because that would unnecessarily tie them to some non-standardized convention. They should work with all triples in the datastore, querying for the data they need, and therefore I think it does not matter whether the data is in one graph or in two.

jindrichmynarz commented 8 years ago

Right. I didn't mean agreeing on a convention for splitting data into named graphs, but on a convention for attaching metadata. The metadata DPUs attach metadata to the IRI of the dataset's named graph. In the case of the Greek datasets, metadata is attached to instances of qb:DataSet. If we adopt the convention that the qb:DataSet's IRI is used as the IRI of the dataset's named graph, these two conventions have the same effect. The only problem I see with this convention is that it does not specify what happens when there are multiple instances of qb:DataSet in a dataset. I expected this to happen rarely, but you said you needed it for the Czech ESF data.

jakubklimek commented 8 years ago

Actually, I said I avoided it in the Czech data by splitting it into 3 data graphs (and 3 metadata graphs), and I think this can be done every time: split the data into graphs according to the qb:DataSets, which seems only natural, since from the DCV point of view they are also different datasets.

jindrichmynarz commented 8 years ago

I see. I misread your previous comment. Splitting datasets per qb:DataSet into multiple named graphs definitely works. I'm only afraid of the extra work it incurs. Each loader and metadata DPU must be configured separately for each named graph.

Ultimately, this issue leads to a need to maintain a document with the conventions we adopted for the datasets of OpenBudgets.eu. Do you have a preference for how to maintain these conventions?

jakubklimek commented 8 years ago

Well, the question remains whether this needs to be covered by a convention or whether it does not matter. In my opinion, the conventions should only contain items that, when not followed, break some functionality or user-friendliness.

If a convention for this is to be established, I of course prefer our approach, as it is always easier to merge two graphs, if necessary, than to split one.

The document with the conventions is a Google Doc, so it can be amended as necessary, right?

jindrichmynarz commented 8 years ago

Prompted by @larjohn mentioning metadata regarding interoperability, I'd like to resolve this issue. Would it be OK if we fleshed out the recommended metadata section in D1.5? Alternatively, we can also add validation of the required metadata into openbudgets/pipeline-fragments. @larjohn, is there any metadata that is a must for your work?

larjohn commented 8 years ago

I am mostly concerned with the metadata currently presented by the OpenSpending viewer:

  1. Human readable title of the dataset
  2. Description
  3. Author(s?) - they've got a name and an email; this could be a foaf:Person for us

Additionally, and most importantly, there is the organization, the type of the organization, and maybe some kind of location, either inside the organization entity or outside of it. An ideal (to me) dataset would contain three more dataset dimensions:

  1. Organization (an org entity)
  2. Organization type (government, region, prefecture, municipality, multi-state union, etc.) - probably a code list. It might be included in the organization entity, although it will also be useful on the dataset itself: for instance, to filter datasets that contain municipal budgets
  3. Location: this is needed to get an exact polygon of the place the budget applies to, so that many datasets can be displayed on a single map

Please excuse me if any of these extras are already included in the schema.

jindrichmynarz commented 8 years ago

The LP-ETL DPU e-datasetMetadata can provide all the metadata for the OS viewer:

  1. Human readable title of the dataset: dcterms:title
  2. Description: dcterms:description
  3. Author: dcterms:creator; however, only an IRI can be provided

I think only dcterms:creator is problematic, and in two ways. First, the DPU allows providing only an IRI for the author. If we need properties describing authors (names, email addresses), then either only dereferenceable IRIs can be used, whose representations must contain the required properties, or the DPU must directly allow filling in those properties. Second, it is unclear who the author is. Is it the person who transformed the dataset? Is it the original government body that published the data? I think a lot of time has already been spent discussing these questions in the context of DCAT and DCAT-AP.

Regarding the organization, obeu-dimension:organization (or its subproperty) is a required dimension of OpenBudgets.eu data. The organization is represented as an org:Organization (or its subclass, e.g., from the Core Public Organization Vocabulary). Organization type can be represented via rov:orgType from the Registered Organization Vocabulary or cpov:classification from the Core Public Organization Vocabulary. Would you suggest a code list to use for organization types? The EU Publications Office offers an organization type code list, but it doesn't seem to fit very well. Location should be a property of the organization, such as org:hasSite or schema:address.
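
A rough Turtle sketch of this shape (all IRIs are hypothetical, the obeu-dimension namespace IRI is my assumption, and the organization-type code list is still to be agreed on):

```turtle
@prefix org:            <http://www.w3.org/ns/org#> .
@prefix rov:            <http://www.w3.org/ns/regorg#> .
@prefix obeu-dimension: <http://data.openbudgets.eu/ontology/dsd/dimension/> . # assumed namespace

# Hypothetical observation carrying the required organization dimension:
<http://data.openbudgets.eu/resource/dataset/example/observation/1>
  obeu-dimension:organization <http://example.org/organization/athens> .

<http://example.org/organization/athens>
  a org:Organization ;
  rov:orgType <http://example.org/code/organization-type/municipality> ; # code list to be decided
  org:hasSite [ a org:Site ] .                                           # location of the organization
```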

Guidance on metadata to be provided with OpenBudgets.eu datasets is already included in D1.4, but it is very generic, so we can provide more concrete recommendations in D1.5.

larjohn commented 8 years ago

I will have a look!

In the meantime, what would you suggest for the time dimension?

jindrichmynarz commented 8 years ago

Do you mean the temporal coverage of a dataset? DCAT recommends using dcterms:temporal for temporal coverage, which links to a dcterms:PeriodOfTime with a start and an end. Given that the suggested representation of the start and end is not RDF, I'd simply use schema:startDate and schema:endDate instead.
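
For example (hypothetical dataset IRI):

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema:  <http://schema.org/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Temporal coverage as a dcterms:PeriodOfTime with schema.org start/end dates.
<http://data.openbudgets.eu/resource/dataset/example>
  dcterms:temporal [
    a dcterms:PeriodOfTime ;
    schema:startDate "2016-01-01"^^xsd:date ;
    schema:endDate   "2016-12-31"^^xsd:date
  ] .
```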

However, temporal coverage can be automatically computed from the dataset, using obeu-dimension:fiscalYear, obeu-dimension:fiscalPeriod, or obeu-dimension:date, so it's probably not necessary to provide it manually in the metadata.

larjohn commented 8 years ago

I was wondering what the expected values for the fiscalYear dimension are.

jindrichmynarz commented 8 years ago

D1.4 recommends using interval:Year as the range of obeu-dimension:fiscalYear. The instances of interval:Year can be drawn from <http://reference.data.gov.uk/id/gregorian-year/{YYYY}>. It seems most datasets in this repository already follow this convention.
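
In an observation this would look as follows (hypothetical observation IRI; the obeu-dimension namespace IRI is my assumption):

```turtle
@prefix obeu-dimension: <http://data.openbudgets.eu/ontology/dsd/dimension/> . # assumed namespace

# The fiscal year points to a Gregorian-year IRI following the {YYYY} pattern.
<http://data.openbudgets.eu/resource/dataset/example/observation/1>
  obeu-dimension:fiscalYear <http://reference.data.gov.uk/id/gregorian-year/2016> .
```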

Some datasets use British years (i.e., <http://reference.data.gov.uk/id/year/{YYYY}>) instead. I must admit I don't know the difference between a Gregorian and a British year. According to Wikipedia, Great Britain adopted the Gregorian calendar a long time ago, so I presume there's no practical difference between the two.

larjohn commented 8 years ago

Before closing this issue, let's make sure each dataset has the following information, according to what has been discussed so far here:

jakubklimek commented 8 years ago

This can all be entered in the existing metadata components in LP-ETL. Currently, I am working on the DCAT-AP v1.1 version of those components (https://github.com/linkedpipes/etl/tree/feature/dcatAp11), which will be even more suitable for this use, and I expect them to be ready by the end of this week.

My recommendation is to wait for them and use them for this task.

larjohn commented 8 years ago

Cool, automating this is the way to go!

jakubklimek commented 8 years ago

As of now, the old metadata components are deprecated and the new DCAT-AP Dataset and DCAT-AP Distribution components should be used. They implement the current DCAT-AP v1.1 specification.

The organization, dataset name, and fiscal period items are quite clear in DCAT-AP v1.1, and I suggest we store the author/uploader of the dataset as a contact point.
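
A sketch of that suggestion in Turtle (hypothetical IRIs and contact details; DCAT-AP represents contact points as vcard:Kind instances linked via dcat:contactPoint):

```turtle
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

# The author/uploader of the dataset stored as its contact point.
<http://data.openbudgets.eu/resource/dataset/example>
  dcat:contactPoint [
    a vcard:Individual ;
    vcard:fn "Jane Doe" ;                         # hypothetical uploader
    vcard:hasEmail <mailto:jane.doe@example.org>  # hypothetical address
  ] .
```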

skarampatakis commented 8 years ago

Would DataID cover all of our needs? https://github.com/dbpedia/DataId-Ontology. Given that it is built on top of DCAT, VoID, PROV-O, and FOAF, I believe it could be a possible candidate.

jindrichmynarz commented 8 years ago

I think DataID is powerful. However, providing all that metadata takes effort, and I'm not sure such effort would pay off in our use cases. Do you think it addresses a need of ours that cannot be addressed by the much more basic set of metadata proposed above?

skarampatakis commented 8 years ago

I believe many of the problems discussed here and on openbudgets/platform#13 can be addressed. I also like the fact that the metadata are attached to the dataset itself. Additionally, applications implemented on top of DataID would give additional value: compliance with datahub.io, automated publishing, and versioning, to name a few. Sure, it is early, as DataID is under development, as Markus Freudenberg said, but I think we should explore this direction thoroughly.

I think the transition would be easy, as we already use parts of the DataID pillars.

jindrichmynarz commented 8 years ago

> I believe many of the problems discussed here and on openbudgets/platform#13 can be addressed.

Can you say which problems are not addressed by the current proposal?

> I also like the fact that the metadata are attached to the dataset itself.

The current proposal also attaches metadata to the dataset (i.e., an instance of qb:DataSet).

> Additionally, applications implemented on top of DataID would give additional value: compliance with datahub.io, automated publishing, and versioning, to name a few. Sure, it is early, as DataID is under development, as Markus Freudenberg said, but I think we should explore this direction thoroughly.

Could you give an example of a concrete benefit you foresee for the OpenBudgets.eu use cases that would justify the effort of providing DataID metadata?

skarampatakis commented 8 years ago

> Can you say which problems are not addressed by the current proposal?

Agent roles, per the DataID motivation quoted below:

> By defining rights and responsibilities of agents together with the dataset metadata deals with common uncertainties as to whom to contact about a dataset or who published certain datasets (and many more).

Automated publishing on the LOD Cloud is also a wanted aspect. I know that we can already have this functionality using LP-ETL, but it requires an additional step in the pipelines. We can have it all at once using DataID.

Generally speaking, I believe that we have tried our best to find a best practice for metadata, which is pretty close to what DataID is offering. From my point of view, DataID wraps up what we have already recognised as needs, with a few additions, giving added value through the applications that will be developed around DataID. Plus, it is an upper-level ontology. It would be an early adoption, I can say, if we go in that direction, but I believe it is worth it.

jindrichmynarz commented 8 years ago

In terms of roles, the DCAT-AP Dataset component gives us the distinction between dcterms:publisher and dcat:contactPoint. If the use case is to clarify whom to contact about a particular dataset, then I think it is already addressed by dcat:contactPoint.
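
A minimal sketch of that distinction (hypothetical IRIs):

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix vcard:   <http://www.w3.org/2006/vcard/ns#> .

# Publisher (who issued the dataset) vs. contact point (whom to ask about it).
<http://data.openbudgets.eu/resource/dataset/example>
  dcterms:publisher <http://example.org/organization/athens> ;
  dcat:contactPoint [ a vcard:Individual ; vcard:fn "Jane Doe" ] .
```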

> Automated publishing on the LOD Cloud is also a wanted aspect.

What do you mean by automated publishing on the LOD Cloud? I don't think DataID automates this procedure.

> Generally speaking, I believe that we have tried our best to find a best practice for metadata, which is pretty close to what DataID is offering.

I agree. What I'm trying to figure out is whether DataID is worth the extra effort.

> Plus, it is an upper-level ontology.

In what sense do you mean it's an upper-level ontology? An upper ontology is a cross-cutting ontology, such as SUMO, while DataID is specific to the domain of dataset description.

skarampatakis commented 8 years ago

DataID provides more descriptive terms for agent roles. You can check the whole model over here.

> What do you mean by automated publishing on the LOD Cloud? I don't think DataID automates this procedure.

LOD publishing refers to datahub.io or similar data hubs, not to drawing a diagram. If a module to harvest dataset metadata from DataID metadata were developed, we could benefit: metadata would be published not only on a specific, let's say CKAN, instance, but on every compliant data hub.

> In what sense do you mean it's an upper-level ontology?

In the sense that it functions as a wrapper of other mid-level ontologies. Maybe upper is not the best fit for its description.

> I agree. What I'm trying to figure out is whether DataID is worth the extra effort.

What would be the extra effort to include DataID in our workflow?

jakubklimek commented 8 years ago

Hi. First of all, as a European project, if we are to publish a data catalog, we should adhere to the EU standard for data portals, which is DCAT-AP v1.1. (Actually, there are only a few required attributes; most are recommended or optional.) For this, we have the corresponding components in LP-ETL. There are also components which, from this representation, can generate a CKAN or DKAN catalog. These cover the identified metadata needs in OBEU (see above).

Introducing DataID would therefore mean adding more components to each pipeline, unless it can work with the DCAT-AP v1.1 metadata resulting from the current pipelines. Or would it be created manually for each dataset?

As @jindrichmynarz asked:

Yes, DataID allows us to model some advanced metadata features; however, is there an actual consumer of this additional metadata?

For the purpose of cataloguing, DCAT-AP is sufficient to support CKAN and DKAN. DataHub.io is a CKAN instance with a modified metadata structure, for which we actually had a DPU in UnifiedViews for DCAT-AP v1.0, and, if requested, this component can be recreated in LP-ETL for DCAT-AP v1.1. But still, the additional metadata for datahub.io needs to be entered manually, which is some additional effort and may therefore be optional.

@skarampatakis Could you maybe provide an example Turtle file describing one of the OBEU datasets using DataID so that its benefits are clearer and easier to assess?

skarampatakis commented 8 years ago

In my understanding, for the pipelines that we have already developed, we could build a new pipeline that would run on all transformed datasets and create the DataID triples, exploiting triples already included in the datasets. Graphs and dumps should then be updated accordingly.

For future datasets, we would have to replace the DCAT components with a DataID component.

> For the purpose of cataloguing, DCAT-AP is sufficient to support CKAN and DKAN. DataHub.io is a CKAN instance with a modified metadata structure, for which we actually had a DPU in UnifiedViews for DCAT-AP v1.0

I don't think this is already implemented, but it seems technically possible that, using DataID, datasets would be ready to be included in datahub.io or similar through the metadata itself. All the required data would be available, plus statistics like triple count, endpoint, example instances, etc.

> Compliance with DataHub: We will try to either establish a service that automatically transfers the DBpedia DataID metadata to http://datahub.io/ or preferably get the datahub.io team to allow for automatic retrieval of DataID files by datahub.io at regular intervals.

I believe we could develop a single pipeline that would do this job for whatever CKAN instance. Or, vice versa, develop a module that would harvest the metadata.

> @skarampatakis Could you maybe provide an example Turtle file describing one of the OBEU datasets using DataID so that its benefits are clearer and easier to assess?

I will try to do so as soon as possible. I think at least we should try.