Closed: @larjohn closed this issue 7 years ago.
We currently store metadata generated by the e-datasetMetadata and e-distributionMetadata DPUs in a separate named graph. @jakubklimek, can you please comment on this alternative solution?
OK, I don't see what is wrong with using the existing metadata DPUs. The metadata in the mentioned datasets and the requirements mentioned in this issue are covered by those.
I think this is mostly a matter of packaging data. While we use the metadata DPUs to produce a separate named graph, the Greek datasets mentioned by @larjohn directly contain metadata.
The proposals are also compatible only if we use the qb:DataSet's IRI as the IRI of its named graph. This may be limiting in case we want to have multiple instances of qb:DataSet in one dataset corresponding to one named graph. Provided that we haven't encountered this case (unless I overlooked it), I think we can postpone this concern and adopt a guideline that the qb:DataSet's IRI is used as the IRI of its named graph.
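Under that guideline, a dataset and its metadata could look like the following TriG sketch, in which the dataset's IRI doubles as the IRI of its named graph (the IRI and title are hypothetical, for illustration only):

```trig
@prefix qb:      <http://purl.org/linked-data/cube#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Hypothetical dataset IRI, used both as the named graph IRI and as the qb:DataSet IRI
<http://data.openbudgets.eu/resource/dataset/budget-2016> {
  <http://data.openbudgets.eu/resource/dataset/budget-2016>
    a qb:DataSet ;
    dcterms:title "Example budget 2016"@en .
}
```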
I think this does not matter much.
As far as I know, this is exactly the case: the IRI of the qb:DataSet matches the IRI of the named graph. I had one dataset split into 3 with the ESF projects in the Czech Republic, and I used this pattern there as well, as can be seen in the pipeline.
Well, since @larjohn raised this point, I think it matters to agree on a convention on which visualization OpenBudgets.eu tools can rely.
I think visualization tools should not rely on a particular splitting of data into named graphs, because that would unnecessarily tie them to a non-standardized convention. They should work with all triples in the datastore, querying for the data they need, so it does not matter whether the data is in one graph or in two.
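For instance, a query along these lines (a sketch; graph names are left unconstrained on purpose) retrieves every dataset and its title regardless of how the triples are partitioned into named graphs:

```sparql
PREFIX qb:      <http://purl.org/linked-data/cube#>
PREFIX dcterms: <http://purl.org/dc/terms/>

# Works whether metadata sits in the same graph as the data or in a separate one
SELECT ?dataset ?title
WHERE {
  GRAPH ?dataGraph     { ?dataset a qb:DataSet . }
  GRAPH ?metadataGraph { ?dataset dcterms:title ?title . }
}
```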
Right. I didn't mean to agree on a convention for splitting data into named graphs, but on a convention for attaching metadata. The metadata DPUs attach metadata to the IRI of the dataset's named graph. In the case of the Greek datasets, metadata is attached to instances of qb:DataSet. If we adopt a convention that the qb:DataSet's IRI is used as the IRI of the dataset's named graph, these 2 conventions would have the same effect. The only problem with this convention that I see is that it does not specify what happens when there are multiple instances of qb:DataSet in a dataset. I expected this to happen rarely, but you said you needed this for the Czech ESF data.
Actually, I said I avoided it in the Czech data by splitting it into 3 data graphs (and 3 metadata graphs), and I think this can be done every time: split the data into graphs according to the qb:DataSets, which seems only natural, since from the DCV point of view they are also different datasets.
I see. I misread your previous comment. Splitting datasets per qb:DataSet into multiple named graphs definitely works. I'm only afraid of the extra work it incurs: each loader and metadata DPU must be configured separately for each named graph.
Ultimately, this issue leads to a need to maintain a document with the conventions we adopted for the datasets of OpenBudgets.eu. Do you have a preference for how to maintain these conventions?
Well, the question still is whether this needs to be covered by a convention or whether it does not matter. In my opinion, the conventions should only contain items that, when not followed, break some functionality or user friendliness.
If a convention for this is to be established, I of course prefer our approach as it is always easier to merge 2 graphs if necessary than to split them.
The document with conventions is a google doc, so it can be amended as necessary, right?
Prompted by @larjohn mentioning metadata regarding interoperability, I'd like to resolve this issue. Would it be OK if we fleshed out the recommended metadata section in D1.5? Alternatively, we can also add validation of the required metadata into openbudgets/pipeline-fragments. @larjohn, is there any metadata that is a must for your work?
I am mostly concerned about the metadata currently presented by the OpenSpending viewer:
Additionally, most important is the Organization, the type of the organization, and maybe some kind of location, either inside the organization or outside of it. An ideal (to me) dataset would contain three more dataset dimensions:
Please excuse me if any of these extras are already included in the schema.
LP-ETL DPU e-datasetMetadata can provide all the metadata for OS viewer.
- dcterms:title
- dcterms:description
- dcterms:creator (however, only an IRI can be provided)

I think only dcterms:creator is problematic. However, it is problematic in two ways. First, the DPU allows providing only an IRI of the author. If we need properties describing authors (names, email addresses), then either only dereferenceable IRIs can be used and their representations must contain the required properties, or the DPU must directly allow filling in the required properties. Second, it is unclear who the author is. Is it the person who transformed the dataset? Is it the original government body that published the data? I think a lot of time was already spent discussing these questions in the context of DCAT and DCAT-AP.
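As a sketch of the dereferenceable-IRI option, the creator IRI could resolve to a description that carries the needed properties (all IRIs, names, and addresses below are hypothetical placeholders):

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix qb:      <http://purl.org/linked-data/cube#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

# Hypothetical dataset with the three dcterms properties filled in
<http://data.openbudgets.eu/resource/dataset/budget-2016>
  a qb:DataSet ;
  dcterms:title       "Example municipal budget 2016"@en ;
  dcterms:description "Approved expenditure of an example municipality."@en ;
  dcterms:creator     <http://data.openbudgets.eu/resource/agent/example-agency> .

# Dereferencing the creator IRI would yield a description like this
<http://data.openbudgets.eu/resource/agent/example-agency>
  a foaf:Agent ;
  foaf:name "Example Agency" ;
  foaf:mbox <mailto:contact@example.org> .
```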
Regarding the organization, obeu-dimension:organization (or its subproperty) is a required dimension of OpenBudgets.eu data. The organization is represented as org:Organization (or its subclass, e.g., from the Core Public Organization Vocabulary). The organization type can be represented via rov:orgType from the Registered Organization Vocabulary or cpov:classification from the Core Public Organization Vocabulary. What code list would you suggest using for organization types? The EU Publications Office offers an Organisation type code list, but it doesn't seem to fit very well. Location should be a property of the organization, such as org:hasSite or schema:address.
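A minimal sketch of such an organization description might look like this (the organization IRI and the type code list IRI are hypothetical placeholders, since the choice of code list is still an open question above):

```turtle
@prefix org:    <http://www.w3.org/ns/org#> .
@prefix rov:    <http://www.w3.org/ns/regorg#> .
@prefix schema: <http://schema.org/> .

# Hypothetical municipality; the rov:orgType value stands in for a yet-to-be-chosen code list
<http://data.openbudgets.eu/resource/organization/example-municipality>
  a org:Organization ;
  rov:orgType    <http://data.openbudgets.eu/resource/codelist/organization-type/municipality> ;
  schema:address "Example Street 1, Example City" .
```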
Guidance on metadata to be provided with OpenBudgets.eu datasets is already included in D1.4, but it is very generic, so we can provide more concrete recommendations in D1.5.
I will have a look!
In the meantime, what would you suggest for the time dimension?
Do you mean the temporal coverage of a dataset? DCAT recommends using dcterms:temporal for temporal coverage, which links to a dcterms:PeriodOfTime with a start and an end. Given that the suggested representation of the start and end is not RDF, I'd simply use schema:startDate and schema:endDate instead.
However, temporal coverage can be automatically computed from the dataset, using obeu-dimension:fiscalYear, obeu-dimension:fiscalPeriod, or obeu-dimension:date, so it's probably not necessary to provide it manually in the metadata.
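One way to compute it, sketched below, aggregates over observations carrying obeu-dimension:date values (the obeu-dimension namespace IRI is assumed here):

```sparql
PREFIX qb:             <http://purl.org/linked-data/cube#>
PREFIX obeu-dimension: <http://data.openbudgets.eu/ontology/dsd/dimension/>

# Derive each dataset's temporal coverage from the dates of its observations
SELECT ?dataset (MIN(?date) AS ?start) (MAX(?date) AS ?end)
WHERE {
  ?observation qb:dataSet ?dataset ;
               obeu-dimension:date ?date .
}
GROUP BY ?dataset
```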
I was wondering what were the expected values for the fiscalYear dimension.
D1.4 recommends using interval:Year as the range of obeu-dimension:fiscalYear. The instances of interval:Year can be drawn from <http://reference.data.gov.uk/id/gregorian-year/{YYYY}>. It seems most datasets in this repository already follow this convention.
Some datasets use British years (i.e. <http://reference.data.gov.uk/id/year/{YYYY}>) instead. I must admit I don't know the difference between a Gregorian and a British year. According to Wikipedia, it seems that Great Britain adopted the Gregorian calendar a long time ago, so I presume there's no practical difference between the two.
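For illustration, an observation following the Gregorian-year convention would carry a value such as the following (the observation IRI is hypothetical and the obeu-dimension namespace IRI is assumed):

```turtle
@prefix obeu-dimension: <http://data.openbudgets.eu/ontology/dsd/dimension/> .

# Hypothetical observation pointing at a reference.data.gov.uk Gregorian year
<http://data.openbudgets.eu/resource/dataset/budget-2016/observation/1>
  obeu-dimension:fiscalYear <http://reference.data.gov.uk/id/gregorian-year/2016> .
```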
Before closing this issue, let's make sure each dataset has the following information, according to what has been discussed so far:
This can all be entered in the existing metadata components in LP-ETL. Currently, I am working on the DCAT-AP v1.1 version of those components (https://github.com/linkedpipes/etl/tree/feature/dcatAp11), which would be even more suitable for this use, and I expect them to be ready by the end of this week.
My recommendation is to wait for them and use them for this task.
Cool, automating this is the way to go!
As of now, the old metadata components are deprecated and the new DCAT-AP Dataset and DCAT-AP Distribution components should be used. They implement the current DCAT-AP v1.1 specification.
The organization, dataset name and fiscal period items are quite clear in DCAT-AP v1.1 and I suggest we store the author/uploader of the dataset as a contact point.
Would DataID cover all of our needs? https://github.com/dbpedia/DataId-Ontology. Given that it is built on top of DCAT, VoID, PROV-O, and FOAF, I believe it could be a possible candidate.
I think DataID is powerful. However, providing all that metadata takes effort. I'm not sure if such effort would pay off in our use cases. Do you think that it addresses a need that we have that cannot be addressed by the much more basic set of metadata proposed above?
I believe many of the problems discussed here and on openbudgets/platform#13 can be addressed. I also like the fact that metadata are attached to the dataset itself. Additionally, applications that will be implemented based on DataID would give additional value: compliance with datahub.io and automated publishing, or versioning, to name a few. Sure, it is early, as DataID is in development, as Markus Freudenberg said, but I think we should explore this direction thoroughly.
I think the transition is easy, as we already use parts of the DataID pillars.
> I believe many of the problems discussed here and on openbudgets/platform#13 can be addressed.
Can you say which problems are not addressed by the current proposal?
> I also like the fact that metadata are attached to the dataset itself.
The current proposal also attaches metadata to the dataset (i.e. the instance of qb:DataSet).
> Additionally, applications that will be implemented based on DataID would give additional value: compliance with datahub.io and automated publishing, or versioning, to name a few. Sure, it is early, as DataID is in development, as Markus Freudenberg said, but I think we should explore this direction thoroughly.
Could you give an example of a concrete benefit you foresee to get for the OpenBudgets.eu use cases that would justify the effort in providing DataID metadata?
> Can you say which problems are not addressed by the current proposal?
Agent roles, referring to the DataID motivation below:
> Defining rights and responsibilities of agents together with the dataset metadata deals with common uncertainties as to whom to contact about a dataset or who published certain datasets (and many more).
Also, automated publishing on the LOD Cloud is a wanted aspect. I know that we can already have this functionality using LP, but it requires an additional step in the pipelines. We can have it all at once using DataID.
Generally speaking, I believe that we have tried our best to find a best practice for metadata, which is pretty close to what DataID is offering. From my point of view, DataID wraps up what we have already recognised as needs, with a few additions, giving added value through the applications that will be developed around DataID. Plus, it is an upper-level ontology. It is an early adoption, I can say, if we go in that direction, but I believe it is worth it.
In terms of roles, the DCAT-AP Dataset component gives us the distinction between dcterms:publisher and dcat:contactPoint. If the use case is to clarify whom to contact about a particular dataset, then I think it is already addressed by dcat:contactPoint.
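A sketch of that distinction in Turtle (the dataset IRI, publisher IRI, and contact person are all hypothetical):

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix vcard:   <http://www.w3.org/2006/vcard/ns#> .

# The publisher is the responsible body; the contact point is whom to ask about the data
<http://data.openbudgets.eu/resource/dataset/budget-2016>
  dcterms:publisher <http://data.openbudgets.eu/resource/organization/example-municipality> ;
  dcat:contactPoint [
    a vcard:Individual ;
    vcard:fn "Jane Doe" ;
    vcard:hasEmail <mailto:jane.doe@example.org>
  ] .
```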
> Also, automated publishing on the LOD Cloud is a wanted aspect.
What do you mean by automated publishing on LOD Cloud? I don't think DataID automates this procedure.
> Generally speaking, I believe that we have tried our best to find a best practice for metadata, which is pretty close to what DataID is offering.
I agree. What I'm trying to figure out is whether DataID is worth the extra effort.
> Plus, it is an upper-level ontology.
In what sense do you mean it's an upper-level ontology? An upper ontology is a cross-cutting ontology, such as SUMO, while DataID is specific to the domain of dataset description.
DataID provides more descriptive terms for agent roles. You can check the whole model over here.
> What do you mean by automated publishing on LOD Cloud? I don't think DataID automates this procedure.
LOD publishing refers to datahub.io or similar data hubs, not to drawing the diagram. If a module to harvest dataset metadata from DataID metadata is developed, we could benefit. Metadata would then be published not only on a specific instance, let's say CKAN, but on every compliant data hub.
> In what sense do you mean it's an upper-level ontology?
In the sense that it functions as a wrapper of other mid-level ontologies. Maybe upper is not the best fit for its description.
> I agree. What I'm trying to figure out is whether DataID is worth the extra effort.
What would be the extra effort to include DataID in our workflow?
Hi. First of all, as a European project, if we were to publish a data catalog, we should adhere to the EU standard for data portals, which is DCAT-AP v1.1. (Actually, there are only a few required attributes; most are recommended or optional.) For this, we have the corresponding components in LP-ETL. There are also components which, from this representation, can generate a CKAN or DKAN catalog. These cover the identified metadata needs in OBEU (see above).
Introducing DataID would therefore mean adding more components to each pipeline, unless it can work with the DCAT-AP v1.1 metadata resulting from the current pipelines. Or could it be created manually for each dataset?
As @jindrichmynarz asked:
> Yes, DataID allows us to model some advanced metadata features; however, is there an actual consumer of this additional metadata?
For the purpose of cataloguing, DCAT-AP is sufficient to support CKAN and DKAN. DataHub.io is a CKAN instance with a modified metadata structure, for which we actually had a DPU in UnifiedViews for DCAT-AP v1.0, and, if requested, this component can be recreated in LP-ETL for DCAT-AP v1.1. Still, the additional metadata for datahub.io needs to be entered manually, which is some additional effort and may therefore be optional.
@skarampatakis Could you maybe provide an example Turtle file describing one of the OBEU datasets using DataID so that its benefits are clearer and easier to assess?
In my understanding, for the pipelines that we have already developed, we could develop a new pipeline that would run on all transformed datasets and create the DataID triples, exploiting triples already included in the datasets. Graphs and dumps should then be updated accordingly.
For future datasets, we would have to replace the DCAT components with a DataID component.
> For the purpose of catalogization, DCAT-AP is sufficient for support of CKAN and DKAN. DataHub.io is a CKAN with modified metadata structure, for which we actually had a DPU in UnifiedViews for DCAT-AP v1.0
I don't think this is already implemented, but it seems technically possible that, using DataID, datasets would be ready to be included in datahub.io or similar hubs through the metadata itself. All the required data would be available, plus statistics like triple count, endpoint, example instance, etc.
> Compliance with DataHub: We will try to either establish a service that automatically transfers the DBpedia DataID metadata to http://datahub.io/ or preferably get the datahub.io team to allow for automatic retrieval of DataID files by datahub.io in regular intervals.
I believe we could develop a single pipeline that would do this job on whatever CKAN instance. Or, vice versa, develop a module that will harvest the metadata.
> @skarampatakis Could you maybe provide an example Turtle file describing one of the OBEU datasets using DataID so that its benefits are clearer and easier to assess?
I will try to do so as soon as possible. I think at least we should try.
Can we consistently provide metadata inside our datasets?
Such metadata would be:
You can find sample implementation in the Thessaloniki and Athens datasets.