Dataset series

dr-shorthair commented 5 years ago

A significant outstanding issue that keeps coming up as a sideline to other conversations [1][2] is the need for a recommended pattern for cataloguing dataset series. Budget data, satellite imagery, ...

Usually most of the description (metadata) is the same, but the temporal coverage changes between members of a series.

[1] https://github.com/w3c/dxwg/pull/789 [2] https://github.com/w3c/dxwg/issues/806

nicholascar commented 5 years ago

Or the spatial coverage changes, as is the case with maps in a series

davebrowning commented 5 years ago

Fully agree that there would be great value in some story, though I'm a little unsure that there is a single pattern that can be recommended. Worth trying, anyway, now we seem to have a stable definition of the qualities of dcat:Distribution. In the spirit of starting a conversation:

The spatial series seems a good entry point here: in that case would each map be a dcat:Dataset with its own dcat:Distributions? [DCAT-rev definition of distributions means that you would have to do this, since the content of different maps is different.] If we just link these (with some variant of dct:relation or dcat:qualifiedRelation) then we have the connections but no holder for metadata common across the series. We could have a parent dataset (trying hard not to call it an atlas) which has the maps as constituent parts, which would hint at common metadata and use something like prov:specializationOf to identify the members of the series, though some kind of generic (cataloguable) container type would work too, I think. Or is all that too simplistic?

For the temporal case, we might be able to do the same kind of pattern, but I suspect that there are naturally more things at play there, or possibly other use cases. Versioning - or more specifically change through the progress of time - seems to me to become part of the picture very quickly, whereas for spatial and other kinds of series it would be a bit more orthogonal. Perhaps the temporal kind of series is better looked at via some kind of service paradigm. That might all just be my narrow/limited world view, though. :smile:

makxdekkers commented 5 years ago

As far as I am concerned, I don't think we're going to resolve this issue before finalising the CR. There are several dimensions to this, and I've been involved in long discussions that did not lead to a resolution after many months. Maybe we should move it to the list for version 3, and set up a sprint for it after publication of the CR?

davebrowning commented 5 years ago

+1 to Makx's timing suggestion - I was assuming so, actually. I think there is some value in having such things in the Future Priority backlog - it may solicit further input or use cases.

dr-shorthair commented 5 years ago

Yes, sorry, I did not make it clear at the top of the issue that I intended this for the backlog. Definitely too late to do anything reasonable for DCAT 2. I just wanted to have a discrete issue to track.

dr-shorthair commented 5 years ago

The Interest Group on Data Discovery Paradigms of the Research Data Alliance has a task force on Data/metadata granularity, which is considering a taxonomy of data aggregation. @andrea-perego and @dr-shorthair are in contact and will likely bring more detailed requirements that can form the basis of developing DCAT patterns for this.

der commented 5 years ago

Sorry to be late with this comment, haven't been following this work.

As an outsider to this WG, can I reinforce the importance of this issue? This has been, and continues to be, a substantial pain point in our attempts to use DCAT. Sad to hear that it won't be addressed for DCAT 2.

In our experience with public sector datasets it is relatively rare for a dataset to be a unitary thing which can be downloaded in its entirety. More typically the non-realtime datasets we see comprise a series of updates (annual, quarterly, monthly, etc., as determined by some release cycle). Where possible we provide data services and dumps for the whole series. However, both users and publishers want to see the series of updates explicitly, as individual elements they can download separately, while still regarding the collection of those updates as a single dataset with common metadata. They want the data, and the presentation of it, to reflect that.

Possible approaches to this include:

Model each such dataset as a dcat:Catalog which then references each update as a separate dcat:Dataset with its own distribution, but put all the common metadata on the dcat:Catalog. This could work but then it is hard for a generic client to tell the difference between this use of Catalog and the "normal" uses of dcat:Catalog as (possibly hierarchical) collections of heterogeneous datasets. It's also hard to then point to a Distribution for the whole dataset. It would be possible to support this pattern through a marker subclass of catalog (dcat:DatasetSeries or some such).
Use dcat:Dataset for the series but allow a dataset to have multiple partial distributions, each with a separate temporal/spatial/other extent. This could work but the existing text and UML don't encourage per-Distribution extents and imply that a Distribution covers a whole dataset. Furthermore, if you have different formats available for each update then the relation between the different partial Distributions would be obscure.
Introduce a separate notion of a partition or element of a dataset which can have its own extent information and its own Distribution(s). This is the route we've used up to now and it works fine within our own systems, but it means that an external client expecting DCAT can't see the individual elements in the series. Sadly this is usually the grain size a harvester actually wants to see.
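
As a rough illustration of the first approach listed above, the Turtle below sketches a series modelled as a catalog with a marker subclass. It is only a sketch under stated assumptions: the `ex:` namespace, the `ex:DatasetSeries` class and all resource IRIs are hypothetical, and nothing here is defined or recommended by DCAT 2.

```turtle
@base <http://example.org/catalog> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ns#> .

# Hypothetical marker subclass so a client can tell a series from an ordinary catalog.
ex:DatasetSeries rdfs:subClassOf dcat:Catalog .

# Metadata shared by every update is stated once, on the series.
<#budget> a ex:DatasetSeries ;
    dct:title "Budget"@en ;
    dct:publisher <#finance-department> ;
    dcat:dataset <#budget-2018> , <#budget-2019> .

# Each update is a separate dataset with its own distribution.
<#budget-2018> a dcat:Dataset ;
    dct:title "Budget for 2018"@en ;
    dcat:distribution <#budget-2018-csv> .

<#budget-2019> a dcat:Dataset ;
    dct:title "Budget for 2019"@en ;
    dcat:distribution <#budget-2019-csv> .
```

Note that this sketch has no obvious place for a distribution of the whole series, which is one of the drawbacks raised for this approach.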

Even if you can't recommend a specific pattern for DCAT 2 would you be able to give some indication of the likely direction of travel (as a guide to those of us who need to work around the limitation in the meantime)?

makxdekkers commented 5 years ago

@der My opinion is that, if we want to support dataset series in DCAT -- and I agree it is a very common request -- the best approach might be to define a new class (subclass of dcat:Resource), e.g. dcat:DatasetSeries. The related datasets could then be linked using dct:hasPart and dct:isPartOf, or similar properties. The advantage is that we would not be 'repurposing' existing classes, which carries the risk that 'legacy' DCAT implementations might not understand what is going on. A new class would allow us to describe explicitly the behaviour of the series and the datasets in it. For example, if the behaviour were to include a certain 'inheritance' of 'common metadata' from the series to the individual datasets, that could be made clear in the description of the new class.
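
A minimal sketch of what this suggestion might look like in Turtle follows. It assumes the class name floated in this comment, dcat:DatasetSeries, which is not defined in DCAT 2; the resource IRIs and dates are made up for illustration.

```turtle
@base <http://example.org/catalog> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# dcat:DatasetSeries is the class proposed in this thread; DCAT 2 does not define it.
<#budget> a dcat:DatasetSeries ;
    dct:title "Budget"@en ;
    dct:publisher <#finance-department> ;
    dct:hasPart <#budget-2018> , <#budget-2019> .

# Members carry only what differs, here the temporal coverage.
<#budget-2018> a dcat:Dataset ;
    dct:isPartOf <#budget> ;
    dct:temporal [ a dct:PeriodOfTime ;
        dcat:startDate "2018-01-01"^^xsd:date ;
        dcat:endDate   "2018-12-31"^^xsd:date ] .

<#budget-2019> a dcat:Dataset ;
    dct:isPartOf <#budget> ;
    dct:temporal [ a dct:PeriodOfTime ;
        dcat:startDate "2019-01-01"^^xsd:date ;
        dcat:endDate   "2019-12-31"^^xsd:date ] .
```

Whether a client should treat the publisher stated on the series as applying to each member is exactly the kind of 'inheritance' behaviour that the class description would need to spell out.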

der commented 5 years ago

Thanks @makxdekkers. That would work for me.

In our case we would likely want to treat each of our resources as an instance of both dcat:DatasetSeries and dcat:Dataset. That way we could give a distribution (and extent) for the overall aggregate as well as distributions for the individual elements within the series. It would also give us a transition plan - publish resources now as dcat:Datasets (with the dct:hasPart and dct:isPartOf relationships to other dcat:Datasets for the elements) and then add declarations for rdf:type dcat:DatasetSeries when/if that becomes available with compatible semantics.
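
The transition plan described here could be sketched roughly as below. The resource IRIs are hypothetical, and the second stage is left commented out because dcat:DatasetSeries is not available in DCAT 2.

```turtle
@base <http://example.org/catalog> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Stage 1: publishable now, using only dcat:Dataset and Dublin Core part/whole links.
<#budget> a dcat:Dataset ;
    dct:title "Budget"@en ;
    dcat:distribution <#budget-all-years-dump> ;   # distribution for the overall aggregate
    dct:hasPart <#budget-2018> , <#budget-2019> .

<#budget-2018> a dcat:Dataset ;
    dct:isPartOf <#budget> ;
    dcat:distribution <#budget-2018-csv> .

<#budget-2019> a dcat:Dataset ;
    dct:isPartOf <#budget> ;
    dcat:distribution <#budget-2019-csv> .

# Stage 2: one extra type declaration, added only when/if the class becomes available
# with compatible semantics.
# <#budget> a dcat:DatasetSeries .
```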

Do you think there's a chance of squeezing a non-normative indication of this as a possible future pattern into the doc? Or at least a comment that use of dct:hasPart/isPartOf on datasets is in principle legal? Not sure how close to CR you are so appreciate this might be too late.

I mention this because https://www.w3.org/TR/vocab-dcat-2/#Property:catalog_has_part implies that you have domain/range declarations for dct:hasPart which would militate against this pattern. I'm assuming this is just a confusing presentation and that what you actually have are owl:allValuesFrom restrictions, and so not a problem.

makxdekkers commented 5 years ago

@der I was just expressing my personal opinion, and the group might not agree. In any case, I think we need to discuss this in more detail before making some sort of statement for the future -- it would not be good to make such a statement now and then to backtrack on it...

smrgeoinfo commented 5 years ago

For future reference (v3?): I agree DatasetSeries should be a separate subclass of dcat:Resource. Noting that a series would have all the properties that are specific to a dataset, from a modeling perspective it might be treated as a subclass of Dataset, with the addition of a mandatory (2..N) 'hasPart' relationship, and properties indicating how the 'granules' in the collection are defined (time, space...).
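
One way to read this modelling suggestion is as an OWL class with a minimum-cardinality restriction on dct:hasPart. The sketch below is an assumption about how that could be written, not an agreed design: `ex:DatasetSeries` and `ex:partitionedBy` are hypothetical names.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/ns#> .

# A series is a dataset that has at least two parts (the "2..N" above).
ex:DatasetSeries a owl:Class ;
    rdfs:subClassOf dcat:Dataset ,
        [ a owl:Restriction ;
          owl:onProperty dct:hasPart ;
          owl:minCardinality "2"^^xsd:nonNegativeInteger ] .

# Hypothetical property stating the dimension along which the granules are split.
ex:partitionedBy a owl:ObjectProperty ;
    rdfs:domain ex:DatasetSeries ;
    rdfs:comment "The dimension (time, space, ...) that distinguishes the members of a series."@en .
```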

dr-shorthair commented 5 years ago

Yes @smrgeoinfo that is my thinking as well.

Richer treatment of relations between resources (esp. datasets) is one of the features that has been added in DCAT2, so we have the platform already.

https://www.w3.org/TR/vocab-dcat-2/#qualified-relationship

matthiaspalmer commented 5 years ago

A very simple solution is to point to multiple resources by repeating dcat:downloadURL from a single distribution. If needed, additional properties like dct:title, dct:temporal, dct:spatial can be provided on these resources.

We do it like this in EntryScape, allowing people to upload or point to multiple resources.

makxdekkers commented 5 years ago

@matthiaspalmer If you repeat dcat:downloadURL on a single dcat:Distribution, how do you relate a property like dct:temporal to a particular dcat:downloadURL?

matthiaspalmer commented 5 years ago

@makxdekkers I was thinking of adding additional properties on the resource I pointed to via dcat:downloadURL. Like this:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<#di1> a dcat:Distribution ;
          dcat:downloadURL  <#downloadablefile1> ;
          dcat:downloadURL  <#downloadablefile2> .
<#downloadablefile1> dct:title "Budget for 2018"@en ;
          dct:temporal "2018"^^xsd:gYear .
<#downloadablefile2> dct:title "Budget for 2019"@en ;
          dct:temporal "2019"^^xsd:gYear .

Note 1: The temporal expression could of course be made with a start date and end date instead to indicate a timespan; I simplified for the example.

Note 2: Providing a dct:temporal expression on the dataset would be complementary in the sense that it indicates the whole timespan for all of the resources pointed to.

jakubklimek commented 5 years ago

@matthiaspalmer The problem with your solution is that temporal coverage is something a catalog user would like to search for. Therefore, it is a property of a dataset. In your case, you would have:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<#ds2018> a dcat:Dataset ;
    dct:title "Budget for 2018"@en ;
    dcat:distribution <#di1> .
<#di1> a dcat:Distribution ;
    dcat:downloadURL <#downloadablefile1> .

<#ds2019> a dcat:Dataset ;
    dct:title "Budget for 2019"@en ;
    dcat:distribution <#di2> .
<#di2> a dcat:Distribution ;
    dcat:downloadURL <#downloadablefile2> .

This also allows you to use richer metadata for each downloadable file.

Next, using DCAT2, you would have to implement a dataset relation grouping those 2, probably by introducing a third dataset representing the series and using https://www.w3.org/TR/vocab-dcat-2/#qualified-relationship. The unresolved issue here is that this relation is very common, but currently left to application profiles to be defined, and therefore will not be interoperable among DCAT2 catalogs.
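
The "third dataset" idea could be sketched with the DCAT 2 qualified relation pattern as below. The dcat:qualifiedRelation, dcat:Relationship, dcat:hadRole and dcat:Role terms come from DCAT 2; the role `ex:seriesMember` and the resource IRIs are hypothetical, which is precisely the interoperability gap described above, since each application profile would have to pick its own role.

```turtle
@base <http://example.org/catalog> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/ns#> .

# Hypothetical role; DCAT 2 leaves the choice of roles to application profiles.
ex:seriesMember a dcat:Role .

# A third dataset standing for the series, linked to each per-year dataset.
<#budget-series> a dcat:Dataset ;
    dct:title "Budget (all years)"@en ;
    dcat:qualifiedRelation
        [ a dcat:Relationship ;
          dct:relation <#budget-2018> ;
          dcat:hadRole ex:seriesMember ] ,
        [ a dcat:Relationship ;
          dct:relation <#budget-2019> ;
          dcat:hadRole ex:seriesMember ] .
```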

Btw. this was my requirement for DCAT2 all along (https://github.com/w3c/dxwg/issues/806) to add this relationship explicitly. Unfortunately, it was dropped in the process, postponed for DCAT3, which means we will have to push for its implementation in DCAT-AP, and, if not possible, do it in a Czech AP (DCAT-AP-CZ) and then change it when it eventually makes its way to DCAT.

matthiaspalmer commented 5 years ago

@jakubklimek I was describing an alternative solution to what you described. In fact I was describing the way we are doing it in Sweden right now; quite a few datasets have been expressed using repeated downloadURLs. Additional properties on the resources like dct:title and dct:temporal may be expressed in individual data catalogs but unfortunately at this stage they are not harvested to the national data portal and therefore not further to the EDP.

I fail to see a problem with the approach I outlined. Having a temporal coverage on both the dataset level and on individual files allows people to search for datasets the way you described.

Note 1: I would argue that the solution I outlined is very natural as it is well aligned with the basic semantics of RDF, i.e. you make statements about resources referenced in the subject position of a triple. All other constructions require additional semantics to be specified.

Note 2: What I am suggesting is much simpler and requires less overhead for maintainers of data catalogs. Alternatively, if solved by the tool, much less complexity in implementations.

Note 3: If the solution I outlined is not enough, because you need to say more things about an individual downloadable file, then nothing prohibits the other option from being used in that case, if the way to express relations between datasets can be clarified. However, I fail to see a use case where you need the full strength of the metadata on a dataset to describe the difference between two downloadable files.

makxdekkers commented 5 years ago

@matthiaspalmer I would argue that you are extending the DCAT model. The problem is the one of files that are related in different ways than currently foreseen by the Catalog-Dataset-Distribution chain. What I think you are doing is adding an extra class below the Distribution, i.e. DownloadableFile (even if you don't explicitly declare that) with additional metadata. In general, DCAT implementations would expect the content of dcat:downloadURL to be, well, the URL of the actual file. That is probably the reason that your solution does not play well with the European Data Portal because that implementation does not include the logic to handle the extra implicit class. I think the issue is clear, that people need a way to group files together that are related in other ways than by having different formats of the same data, but it needs additional modelling, as also discussed at https://github.com/w3c/dxwg/issues/1085. Any proposed solutions are welcome as input to further discussions that may take place in the development of DCAT 3, but it seems to me to be too early to decide on the best way forward.

matthiaspalmer commented 5 years ago

@makxdekkers I agree, adding additional properties on these resources (downloadable files) amounts in some sense to extending the model from the perspective of DCAT. If this behaviour should be acknowledged and described in DCAT it would be appropriate to encourage the use of a new class like dcat:DownloadableFile.

However, I would like to point out two things:

Just repeating dcat:downloadURL: Repeating the property without providing additional properties would be within the current scope of DCAT. In many cases it is enough to just be able to point to several files that taken together make up a single distribution. This is what we are doing currently in Sweden from the harvester perspective.

It depends on how explicit you want the DCAT model to be: Even if I would prefer inclusion of such a class for clarity in the DCAT model, I would like to point out that there are already similar situations with other properties where this is not a requirement. For instance, I think it is relevant to compare with dcat:endpointDescription: its range is just described as rdfs:Resource and in many cases it will point to resources that for most practical purposes are downloadable files, like OpenAPI or WSDL. Hence, it is very close to dcat:downloadURL. But the usage note also says that it is OK to embed an entire Hydra expression directly in the graph, which indicates that the resource is not necessarily a downloadable file. The only difference between using dcat:downloadURL and dcat:endpointDescription in this case is in the second usage note. Adding such a note to dcat:downloadURL would be quite easy, right?

A very similar argument could be made for dct:rights which may point to a file or be a graph structure using something like ODRS.

makxdekkers commented 5 years ago

@matthiaspalmer One can argue that any behaviour is allowed as long as a specification does not explicitly prohibit it. I could point out that the definition of dcat:downloadURL -- The URL of the downloadable file in a given format -- seems to indicate, by using the article 'the', twice, that there should only be one. But indeed the definition does not say that explicitly so we could argue our positions ad infinitum.

However, the more important point for me is that it is clear that we have an issue that needs to be further discussed and for which we need to try to find a commonly agreed solution. I think it would be good to have that discussion but to have it in a wider context, because I have seen several use cases around this issue of more general relationships between data files, beyond format.

Of course, you are fully in your right to implement this any way you want, but to me it seems important that we find a commonly agreed way to model this case and similar cases, so that the solution is interoperable. We need to have this discussion but I don't think it would be wise to quickly paste something into the upcoming Candidate Recommendation. I feel it needs a more thorough discussion, and your proposal would certainly be one of the inputs into that discussion.

matthiaspalmer commented 5 years ago

@makxdekkers Fully agree that this issue should not be rushed. My intention was to provide input to the discussion about another (arguably simpler) solution. Lacking better agreed-upon alternatives, we (EntryScape platform and the current recommendation in Sweden) will continue to use this solution for expressing timeseries, as the expression is close to trivial and requires little or no deviation from the model of DCAT.

One final note, in DCAT-AP the dcat:downloadURL property has the cardinality 0..n, while the dct:format property has the cardinality of 0..1. I have assumed that someone with insight (e.g. you) provided clarifications of the intentions of the original DCAT when providing these explicit cardinalities. Maybe I have overinterpreted this, but at least you see where I am coming from.

makxdekkers commented 5 years ago

@matthiaspalmer

One final note, in DCAT-AP the dcat:downloadURL property has the cardinality 0..n, while the dct:format property has the cardinality of 0..1. I have assumed that someone with insight (e.g. you) provided clarifications of the intentions of the original DCAT when providing these explicit cardinalities.

Unfortunately, we don't have the issue tracker that we used for the development of DCAT-AP in 2013 anymore, but I can see that the first versions of the specification had 0..1. This changed to 0..n sometime around the end of April 2013, but I can't find out why.

kcoyle commented 5 years ago

This actually makes sense to me, although my interpretation may not be what others see. Logically you can have only one format per resource since format defines the format of the resource in the subject position. However, if you have more than one copy, e.g. you have mirror sites, then you can have more than one download url. Each download URL links to a file in that single format.
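
The mirror-site reading can be illustrated with a short Turtle sketch: one distribution, one format, and two download URLs that serve the same file. The example.org and example.net hosts and the distribution IRI are placeholders.

```turtle
@base <http://example.org/catalog> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# One format (0..1) for the distribution, several download URLs (0..n) for identical copies.
<#budget-2018-csv> a dcat:Distribution ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:downloadURL <https://data.example.org/budget-2018.csv> ,
                     <https://mirror.example.net/budget-2018.csv> .
```

Under this reading every statement about the distribution applies equally to each linked file, which is what distinguishes it from the repeated-downloadURL pattern for files with different content.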

jakubklimek commented 5 years ago

@kcoyle This case, i.e. mirrors of exactly the same file, is the only case where it makes sense to me to have multiple downloadURLs, i.e. when all metadata of the dataset and distribution apply to all the linked files.

@matthiaspalmer When there is a difference in the files, such as budget for various years, I lean towards splitting this into datasets and defining their relations properly. Splitting a distribution into multiple files arbitrarily makes it harder to interpret what their relation is. Regarding your notes:

I think your approach actually makes it harder to interpret DCAT catalogs, because you would add another level (Dataset-Distribution-File) where metadata may be present, when what you want to describe can be done with what we already have (Dataset-Distribution, Dataset relations). Therefore, implementations would have to look for relevant metadata in one more place.

Note 1: I would argue that the solution I outlined is very natural as it is well aligned with the basic semantics of RDF, i.e. you make statements about resources referenced in the subject position of a triple. All other constructions require additional semantics to be specified. Note 2: What I am suggesting is much simpler and requires less overhead for maintainers of data catalogs. Alternatively, if solved by the tool, much less complexity in implementations.

I agree that it is natural from the RDF point of view, but my comments regarding additional complexity apply here. It would be easier only if your approach would be recommended instead of the already specified one. When it is in addition, it is adding complexity and I can foresee implementations that implement only parts of DCAT because of this, making them not interoperable in the end.

Note 3: If the solution I outlined is not enough, because you need to say more things about an individual downloadable file, then nothing prohibits the other option from being used in that case, if the way to express relations between datasets can be clarified. However, I fail to see a use case where you need the full strength of the metadata on a dataset to describe the difference between two downloadable files.

What if you want to actually describe (for machine readability) the nature of the relationship of the files, e.g. that it is in fact a time series, not a split according to geospatial features? This is exactly what is done in DCAT2 on the level of datasets already (partially).

matthiaspalmer commented 5 years ago

@kcoyle yes I agree with the mirror case, to distinguish from that case you would need to provide additional triples like I described, alternatively you could just provide some information in the dct:description of the distribution. Not perfect but it works as long as we have humans looking at the metadata.

@jakubklimek well, as I see it, if I have to choose between:

an approach which is close to the RDF information model which requires just a slight comment on the current model of DCAT and only a few extra triples OR
creating separate datasets even though I do not consider them to be separate, maybe create a top-level "abstract dataset" to connect them all, use the dcat:Relation construction in DCAT for creating relations between them, duplicate certain metadata to make them findable and overall create a lot more triples.

I would go for option 1 every day and I am sure a lot of other people in the linked data world would argue the same way.

I cannot resist throwing another log on the bonfire here: The data model is one thing, but I think it is important to consider the perspective of providing a good user experience in a data portal. For instance, we are helping a few organizations that have suppliers' ledgers (and other datasets) where additional files are added every month. If many datasets are realized as 30+ file-oriented datasets (and growing all the time), how would you find a nice overview or a starting point when you browse or search? Maybe this can be compensated by smart designs in a data portal that understands these relations, but that is certainly making life more complicated for portal developers.

And if portal providers do not compensate for this, well, it is not going to make developers looking for datasets very happy to find a set of interconnected datasets rather than a single dataset with a list of files. As a developer myself, I feel that the need to have the "best" information model is making things impractical and awkward, which will just scare away one (or several) of the main target groups for doing this in the first place.

makxdekkers commented 5 years ago

... you can have only one format per resource since format defines the format of the resource in the subject position. However, if you have more than one copy, e.g. you have mirror sites, then you can have more than one download url. Each download URL links to a file in that single format.

Mirroring could indeed have been the reason for the 0..n cardinality in DCAT-AP.

makxdekkers commented 5 years ago

... Maybe this can be compensated by smart designs in a data portal that understands these relations, but that is certainly making life more complicated for them.

It seems to me that, if files with different data are modelled as distributions of one dataset, the data portal also has to be smart enough to understand the relationship between the distributions. In both cases, you need the logic to make sense of the structure. Personally, I find a model that says that all Distributions have the same data easier to understand, and easier to program for, than a model that says that Distributions may have the same data but they also may not -- and the system needs to figure out which is which. But smart programming gets around any and all obstacles, I know!

And if portal providers do not compensate for this, well, it is not going to make developers looking for datasets to use very happy to find an ever growing list of interconnected datasets rather than a single dataset with a list of files. As a developer myself, I feel that the need to have the "best" information model is making things impractical and awkward, which will just scare away one (or several) of the main target groups for doing this in the first place.

I often find the argument that 'developers' can be 'scared away' by complexity a bit strange. Developers usually work for someone and have a job to deliver a product or service, and I am pretty sure that they are smart enough to build systems that work for their customers based on the data that is there.

In my mind, it is a question of optimisation -- I think that the current DCAT model is optimised, or focused if you will, on the simpler situations, and not on complex cases (it's not the DCAT model that requires complex solutions, but the complexity is in the real world). Of course, we could decide at some point in time that the complex cases are the majority and therefore the model needs to be changed in order to optimise it for the complex cases, but we need to look at the evidence to decide that we are in that situation. And there always needs to be a balance -- we don't want to optimise the model for time series and make life harder for people with other types of relationship between data files.

matthiaspalmer commented 5 years ago

@makxdekkers I think you misunderstood me. I fully agree that distributions correspond to different representations of a dataset. Multiple distributions should not be used to point to individual files that together form the dataset. What I am proposing is that a single distribution is made up of several files pointed to by repeated dcat:downloadURL. I have said this higher up in the thread, but maybe it got lost in all the comments.

Furthermore, my argument that developers are going to get this wrong is based on experience, not speculation. Implementations of harvesting software at a national level have shown severe problems in getting the existing DCAT-AP right, ranging from not managing to produce correct RDF to expressing more than half of the fields wrongly. I have seen this with at least four existing vendors. I think the reason is a combination of lack of knowledge of RDF, low priority on standards compliance, and the effort needed to make preexisting information models fit the specification. But maybe I have had bad luck and others have better experience with harvesting from different vendors; I am sure people at EDP can tell you a lot more about this.

I represent a company that takes pride in building everything on top of RDF and linked data principles, hence it is in our DNA to go the extra mile to get it right semantically. But still, at the end of the day we have to make our customers happy, which implies a good user experience. We cannot force them to create one new dataset per uploaded file; that would make no sense. Potentially we can hide this from them by treating certain datasets as files, but that will require some careful thought and copy-pasting of metadata between these file-oriented datasets.

I think an important aspect of an information model like DCAT is that its semantics should feel natural. In the current specification it says in the DCAT scope:

A dataset is a collection of data, published or curated by a single agent. Data comes in many forms including numbers, words, pixels, imagery, sound and other multi-media, and potentially other types, any of which might be collected into a dataset.

If for some reason the data provider (the agent) needs to divide the dataset into smaller parts due to its size or due to practical maintenance issues, that is a question of how to access its representation. It should not put bounds on the scope of the dataset. If a data owner thinks of their budget as a single dataset because it is described, published and curated in a unified manner, should we then tell them that, no, that is not a dataset because you have divided it into multiple files?

And what would happen if the data provider happens to provide the budget via an API in addition to downloadable files? Having one dataset per downloadable file will now become really weird because each of these datasets would need to have two distributions, one for the downloadable file and the second for the API with some restriction (a parameter) allowing you to access exactly the same information as is available in the downloadable file. It is not certain that the API would even support this, as it would depend on the way the downloadable files have been divided.

With the approach I outlined it would simply be one dataset with two distributions. The first distribution would point to the downloadable files via repeated dcat:downloadURL and the second distribution to the API (potentially using a DataService instance).
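
The arrangement described in the preceding paragraph could be sketched as follows. The dcat:accessService, dcat:endpointURL and dcat:servesDataset terms are from DCAT 2; the IRIs and hosts are placeholders, and whether repeated dcat:downloadURL values may point to files with different content is the point under dispute in this thread.

```turtle
@base <http://example.org/catalog> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# One dataset, two distributions: a set of files and an API.
<#budget> a dcat:Dataset ;
    dct:title "Budget"@en ;
    dcat:distribution <#budget-files> , <#budget-api> .

# Files covering different years, pointed to by repeated dcat:downloadURL.
<#budget-files> a dcat:Distribution ;
    dcat:downloadURL <https://data.example.org/budget-2018.csv> ,
                     <https://data.example.org/budget-2019.csv> .

# The same data offered through a service.
<#budget-api> a dcat:Distribution ;
    dcat:accessURL <https://api.example.org/budget> ;
    dcat:accessService <#budget-service> .

<#budget-service> a dcat:DataService ;
    dcat:endpointURL <https://api.example.org/budget> ;
    dcat:servesDataset <#budget> .
```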

makxdekkers commented 5 years ago

@matthiaspalmer I understood your approach, I think correctly, as repeating the downloadURL in a single distribution pointing to several files that have different contents. My argument is that that approach is based on your personal reading of the specification, one that I think is stretching both the letter and the intention of the specification too far.

Other than that, this discussion is very interesting and I hope that we can reach some consensus in the course of developing DCAT version 3. For the time being, let's not jump to conclusions. As I wrote, you have every right to build systems based on what you feel is natural, but it is not guaranteed that your solution is widely interoperable.

dr-shorthair commented 5 years ago

STAC is an important emerging standard for spatial data series - mostly for continuous/imagery and similar. We should make sure to align with its approach and language.

proccaserra commented 4 years ago

I am closely following this issue and would be interested in obtaining guidance from the group on how best to represent a dcat:Dataset that would have dcat:Dataset as its parts. If I understand correctly, the group leans towards creating a dcat:DatasetSeries class, but this won't be happening before v3. Am I right that the recommended way would be to rely on dcat:qualifiedRelation should anyone attempt to represent such an aggregate dataset?

I was wondering if the group could document an example on how to do so?

Finally, I was wondering if the group could document the main reason against adding a hasPart relation to the dcat:Dataset class?

makxdekkers commented 4 years ago

@proccaserra As far as I am concerned, this group should not recommend a particular way to model dataset series until we have started and completed that discussion for V3. I am not even sure that this group leans towards creating a class dcat:DatasetSeries, but that should be one of the options to be discussed.

proccaserra commented 4 years ago

@makxdekkers thank you for this insight, much appreciated. Could you (or anyone from the group) comment on my other questions? I owe the group an explanation: I am trying to create DCATv2-based JSON-LD context files for DATS (Data Article Tag Suite JSON schema), and I am currently weighing my options, since DATS allows the DATS.Dataset type to have a DATS.Dataset as a part. I am currently considering dcat:qualifiedRelation, but I'd like to understand 1) whether this would accept a partOf value, and 2) if option 1 is valid, what the main reason is for not including the hasPart relation in DCATv2.

dr-shorthair commented 4 years ago

@proccaserra dct:hasPart is specifically mentioned here: https://www.w3.org/TR/vocab-dcat-2/#Property:resource_relation

Note that we also very much accept the OWA (open-world assumption) rules, so any other RDF property or class can be used if required. I would encourage you to experiment with patterns for describing data-series in a DCAT context, and then bring your experience to the table so we can learn.

(Personally I'm a little skeptical that hasPart is enough, because it does not give you a way to characterize the nature of the part-whole relationship. At the very least it will likely need to be supplemented by relationships between individual parts so you can characterize a sequence and other 'topological' relationships within a series - be they temporal, spatial, versioning, or on some other dimension.)

Note that dcat:Resource is intended to provide an extension point for specialized applications.
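
As a starting point for such experiments, here is a minimal Turtle sketch of the two DCAT 2 options mentioned in this exchange: a plain dct:hasPart link, and a dcat:qualifiedRelation whose role says what kind of relationship it is. The ex: IRIs and the ex:isPartOf role are hypothetical; DCAT 2 does not itself define a part-of role, so the role would have to come from another vocabulary or a local one.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

# Option 1: plain part-whole link from the aggregate to its members
ex:budget-all a dcat:Dataset ;
  dct:hasPart ex:budget-2019 , ex:budget-2020 .

# Option 2: qualified relation on a member, pointing at the aggregate
ex:budget-2019 a dcat:Dataset ;
  dcat:qualifiedRelation [
    a dcat:Relationship ;
    dct:relation ex:budget-all ;
    dcat:hadRole ex:isPartOf    # hypothetical role, not a DCAT term
  ] .

ex:budget-2020 a dcat:Dataset .
```

Neither option says anything about the order of the members, which is the gap noted in the parenthetical remark above.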

makxdekkers commented 4 years ago

I'd like to point back to my message in June 2017, in which I tried to identify a number of types of 'versioning' relationships:

Evolution: for example, a dataset that is published with year-to-date information; every week or month, new, recent data is appended to the existing data.
Replacement: for example, existing data was wrong in some way, and a new dataset is published that replaces the old data.
Snapshots: for example, continuously changing data like the state of traffic or weather maps with hourly snapshots.
Time series: for example, annual budget data.
Conversion: for example, data that is transformed from one coordinate system to another, or from one set of units to another; similar to translation of textual resources.
Lower/higher granularity: for example, maps in different scales, images in different resolutions, compression like MP3 versus CD sound, and summaries of large amounts of data.

There is also the use case ID32 Relationships between Datasets with some scenarios.

dr-shorthair commented 4 years ago

@makxdekkers is 'Time series' different to 'Evolution'?

dr-shorthair commented 4 years ago

@der said 'In our case we would likely want to treat each of our resources as an instance of both dcat:DatasetSeries and dcat:Dataset'

I would propose - if a new class is needed at all - that dcat:DatasetSeries rdfs:subClassOf dcat:Dataset. Else just set dcterms:type <http://registry.it.csiro.au/def/isotc211/MD_ScopeCode/series> ; (or similar)?
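
A rough Turtle sketch of the subclass option, for comparison with the dcterms:type option. Note the assumptions: dcat:DatasetSeries is only a proposed term at this point in the discussion, not something defined in DCAT 2, and ex:budget is a hypothetical resource.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# Proposed axiom (not part of DCAT 2)
dcat:DatasetSeries rdfs:subClassOf dcat:Dataset .

# Under RDFS entailment, ex:budget is then also a dcat:Dataset,
# so it does not need to be typed with both classes explicitly.
ex:budget a dcat:DatasetSeries ;
  dct:title "Annual budget"@en .
```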

andrea-perego commented 4 years ago

@dr-shorthair said:

@der said 'In our case we would likely want to treat each of our resources as an instance of both dcat:DatasetSeries and dcat:Dataset'

I would propose - if a new class is needed at all - that dcat:DatasetSeries rdfs:subClassOf dcat:Dataset. Else just set dcterms:type <http://registry.it.csiro.au/def/isotc211/MD_ScopeCode/series> ; (or similar)?

GeoDCAT-AP uses the latter approach - see the section on resource types and related example:

## Resource type for series
[] a dcat:Dataset;
  dct:type <http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series> .

makxdekkers commented 4 years ago

@makxdekkers is 'Time series' different to 'Evolution'?

That depends on how these things are defined. The way I think about it is something like this:

Time series: a group of datasets that are related along a time dimension, for example a dataset with the budget for 2019 and another dataset with the budget for 2020; so two datasets that contain the same type of data for a different time period

Evolution: a single dataset that is updated 'in situ' over time with additional or modified data, for example a dataset with year-to-date expenditure data; so a single dataset that changes over time

There are cases where you could model data either way; for example, in the case of YTD information, you could publish a snapshot every time it changes as a dataset with timestamp, or add additional data in the same dataset. It's up to the publisher to decide which one fits the needs of the users. I know a case where a YTD is updated in situ but then published as a snapshot every six months.
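
The two readings could be sketched in Turtle along these lines. This is an illustration only: the ex: IRIs and all dates are invented, and the choice of dct:temporal, dct:accrualPeriodicity and dct:modified to express the difference is an assumption, not something agreed in this thread.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# Time series: one dataset per period, same kind of data in each
ex:budget-2019 a dcat:Dataset ;
  dct:temporal [ a dct:PeriodOfTime ;
    dcat:startDate "2019-01-01"^^xsd:date ;
    dcat:endDate   "2019-12-31"^^xsd:date ] .

ex:budget-2020 a dcat:Dataset ;
  dct:temporal [ a dct:PeriodOfTime ;
    dcat:startDate "2020-01-01"^^xsd:date ;
    dcat:endDate   "2020-12-31"^^xsd:date ] .

# Evolution: a single dataset that is updated in situ
ex:expenditure-ytd a dcat:Dataset ;
  dct:accrualPeriodicity <http://purl.org/linked-data/sdmx/2009/code#freq-M> ;  # monthly
  dct:modified "2020-06-30"^^xsd:date .
```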

agreiner commented 4 years ago

Hm, typical usage in my own circles is that a time series is a dataset that has time as one variable within that one dataset. I would suggest avoiding using the term to talk about a series of datasets, to avoid confusion.

kcoyle commented 4 years ago

Series v Evolution - just to give some support to this, library data recognizes series as:

"Serial: Bibliographic item issued in successive parts bearing numerical or chronological designations and intended to be continued indefinitely. Includes periodicals; newspapers; annuals (reports, yearbooks, etc.); the journals, memoirs, proceedings, transactions, etc., of societies; and numbered monographic series, etc. "

Basically, issued serially over time; a succession of parts or entries or files.

"Integrating resource [kc: terrible name, but like Makx's "evolution"]: Bibliographic resource that is added to or changed by means of updates that do not remain discrete and are integrated into the whole. Examples include updating loose-leafs and updating Web sites. Integrating resources may be finite or continuing."

I think "serial" and "updated" / "integrated" are pretty common patterns. The difficulty is in giving them clear names and definitions. And of course there will be some materials that are a bit of both, and I have no idea how to handle those in a user-friendly way.

andrea-perego commented 4 years ago

Discussion on this topic is also going on in the framework of DCAT-AP.

The following posts provide a survey on how dataset series (and versions) are dealt with in DCAT-AP extensions:

https://github.com/SEMICeu/DCAT-AP/issues/155#issuecomment-670503623

https://github.com/SEMICeu/DCAT-AP/issues/155#issuecomment-711944145

riccardoAlbertoni commented 4 years ago

I have prepared a wiki page with a starting example depicting dataset time series. Do not hesitate to complete the page with alternative examples, and to amend it if I have overlooked any of the key points of the discussion above. I hope that having common examples to reason about will help to stabilize a solution in the next DCAT call. See https://github.com/w3c/dxwg/wiki/Examples-on-dataset-series

andrea-perego commented 4 years ago

I have prepared a wiki page with a starting example depicting dataset time series. Do not hesitate to complete the page with alternative examples, and to amend it if I have overlooked any of the key points of the discussion above. I hope that having common examples to reason about will help to stabilize a solution in the next DCAT call. See https://github.com/w3c/dxwg/wiki/Examples-on-dataset-series

Thanks, @riccardoAlbertoni .

I've added some examples, and made a few editorial changes.

riccardoAlbertoni commented 4 years ago

This issue was automatically closed by the last PR merge. We need this open, as we want to collect feedback on this issue with the next FPWD. Am I right @andrea-perego?

agbeltran commented 4 years ago

Should this issue also be referenced in the Editors' note stating "The creation of a specific class for dataset series is under discussion.", or should we rather open a specific issue for that discussion?

riccardoAlbertoni commented 4 years ago

I would suggest creating a new GitHub Issue, in which we can reprise the discussion quoting the views already expressed in the existing GitHub issue.

andrea-perego commented 4 years ago

@riccardoAlbertoni said:

This issue was automatically closed by the last PR merge. We need this open, as we want to collect feedback on this issue with the next FPWD. Am I right @andrea-perego?

I'm actually more in favour of closing this one, and creating a new issue once feedback is submitted.

riccardoAlbertoni commented 3 years ago

@riccardoAlbertoni said:

This issue was automatically closed by the last PR merge. We need this open, as we want to collect feedback on this issue with the next FPWD. Am I right @andrea-perego?

I'm actually more in favour of closing this one, and creating a new issue once feedback is submitted.

In that case, should we get rid of the issue mentioned in the FPWD? Or do we plan to leave the mention of closed issues to provide context?

andrea-perego commented 3 years ago

@riccardoAlbertoni said:

This issue was automatically closed by the last PR merge. We need this open, as we want to collect feedback on this issue with the next FPWD. Am I right @andrea-perego?

I'm actually more in favour of closing this one, and creating a new issue once feedback is submitted.

In that case, should we get rid of the issue mentioned in the FPWD? Or do we plan to leave the mention of closed issues to provide context?

I suggest we decide about this during our next call.

w3c / dxwg

Dataset series #868