w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Summary statistics [RSS] #84

Closed jpullmann closed 3 years ago

jpullmann commented 6 years ago

Summary statistics [RSS]

Express summary statistics and descriptive metrics to characterize a Dataset.


Related use cases: Summarization/Characterization of datasets [ID33]

makxdekkers commented 6 years ago

DQV should be able to meet these requirements.

dr-shorthair commented 5 years ago

Do we also need statistics on distributions? This requirement is suggested by the comment submitted by Daniel Pop [1]. Of course it also depends on how we resolve the matter of 'information equivalence' of different distributions.

[1] https://lists.w3.org/Archives/Public/public-dxwg-comments/2019Jan/0013.html

andrea-perego commented 5 years ago

I think we shouldn't prevent this - as done for other information. The question is how to put this option in the spec.

I'm a bit reluctant to explicitly add properties in class definitions where we don't have real-world use cases and/or implementation evidence. So, this could be included in the "guidance" part of the spec.

About "information equivalence", (again) -1 to it. This ends up being a matter of the "granularity" of the notion of dataset, which is mainly a data provider choice (possibly also based on the requirements of the intended users).

makxdekkers commented 5 years ago

@dr-shorthair I am not quite sure how you derive a requirement for statistics on datasets? If there is a need for it, maybe we could refer to DQV or Data Cube? In my mind, Daniel's point (1) could be resolved by modelling the real-time data stream as a dcat:DataService and modelling the CSVs as separate datasets.

dr-shorthair commented 5 years ago

The statistic that Daniel mentioned is the frequency or spacing of members in a time series, where various distributions might have fixed spacing that is different (usually coarser) than what is available from the underlying dataset. I was on the point of creating an explicit issue for this aspect alone, but since this is an aspect of dataset statistics I thought it would be best to open the discussion here first.

dr-shorthair commented 5 years ago

@makxdekkers I did not derive a new requirement for dataset statistics - this was one of the original requirements taken from UCR.

However, I do wonder if time-series are such a common case that they might deserve special treatment. i.e. complement dct:temporal (coverage) with one more number - the item-accrual-periodicity. And since dct:accrualPeriodicity has been hijacked (in the DCAT context) to describe the publication period, it might have to be a new property? See #728

smrgeoinfo commented 5 years ago

If I understand correctly, the concept @dr-shorthair is looking for is named temporalResolution in ISO19115-1, and is important for evaluating datasets that have temporal coverage. There is a corresponding spatialResolution property that is equally important if you're evaluating spatial data.

dr-shorthair commented 5 years ago

@smrgeoinfo yes - I think we need to pair

And, while I'm a little wary of treading too far down a path that should be managed through a geospatial profile, since we already have

(and stop there).

andrea-perego commented 5 years ago

For spatial / temporal resolution, see UC15, which describes the general context and provides the relevant references.

These topics were discussed by the SDW WG, and then with the DWBP WG (in particular, with @aisaac and @riccardoAlbertoni ), which led to a proposal on how to specify it by using DQV.

The proposal is included as an example (focussing on spatial resolution only) in DQV, §6.13 (Express dataset precision and accuracy), which was in turn re-used into SDW's Best Practice 14 (Describe the positional accuracy of spatial data).

We should therefore re-use and consolidate that approach.

About consolidation, I summarised what I see as issues to be addressed in the context of the possible revisions to GeoDCAT-AP (see https://github.com/semiceu/geodcat-ap/issues/3).

For our convenience, I copy-paste below the relevant text from https://github.com/semiceu/geodcat-ap/issues/3:

Basically, DQV models this information as observations / measurements of a given quality metric (which corresponds to a given type of resolution).

[...]

[Adopting] This [solution] would however require the definition of two groups of individuals:

  1. Those corresponding to the different types of resolution (denoting a quality metric).
  2. Those corresponding to each of the different levels of resolution (denoting the measurement of a specific quality metric).

As far as the first group is concerned (i.e., the different types of resolution), these individuals can be defined in DQV as follows:

:SpatialResolutionAsEquivalentScale a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as equivalent scale, by using a representative fraction (e.g., 1:1,000, 1:1,000,000)."@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .

:SpatialResolutionAsDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .

This initial list can be further extended. E.g.:

:SpatialResolutionAsHorizontalGroundDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as horizontal ground distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .

:SpatialResolutionAsVerticalDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as vertical distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .

:SpatialResolutionAsAngularDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as angular distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .    

The question is in which space such individuals should be defined [...].

The definition of individuals in the second group is however more problematic, since the level of resolution and unit of measurement are arbitrary (1:1000, 1:100, 1m, 1km, 100m, 10 decimal degrees, etc.).

Possible options include the following ones:

  1. Define only the individuals corresponding to the types of spatial / temporal resolution, whereas the individuals expressing the actual resolution will be defined at the data level. This solution is not optimal, since it will result in multiple definitions of the same individuals.
  2. Define individuals only for some levels of resolution and units of measurement - e.g., the most common ones. This solution may address the majority of (but not all) the cases.
  3. Set up a URI space supporting arbitrary levels of resolution and units of measurement. This register will dynamically generate the corresponding individuals based on information included in their URI.

An example of the last option, including also a proposal for how these individuals could be defined, is available at:

http://geodcat-ap.semic.eu/id/resolution/

dr-shorthair commented 5 years ago

I agree that DQV is competent to satisfy the requirement, as shown in the examples. However, I'm not sure it is optimal for meeting it in the DCAT context.

For example, the examples and the summary above present multiple kinds of 'spatial resolution', which may be important for sophisticated users. But pushing the basic case into this structure, and then depending on a subsidiary vocabulary for labels like 'SpatialResolutionAsDistance', adds two additional layers for concepts that are widely relevant and can be easily explained (and also note the dependency on SDMX as well ...).

Access to a single summary statistic for each would help a lot in the initial discovery phase.
Interoperability is almost always helped by limiting the options.

My proposition (above) is that for DCAT to work better for a large number of datasets, two statistics might be worth 'promoting' to be first-class properties for datasets, i.e. corresponding to:

makxdekkers commented 5 years ago

@dr-shorthair It would indeed be good if there was a simple way to expose resolutions. There is in any case a need to express both value and unit, so for spatial resolution the range would be (something like) schema:Distance, and for temporal resolution (something like) schema:Duration. Unfortunately, DCMI only has a class dct:SizeOrDuration, but not separate classes for Size and Duration. Should we define classes dcat:Distance and dcat:Duration?

andrea-perego commented 5 years ago

@dr-shorthair , I also agree that we need to address first the simplest use cases - and actually the reasoning in https://github.com/SEMICeu/GeoDCAT-AP/issues/3 was along those lines (the first example was about the two typical ways of expressing spatial resolution: distance and equivalent scale).

As @makxdekkers says, I see the main issue in the fact that we need to express both value and unit of measurement. However we do it, we are unlikely to end up with something simpler than the DQV approach, unless we fold all these semantics into one single term and allow the use of just one unit of measurement. E.g., by using properties like:

or

(or something along those lines).

smrgeoinfo commented 5 years ago

One issue with dqv is that in some engineering situations, resolution and precision are different. Is there a problem with using schema:Distance as the value for SpatialResolutionAsDistance, and schema:Duration for TemporalResolutionAsDuration?

andrea-perego commented 5 years ago

@smrgeoinfo wrote:

One issue with dqv is that in some engineering situations, resolution and precision are different.

Yes, the wording of the relevant section in DQV does not make this distinction, but the formal definition of the resolution in the examples does not bind the notion of resolution with the one of precision.

Is there a problem with using schema:Distance as the value for SpatialResolutionAsDistance, and schema:Duration for TemporalResolutionAsDuration?

Maybe schema:Duration can work, as it is using a standard syntax encoding, but schema:Distance uses a literal where value and a code for unit of measurement are separated by a space. Besides the problem of ensuring that codes for units of measurement are used consistently, this value is not machine-actionable. E.g., I won't be able to make a query to get the datasets using a spatial resolution with a distance less than 100 m.
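The querying problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog records, the `UNIT_TO_METRES` table and both helper functions are hypothetical, and the "literal" field merely mimics a schema:Distance-style string of the form "<number> <unit code>". Filtering on a numeric value is a plain comparison, whereas the string form first has to be split, matched against a unit table and converted.

```python
from decimal import Decimal

# Hypothetical catalog records: dataset id -> spatial resolution.
# "literal" mimics a schema:Distance-style string ("<number> <unit code>");
# "metres" mimics a numeric value with the unit fixed to metres.
CATALOG = {
    "ds-a": {"literal": "30 m", "metres": Decimal("30")},
    "ds-b": {"literal": "0.5 km", "metres": Decimal("500")},
    "ds-c": {"literal": "250 FT", "metres": Decimal("76.2")},
}

# Assumed conversion table; any unit code missing here cannot be compared.
UNIT_TO_METRES = {
    "m": Decimal("1"),
    "km": Decimal("1000"),
    "ft": Decimal("0.3048"),
}


def finer_than_numeric(catalog, limit_m):
    """Select datasets by direct numeric comparison."""
    return sorted(k for k, v in catalog.items() if v["metres"] < limit_m)


def finer_than_literal(catalog, limit_m):
    """Select datasets by parsing and converting the string literal first."""
    hits = []
    for key, record in catalog.items():
        amount, unit = record["literal"].split()
        factor = UNIT_TO_METRES[unit.lower()]  # KeyError for unknown unit codes
        if Decimal(amount) * factor < limit_m:
            hits.append(key)
    return sorted(hits)
```

Both functions select the same datasets for a 100 m limit in this sample, but the literal-based one only works as long as every unit code appears, consistently spelled, in the assumed table.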

Besides this, IMO, re-using Schema.org properties may lead to the issues mentioned in https://github.com/w3c/dxwg/issues/85#issuecomment-457961545 (in that case in relation to schema:startDate and schema:endDate).

dr-shorthair commented 5 years ago

@andrea-perego yes this is a bit of a perma-issue. There are too many representations of 'measure' or 'quantity' already, but none have achieved universal acceptance. Furthermore, most come with a lot of baggage (or at least are just one tiny part of some huge vocabulary, the rest of which we have little interest in in this context). That is the problem with your original DQV proposal: it makes the simple case hard.

So, taking a leaf out of Randall Munroe's book, I suggest crashing through and specifying this as the range of both dcat:temporalResolution and dcat:spatialResolution:

dcat:Measure a owl:Class . 
dcat:unitOfMeasure a rdf:Property ;
    rdfs:domain dcat:Measure .
dcat:amount a owl:DatatypeProperty ;
    rdfs:domain dcat:Measure ;
    rdfs:range xsd:decimal .

Which would mean that an instance would look like


<> a dcat:Dataset ;
    ...
    dcat:temporalResolution [
        a dcat:Measure ;
        dcat:amount 15.0 ;
        dcat:unitOfMeasure <http://www.w3.org/2006/time#unitMinute> ;
    ] ;
    dcat:spatialResolution [
        a dcat:Measure ;
        dcat:amount 30.0 ;
        dcat:unitOfMeasure <http://qudt.org/vocab/unit/M> ;
    ] ;
    ...
.

makxdekkers commented 5 years ago

@dr-shorthair While I do like the approach of providing a 'simple' solution for 'simple' cases, I do feel a bit uneasy about replicating something that is already there, i.e. the more 'fundamental' solution in DQV. If we promote this 'simple' solution, 'simple' cases -- using the DCAT-specific solution -- are not going to be interoperable with more 'complex' cases using a DQV-based solution. One could argue that by promoting a DCAT-specific approach, we are discouraging people from using a DQV-based approach and thus only catering for 'simple' cases to be handled by DCAT.

dr-shorthair commented 5 years ago

Yeah. On the one hand, I'm usually one of the first to advocate strongly for re-use of existing solutions, particularly if they are from the W3C stable and have clearly been designed to integrate. On the other, I was somewhat put off by the complexity that is introduced as a further controlled vocabulary is required for the property semantics. I understand why DQV does it that way, to remain scalable and general. But we need to be sure that we want this to be reflected into DCAT. Furthermore, as has been noted before, DQV is not a Rec, so officially it cannot be cited normatively ;-(

Of course, all of these spatial and temporal properties (including the classic DCT ones) have non-simple values, so the complexity just re-appears a layer down anyway.

However, I think the mappings to DQV can almost certainly be formally expressed using OWL Restrictions and property-chain-axioms (e.g. see mappings from DCT to PROV here: https://github.com/w3c/dxwg/blob/gh-pages/dcat/rdf/dcat-prov.ttl#L63 ) so I'm not sure the interoperability argument made by @makxdekkers is strictly true.

andrea-perego commented 5 years ago

@dr-shorthair , working towards a simple solution:

Following up from @makxdekkers 's and @smrgeoinfo 's comment on schema:Duration, cannot we make dcat:temporalResolution a datatype property, with range xsd:duration?

Re-using your example, this would be something like:

<> a dcat:Dataset ;
    ...
    dcat:temporalResolution "PT15M"^^xsd:duration ;
    ...
.
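As a hedged illustration of why a plain xsd:duration literal is convenient for consumers, the sketch below parses values such as "PT15M" into a `timedelta` so that resolutions can be compared. It is a minimal sketch, not a full xsd:duration parser: the function names are hypothetical, negative durations are not handled, and year/month components are deliberately rejected because they have no fixed length.

```python
import re
from datetime import timedelta

# Accepts only the day/time components of a duration (e.g. "PT15M", "P1DT12H").
# Years and months are rejected on purpose: they have no fixed length.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_day_time_duration(lexical):
    """Convert a day/time duration string into a timedelta."""
    match = _DURATION.match(lexical)
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"unsupported duration: {lexical!r}")
    parts = {name: float(value)
             for name, value in match.groupdict().items() if value}
    return timedelta(**parts)


def at_least_as_fine(resolution, limit):
    """True if `resolution` is the same as or finer (shorter) than `limit`."""
    return parse_day_time_duration(resolution) <= parse_day_time_duration(limit)
```

For example, `at_least_as_fine("PT15M", "PT1H")` holds, since a 15-minute spacing is finer than an hourly one.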

Unfortunately, the same cannot be done for spatial resolution.

dr-shorthair commented 5 years ago

Good point. Temporal resolution was the thing that triggered this discussion, and it is more mainstream - one dimension is so much easier than two or three.

Spatial resolution (as distance) is still relatively simple conceptually but does need an explicit UOM. If only XSD had a 'measure' type (and every other programming language for that matter ... computer-science fail IMHO)

riccardoAlbertoni commented 5 years ago

@dr-shorthair wrote:

... dcat:spatialResolution [ a dcat:Measure ; dcat:amount 30.0 ; dcat:unitOfMeasure <http://qudt.org/vocab/unit/M> ; ] ;

I am not very convinced about the need to mint a new property for dcat:unitOfMeasure.

sdmx-attribute:unitMeasure is widely used, W3C recommendations such as RDF Data Cube use it, and I am concerned about introducing new patterns when there is one which is more or less well-accepted.

I see pros and cons in having both approaches: DQV/RDF Data Cube style and the DCAT properties. If we go for defining new dcat properties, I guess that we should anyway explicitly refer to the SDW best practice, which reuses DQV/RDF Data Cube for the more general cases.

dr-shorthair commented 5 years ago

Mind you, xsd:duration is not an OWL built-in https://www.w3.org/TR/owl2-quick-reference/#Built-in_Datatypes . So I'm thinking perhaps to leave the range open, but recommend use of xsd:duration?

dr-shorthair commented 5 years ago

See proposal for dcat:temporalResolution in branch https://github.com/w3c/dxwg/tree/dcat-issue84-tres-simon -

dr-shorthair commented 5 years ago

@riccardoAlbertoni I understand, but I am reluctant to introduce a new namespace (and a big one too!) in order to access just a single element. Your comment about W3C RDF Data Cube is correct, but it is unsurprising that it uses elements from SDMX since it was explicitly based on SDMX. I'd want to see use of sdmx-attribute:unitMeasure outside the data cube context to be convinced of the proposition that it is 'widely used'.

I did bring up the question of which existing vocabs we should have dependencies to in #111 but the discussion wandered off into PROV vs FOAF.

dr-shorthair commented 5 years ago

See proposal for dcat:spatialResolution in branch https://github.com/w3c/dxwg/tree/dcat-issue84-sres-simon -

kcoyle commented 5 years ago

Once again I wonder about the reluctance "to introduce a new namespace (and a big one too!) in order to access just a single element." Can someone explain why a full namespace must be imported for the use of a single element? This appears to be a processing rather than a vocabulary issue, and I am reluctant to proliferate properties that have the same meaning when some are already available.

dr-shorthair commented 5 years ago

Thanks Karen - my concern is that we have our eyes open about the risks and benefits of re-using elements from other vocabularies, and the granularity of re-use. #111 is the place to have a general discussion of this topic.

For the range of the proposed dcat:spatialResolution nothing would make me happier than to be shown a solution that saves us from having to bloat DCAT with new data-types (s.l.) like dcat:Measure, but I can't find a suitable class in any of the well-governed vocabularies in the W3C orbit. If anyone has a candidate please speak up! Failing that, we need a structure for scaled quantities, so a new class in DCAT is the fall-back solution. I'll be fine with using sdmx-attribute:unitMeasure for the scale if the team prefers to lean that way, but I thought putting it all in one namespace was a little cleaner and saves implementers from having to consult yet another whole spec for the definition of just one RDF term ...

Alternatively, we could just make it an owl:DatatypeProperty called dcat:spatialResolutionInMetres with a range xsd:decimal. I'm kinda leaning that direction given these other complications.

dr-shorthair commented 5 years ago

(shame there isn't an ISO standard for 'Length' complementing what ISO 8601 did for 'Time')

dr-shorthair commented 5 years ago

See revised proposal for dcat:spatialResolutionM in branch https://github.com/w3c/dxwg/tree/dcat-issue84-sres-simon - simplified with units of measure fixed to metres:

andrea-perego commented 5 years ago

+1 from me.

andrea-perego commented 5 years ago

I wonder whether we could consider adding properties for spatial resolutions not expressed as distance, namely, as equivalent scale - which is the other most common way of specifying spatial resolution.

dr-shorthair commented 5 years ago

I'm reluctant to provide a second option at this level. As soon as you have more than one alternative, you begin to lose interoperability. I understand that '1:50,000' etc is the cartographic tradition, and geographers routinely infer resolution from this ('what is the distance on the ground of the thickness of a pencil line on the map?'). But a length measure is more direct and less ambiguous, and also applies to gridded data.

While more detail and options can be given using the DQV structures shown above, I really think we should add only one option in the DCAT namespace.

dr-shorthair commented 5 years ago

@riccardoAlbertoni Your contributions in Chapter 8 show some patterns for use of DQV for quality information.

Are you aware of a 'standard' way to provide basic dataset statistics using DQV or any other RDF vocabulary? e.g. minimum/maximum(/average) values for specified dimensions? I'm not seeing anything obvious in DQV or QB :-( I guess it might be a dqv:Metric but I wonder if you could provide guidance on how this might look?

agbeltran commented 5 years ago

I was looking for the same thing, and the relevant bit that I found is this DQV section on statistics, which relies on an extension of VoID and is thus too oriented towards RDF datasets.

riccardoAlbertoni commented 5 years ago

Are you aware of a 'standard' way to provide basic dataset statistics using DQV or any other RDF vocabulary? e.g. minimum/maximum(/average) values for specified dimensions? I'm not seeing anything obvious in DQV or QB :-( I guess it might be a dqv:Metric but I wonder if you could provide guidance on how this might look?

I am not aware of anything except the examples mentioned by @agbeltran for statistics oriented to RDF datasets; perhaps @makxdekkers knows more?

Anyway, I guess there is more than one way to do it. For example, using RDF Data Cube you can define your own qb:DataStructureDefinition.

If you want to describe statistics of datasets such as Average, Max, Min for the "fields" in the dataset, you might define a qb:DataStructureDefinition whose dimensions/components include

If you provide statistics as quality indicators you can think of using DQV qualityMeasurement, for example defining a new dqv:Dimension for each pair of field and operator.
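To make the "one entry per pair of field and operator" idea concrete, here is a minimal Python sketch. It is purely illustrative: the `OPERATORS` table, the `summarize` helper, the record layout and the sample values are all hypothetical, and nothing here is defined by DQV or RDF Data Cube. Each resulting record is the kind of (field, operator, value) triple that would then be expressed with whichever vocabulary is chosen.

```python
from statistics import mean

# Hypothetical set of summary operators applied to every field.
OPERATORS = {"min": min, "max": max, "mean": mean}


def summarize(table):
    """Return one record per (field, operator) pair."""
    records = []
    for field, values in table.items():
        for name, func in OPERATORS.items():
            records.append(
                {"field": field, "operator": name, "value": func(values)}
            )
    return records


# Made-up sample data: column name -> observed values.
SAMPLE = {"temperature": [10.0, 15.0, 20.0], "depth": [2.0, 4.0]}
```

Calling `summarize(SAMPLE)` yields six records, three for each of the two fields.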

andrea-perego commented 5 years ago

@dr-shorthair wrote:

I'm reluctant to provide a second option at this level. As soon as you have more than one alternative, you begin to lose interoperability. I understand that '1:50,000' etc is the cartographic tradition, and geographers routinely infer resolution from this ('what is the distance on the ground of the thickness of a pencil line on the map?'). But a length measure is more direct and less ambiguous, and also applies to gridded data.

While more detail and options can be given using the DQV structures shown above, I really think we should add only one option in the DCAT namespace.

I would also prefer to have one solution that fits all use cases, but we should also recognise that these two ways of expressing spatial resolution (i.e., distance and equivalent scale) are not comparable or convertible. So, IMO, the use of two different properties is more than acceptable.
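A small worked example of the "pencil line" inference shows why the two forms do not convert cleanly: the ground distance derived from a scale depends on an extra assumption, the width of the drawn line. The helper name and the line widths below are hypothetical and chosen only for illustration.

```python
from decimal import Decimal


def ground_distance_m(scale_denominator, line_width_mm):
    """Ground distance in metres covered by a drawn line of the given width.

    `line_width_mm` is an assumption supplied by the caller; the scale
    alone does not determine it.
    """
    return Decimal(scale_denominator) * Decimal(line_width_mm) / Decimal(1000)
```

At 1:50,000, an assumed 0.5 mm line gives `ground_distance_m(50000, "0.5")`, i.e. 25 m, while an assumed 0.25 mm line gives 12.5 m for the same map.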

BTW, my request is based also on an explicit requirement from GeoDCAT-AP - which is defining mappings from ISO 19115:2003, where spatial resolution is expressed either as distance or equivalent scale.

andrea-perego commented 5 years ago

Re-thinking about this, probably we should consider the option of specifying spatial resolution in 2 steps (which was one of the options discussed earlier):

a:Dataset a dcat:Dataset ;
  dcat:spatialResolution [
    dcat:distanceInMeters "15"^^xsd:decimal
  ] .

One of the advantages is that it would be easier for people to reuse the main pattern dcat:spatialResolution / "specific property" in case they need to express this information in other ways (e.g., as per ISO 19115-1:2014, which includes also resolution as horizontal ground distance, vertical distance and angular distance).

davebrowning commented 5 years ago

@andrea-perego - do you see this issue as critical or can this be moved to the backlog?

andrea-perego commented 5 years ago

Partially critical (for the reasons I explained) but it can be moved to the backlog, provided that it will be possible to come back to this after DCAT v1.1 is out and possibly address it in the v1.2 release.

andrea-perego commented 3 years ago

I created a new issue to work on the discussion points still open:

https://github.com/w3c/dxwg/issues/1266

Closing this one.