w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

How to specify the number of records in a dataset #1571

Open nichtich opened 1 year ago

nichtich commented 1 year ago

I could not find any information on how to express the number of records in a dataset (also known as its size). There was a deprecated property dcat:size, a subproperty of dcterms:extent, so my guess would be to just use dcterms:extent (with any kind of value: number, string, blank node, URL...) or a more specific property from another vocabulary (e.g. the statistics properties from the VoID vocabulary). The general size of a dataset in terms of conceptual entities (records, concepts, resources, objects...) is fundamental information, so DCAT should at least mention the topic, explain why there is no strictly defined property, and refer to dcterms:extent.

akuckartz commented 1 year ago

I know that this is not the same, but there is dcat:byteSize: https://www.w3.org/TR/vocab-dcat-2/#Property:distribution_size
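
A minimal Turtle sketch of its use (the distribution URI is a made-up example; DCAT 2 types the value as xsd:decimal):

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/dataset/1/dist/csv>
        a              dcat:Distribution ;
        dcat:byteSize  "5120000"^^xsd:decimal .   # total size in bytes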

bertvannuffelen commented 1 year ago

Just out of curiosity, @nichtich, could you provide a use case where users depend on an exact value for the notion of size? I hear this request sometimes, but I have not encountered a user who actually uses it in their data selection process or data processing.

For me there are some reasons why size is not included.

With the introduction of APIs, the need for size becomes very limited. For APIs, size becomes time-dependent, and since most data portals assume that metadata changes slowly (once a week is a quick pace ;-) ), the property loses its value. (If the data is only harvested once a week, then the importance of accuracy is reduced.)

I see it featuring more in file downloads, but even then I am not so sure there is a need to be exact. E.g. as @akuckartz mentions, there is the byte size for a distribution. But normally users do not care about the exact number: they care more about the time it takes to download.
In practice the byte size does not feature in a human decision process. Another use case could be the guarantee that the file has been completely downloaded, but a checksum is a better choice to build an integrity check upon.

Although the need for expressing size feels very natural, in practice I seldom see publishers providing it, because of the high effort of keeping track of sizes (both human and technical investment). Therefore I am curious about the use case that would motivate publishers to provide size information.

kcoyle commented 1 year ago

To my mind, number of records is a human-facing bit of info that gives a person an idea of the scope of the information prior to downloading. Number of bytes is reminiscent of those large software downloads in times past when you needed to know that the download had completed. However, for very large files it is useful to know that they ARE very large - which today means multiple gigabytes. For smaller files I doubt if byte size matters.

Therefore, both measures are needed but are useful under specific circumstances.

nichtich commented 1 year ago

The number of records gives information about the content. It is useful for judging and comparing both different datasets of the same type (with the same method of counting records) and changes of one dataset over time. See http://nomisma.org/datasets for an example of a list of datasets, each with its number of records. This example happens to use dcterms:hasPart with a blank node and void:entities to give the number, e.g.:

@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix dcmitype: <http://purl.org/dc/dcmitype/> .
@prefix void:     <http://rdfs.org/ns/void#> .
@prefix nmo:      <http://nomisma.org/ontology#> .

<http://numismatics.org/pco/>
        rdf:type             void:Dataset ;
        dcterms:hasPart      [ rdf:type       dcmitype:Collection ;
                               dcterms:type   nmo:TypeSeriesItem ;
                               void:entities  3650
                             ] ;
        dcterms:hasPart      [ rdf:type       dcmitype:Collection ;
                               dcterms:type   nmo:Monogram ;
                               void:entities  309
                             ] .

I am not sure whether this is best practice, or whether it is applicable to other kinds of datasets and units, for instance the number of files.

By the way, DataCite has a free-text size property that maps to dcterms:extent. According to my understanding of http://dx.doi.org/10.6084/m9.figshare.2075356, the example above would be:

<http://numismatics.org/pco/>
  rdf:type void:Dataset ;
  dcterms:extent [
    rdf:type dcterms:SizeOrDuration ;
    rdf:value "3650 type series items"
  ] ;
  dcterms:extent [
    rdf:type dcterms:SizeOrDuration ;
    rdf:value "309 monorams"
  ] .

or (what I would prefer)

<http://numismatics.org/pco/>
  rdf:type void:Dataset ;
  dcterms:extent "3650 type series items";
  dcterms:extent "309 monograms" .

I also found that the Ontology of Units of Measure (OM) could support this:

<http://numismatics.org/pco/>
  rdf:type void:Dataset ;
  dcterms:extent [
    rdf:type om:Measure ;
    om:hasNumericalValue 3650
  ] ;
  dcterms:extent [
    rdf:type om:Measure ;
    om:hasNumericalValue 309 
  ] .

Last but not least, Wikidata uses property P4876 to specify the number of records; see this list of databases with their number of records.
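
A sketch of how that property appears in RDF (the Q-identifier is a placeholder, not a real item; Wikidata's truthy statements use the wdt: namespace):

@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.wikidata.org/entity/Q00000001>    # placeholder database item
        wdt:P4876  "3650"^^xsd:decimal .      # number of records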

dr-shorthair commented 1 year ago

@bertvannuffelen interestingly, in a related topic, while a little work was put into adding DatasetSeries in DCAT3, no-one suggested the need for a count of items in a series.

agreiner commented 1 year ago

I agree that both file size and dataset size are useful. In the world of high-performance computing, unfortunately, the times of needing to know whether a download completed haven’t yet receded into the past. A few gigabytes are not large in this realm. File movements at the terabyte to hundreds of terabytes level are common, so special tools are needed, and care must be taken to maximize throughput without causing trouble for others on the network. I often field queries from users about how to go about moving a dataset from one storage tier to another or from one site to another. So, size definitely can matter and should be expressible. Another potentially important piece of the puzzle is the number of inodes (files or directories) involved when the dataset is unpacked, since some storage can be finicky about storing or reading from many small files. The number of rows in a data table can also matter to whether it can be fit into a certain type of database or can be manipulated with certain analysis tools. Often the number of rows maps in a general way to the usefulness of a scientific dataset, though depending on the dataset, its size may be better expressed in more domain-specific terms, like degrees of the sky for astronomical data, or spatial resolution for climate data.

agreiner commented 1 year ago

Regarding series, I think the expectation is that a series will grow, so expressing a count of items in the series becomes meaningless very quickly.

bertvannuffelen commented 1 year ago

@all, somewhat as expected, there are very different, yet specific, expectations of size.

I observe the following:

  1. the number of entities in a distribution (e.g. coins)
  2. the number of data structure elements in a distribution (e.g. rows)
  3. a qualification of the number (e.g. small, medium, ...)
  4. the effect on the storage infrastructure (e.g. inodes)

To get a harmonised view the size will be a complex datatype, having properties:

  • value: the number
  • unit: what is counted
  • method: the method of counting

I see the following challenges:

From this I see size featuring more in a specific profile of DCAT for a specific usage context. I think the diversity makes it hard to come to a consolidated approach. I also believe that my suggestion of an extended datatype will not be adopted, because it will be perceived as too complicated. But introducing a property with a value space that is ambiguous to interpret (e.g. is "100" 100 coins, 100 records, or 100 TB?) is not a good idea either. Therefore it is better that each ecosystem defines the size it needs in its own namespace. That gets the best of both worlds: the ecosystem can express it, and the semantics are clear. And if the profile is well published, then anyone can interpret it.
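
A sketch of that idea, using a hypothetical numismatics profile namespace nmprof: (not an existing vocabulary):

@prefix nmprof: <http://example.org/numismatics-profile#> .

<http://numismatics.org/pco/>
        nmprof:numberOfTypeSeriesItems  3650 ;   # unit fixed by the property definition
        nmprof:numberOfMonograms        309 .

Because the unit is part of each property's definition in the profile, a bare number like 3650 cannot be misread as bytes or rows.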

If the semantics of size are left to the ecosystem to define in a profile, then my opinion is not to include the property in DCAT, but to push it immediately to the profile. Introducing "abstract properties" that should not be used is not very helpful.

@agreiner introduces an interesting notion, "usefulness of a dataset". That, I think, is the key of the story here. Size could play a role in such an assessment, but that is very user and use case specific. I might be biased, but I think size is overrated in this assessment. Other properties will probably play a more important role (as size is not provided often, cf. the challenges I listed).

I think it would be good to provide evidence from existing data portals and communities where size is a critical and well-maintained property, before introducing a property.

nichtich commented 1 year ago

Thanks @bertvannuffelen for the summary. Size indeed depends a lot on context.

To get a harmonised view the size will be a complex datatype, having properties:

  • value: the number
  • unit: what is counted
  • method: the method of counting

This goes beyond the original request. Just a cardinal number and a unit saying what is being counted would be enough. There are several ways to express this in RDF (a Turtle sketch of all four follows the list):

  1. a property with a custom datatype: rarely used and problematic, because custom datatypes need to be mapped to XSD number types
  2. a unit-specific property with a number as value: easy to use, but a property has to be defined for each unit
  3. a generic size property with a string as value: easy for humans, of little use for computation
  4. a generic size property with a blank node object carrying number, unit, and optionally more details (date, method...): most flexible, but blank nodes are unpleasant, and at least three properties need to be defined
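
A minimal Turtle sketch of the four options, using a hypothetical ex: namespace for everything that no existing vocabulary defines:

@prefix ex:      <http://example.org/ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://numismatics.org/pco/>
        # 1. custom datatype: the datatype IRI carries the unit
        ex:size  "3650"^^ex:typeSeriesItems ;
        # 2. unit-specific property with a plain number
        ex:numberOfTypeSeriesItems  3650 ;
        # 3. generic size property with a string value
        dcterms:extent  "3650 type series items" ;
        # 4. generic size property with a blank node carrying number, unit and method
        dcterms:extent  [ ex:value   3650 ;
                          ex:unit    ex:TypeSeriesItem ;
                          ex:method  ex:DeduplicatedCount
                        ] .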

There already are unit-specific properties such as dcat:byteSize, void:triples, void:entities, wd:P4876... In my opinion some units are frequent and generic enough to justify a DCAT property, e.g. the number of files or the number of records. At the very least, DCAT should mention the VoID vocabulary for specifying the size of RDF datasets. For units not supported or mentioned by DCAT, the specification should recommend dcterms:extent and explain how to specify number and unit.
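
A minimal sketch of the VoID route for an RDF dataset (the figures are illustrative, not actual counts):

@prefix void: <http://rdfs.org/ns/void#> .

<http://numismatics.org/pco/>
        a              void:Dataset ;
        void:triples   120000 ;   # number of RDF triples (illustrative)
        void:entities  3959 .     # number of described entities (illustrative)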

I think it would be good to provide evidence from existing data portals and communities where size is a critical

Numbers beyond the number of bytes are common; just browse around in any data catalog. I looked at the first topic I thought of (astronomy) and found two examples within a minute:

  • https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq - number of landmarks (a very domain-specific unit)
  • https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz - number of rows and columns (a very generic unit)

Additional examples are listed in my recent comment.

bertvannuffelen commented 1 year ago

@nichtich the two examples are interesting:

example a)

https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq - number of landmarks (very domain-specific unit)

The "size information" is actually part of the description and not an independent number. Also from the description I am not sure if the dataset publisher would like to share a single number:

- 10,433 detected landmarks
- 62,598 augmented landmarks
- 73,031 total landmarks.

But I believe the publisher wanted to explain the nature of the data, and, by accident, the numbers fit into the textual description.

Observe that this also ties the description of the dataset to its size. That means the intent is for this dataset to be very static in its evolution.

example b)

https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz - number of rows and columns (very generic unit)

The portal allows exporting it as CSV, RDF, or XML. So the size here is not a metadata value but a service offering of the portal, available when it can serve the data directly. I assume it is calculated dynamically (or on upload by the publisher).
That means you get a size indication for CSV but not for RDF. If I as a dataset publisher offer two distributions and have to provide a size only for the CSV and not for the RDF offering, what does that mean for my RDF users? Do I as publisher provide a lower-quality service or an equal-quality one?

These are important questions, as in the end publishers should be instructed to maintain a common metadata quality for all entities they share. If a publisher added a format indication for one distribution but not for another, this would usually be considered problematic. (This relates to the challenges I mentioned.)

In general, statements like the following (with _:size standing in for some size property) should be clear in meaning, without additional explanation.

<https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq> _:size "10,433".
<https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq> _:size "62,598".

<https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz> _:size "15.4K" .

P.S. I clicked around randomly on data.europa.eu and could not find any examples. Maybe bad luck, but it also indicates that size is not provided often. That is the reason I asked for example portals where size is an important and critical feature for the functioning of that data community. In the NASA data portal the size provisioning is ad hoc and probably depends on the dataset owner. I would like to see, for instance, data portals that offer different access patterns or payment requirements based on size, etc. At this moment the examples are only those cases where either a) a publisher did some editorial work or b) the data is available in a data warehouse that calculates some number. I really would like to discuss more inspiring cases than these, because those use cases will drive publishers to provide more precise and higher-quality metadata.

But I see where you are heading: your request is to "officially" adopt dct:extent to document the size of a resource.
As I wrote before, adopting such an abstract, wide property is not the challenge. For dct:extent this is even implicitly the case, as I hope the DXWG first adopts terms from dcterms and only turns to another namespace when nothing fit for purpose is found there. I suggest that any profile builder apply that approach too.

The challenge is the request to harmonise the value space in some way.
As the examples illustrate, there is no commonality yet. Thus the value space stays open, and the decisions are to be made by the implementing profile. If adopting this reasoning as a usage note helps the community, I do not object to adding it to the DCAT specification. It will, however, not spare implementers the work of defining their own profile rules. And I have the feeling that is what you are aiming for.

riccardoAlbertoni commented 1 year ago

I have added the label "future work", as the DXWG group voted for the CR publication, and the process does not allow including new features at this stage. Future DCAT standardization processes can consider this issue.