nichtich opened this issue 1 year ago
I know that this is not the same, but there is dcat:byteSize: https://www.w3.org/TR/vocab-dcat-2/#Property:distribution_size
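For reference, a minimal sketch of how dcat:byteSize attaches to a distribution (URI and value are illustrative; in DCAT 2 the range of dcat:byteSize is xsd:decimal):

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/dataset/1/csv>
    rdf:type dcat:Distribution ;
    dcat:byteSize "5120000"^^xsd:decimal .   # illustrative value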
Just out of curiosity, @nichtich could you provide a use case where users depend on an exact value for the notion of size? I hear this request sometimes, but I have not encountered a user who uses it in their data selection or data processing.
For me there are several reasons why size is not included.
With the introduction of APIs the need for size becomes very limited. For APIs size becomes time-dependent, and since most data portals assume that metadata changes slowly (once a week is a quick pace ;-) ), the property loses its value. (If the data is only harvested once a week, then the importance of accuracy is reduced.)
I see it featuring more in file downloads, but even then I am not so sure there is a need to be exact.
E.g., as @akuckartz mentions, there is dcat:byteSize for a distribution. But normally users do not care about the exact number; they care more about the time it takes to download.
In practice, the byte size does not feature in a human decision process.
Another use case could be the guarantee that the file has been completely downloaded, but then a checksum is a better choice to build an integrity check upon.
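For completeness, one common way to express such an integrity check in metadata is the SPDX checksum pattern (used, for example, in DCAT-AP); a minimal sketch with illustrative URI and hash:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix spdx: <http://spdx.org/rdf/terms#> .

<http://example.org/dataset/1/csv>
    rdf:type dcat:Distribution ;
    spdx:checksum [
        rdf:type spdx:Checksum ;
        spdx:algorithm spdx:checksumAlgorithm_sha256 ;
        spdx:checksumValue "e3b0c44298fc1c14..."   # illustrative, truncated; SPDX expects the full hex digest
    ] .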
Although the need to express size feels very natural, in practice I seldom see publishers providing it because of the high effort of keeping track of sizes (both human and technical investment). Therefore I am curious about the use case that would motivate publishers to provide size information.
To my mind, number of records is a human-facing bit of info that gives a person an idea of the scope of the information prior to downloading. Number of bytes is reminiscent of those large software downloads in times past when you needed to know that the download had completed. However, for very large files it is useful to know that they ARE very large - which today means multiple gigabytes. For smaller files I doubt if byte size matters.
Therefore, both measures are needed but are useful under specific circumstances.
The number of records gives information about content. It is useful to judge and compare both different datasets of the same type (same method to count records) and the change of a dataset over time. See http://nomisma.org/datasets for an example of a list of datasets with the number of records of each. This example happens to use dcterms:hasPart with a blank node and void:entities to give the number, e.g.:
<http://numismatics.org/pco/>
    rdf:type void:Dataset ;
    dcterms:hasPart [
        rdf:type dcmitype:Collection ;
        dcterms:type nmo:TypeSeriesItem ;
        void:entities 3650
    ] ;
    dcterms:hasPart [
        rdf:type dcmitype:Collection ;
        dcterms:type nmo:Monogram ;
        void:entities 309
    ] .
I am not sure whether this is best practice and applicable to other kinds of datasets, for instance number of files.
By the way, DataCite has a free-text property that maps to dcterms:extent. According to my understanding of http://dx.doi.org/10.6084/m9.figshare.2075356, the example above would be:
<http://numismatics.org/pco/>
    rdf:type void:Dataset ;
    dcterms:extent [
        rdf:type dcterms:SizeOrDuration ;
        rdf:value "3650 type series items"
    ] ;
    dcterms:extent [
        rdf:type dcterms:SizeOrDuration ;
        rdf:value "309 monograms"
    ] .
or (what I would prefer)
<http://numismatics.org/pco/>
    rdf:type void:Dataset ;
    dcterms:extent "3650 type series items" ;
    dcterms:extent "309 monograms" .
I also found the Ontology of units of Measure (OM) to support this:
<http://numismatics.org/pco/>
    rdf:type void:Dataset ;
    dcterms:extent [
        rdf:type om:Measure ;
        om:hasNumericalValue 3650
    ] ;
    dcterms:extent [
        rdf:type om:Measure ;
        om:hasNumericalValue 309
    ] .
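A more complete OM measure would normally also carry om:hasUnit. A sketch, using a hypothetical ex:record unit since I am not aware of an OM unit for counting records:

<http://numismatics.org/pco/>
    rdf:type void:Dataset ;
    dcterms:extent [
        rdf:type om:Measure ;
        om:hasNumericalValue 3650 ;
        om:hasUnit ex:record   # hypothetical unit, not defined by OM
    ] .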
Last but not least, Wikidata uses P4876 to specify the number of records; see this list of databases with their number of records.
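A minimal sketch of how P4876 shows up in Wikidata's RDF exports (the record count is illustrative):

@prefix wd:  <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

wd:Q36578 wdt:P4876 "9000000"^^xsd:decimal .   # Integrated Authority File (GND); value illustrative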
@bertvannuffelen interestingly, on a related topic: while a little work was put into adding DatasetSeries in DCAT 3, no-one suggested the need for a count of items in a series.
I agree that both file size and dataset size are useful. In the world of high-performance computing, unfortunately, the times of needing to know whether a download completed haven’t yet receded into the past. A few gigabytes are not large in this realm. File movements at the terabyte to hundreds of terabytes level are common, so special tools are needed, and care must be taken to maximize throughput without causing trouble for others on the network. I often field queries from users about how to go about moving a dataset from one storage tier to another or from one site to another. So, size definitely can matter and should be expressible. Another potentially important piece of the puzzle is the number of inodes (files or directories) involved when the dataset is unpacked, since some storage can be finicky about storing or reading from many small files. The number of rows in a data table can also matter to whether it can be fit into a certain type of database or can be manipulated with certain analysis tools. Often the number of rows maps in a general way to the usefulness of a scientific dataset, though depending on the dataset, its size may be better expressed in more domain-specific terms, like degrees of the sky for astronomical data, or spatial resolution for climate data.
Regarding series, I think the expectation is that a series will grow, so expressing a count of items in the series becomes meaningless very quickly.
@all, somewhat as expected, there are very different, yet specific, expectations of size.
I observe the following:
To get a harmonised view the size will be a complex datatype, having properties:
- value: the number
- unit: what is counted
- method: the method of counting
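To make the idea concrete, a sketch of such a structured size value; the ex: namespace and all its terms are hypothetical:

<http://example.org/dataset/1>
    rdf:type dcat:Dataset ;
    ex:size [
        ex:value 3650 ;                # the number
        ex:unit ex:Record ;            # what is counted
        ex:method ex:DistinctCount     # the method of counting
    ] .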
I see the following challenges:
From this I see sizes featuring more in a specific profile of DCAT for a specific usage case. I think the diversity makes it hard to come to a consolidated approach. I also believe that my suggestion of an extended datatype will not be adopted because it will be perceived as too complicated. But introducing a property with a value space that is ambiguous to interpret (e.g. is "100" 100 coins, 100 records, or 100 TB?) is also not a good idea. Therefore it is better that each ecosystem defines the size properties it needs in its own namespace. That gets the best of both worlds: the ecosystem can express it, and the semantics are clear. And if the profile is well published, then anyone can interpret it.
If the semantics of size is left to the ecosystem to define in a profile, then my opinion is to not include the property in DCAT, but to immediately push it to the profile. Introducing "abstract properties" that should not be used is not very helpful.
@agreiner introduces an interesting notion, "usefulness of a dataset". That is, I think, the key of the story here. Size could play a role in such an assessment, but that is very user- and use-case-specific. I might be biased, but I think size is overrated in this assessment. Other properties will probably play a more important role (as size is not provided often, cf. the challenges I listed).
I think it would be good to provide evidence from existing data portals and communities where size is a critical and well maintained property, before introducing a property.
Thanks @bertvannuffelen for the summary. Size indeed depends a lot on context.
To get a harmonised view the size will be a complex datatype, having properties:
- value: the number
- unit: what is counted
- method: the method of counting
This goes beyond the original request. Just a cardinal number and a unit saying what is being counted would be enough. There are several ways to express this in RDF:
There already are unit-specific properties such as dcat:byteSize, void:triples, void:entities, wd:P4876... In my opinion some units are frequent and generic enough to justify a DCAT property, e.g. number of files or number of records. At least DCAT should mention the VoID vocabulary as the way to specify the size of RDF datasets. For units not supported or mentioned by DCAT, the specification should recommend using dcterms:extent and tell how to specify number and unit.
I think it would be good to provide evidence from existing data portals and communities where size is a critical and well maintained property
Numbers beyond the number of bytes are common; just browse around in any data catalog. I just looked at the first topic I thought of (astronomy) and found two examples within a minute:
- https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq (number of landmarks)
- https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz (number of rows and columns)
Additional examples are listed in my recent comment.
@nichtich the two examples are interesting:
https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq - number of landmarks (very domain-specific unit)
The "size information" is actually part of the description and not an independent number. Also from the description I am not sure if the dataset publisher would like to share a single number:
- 10,433 detected landmarks
- 62,598 augmented landmarks
- 73,031 total landmarks.
But I believe the publisher wanted to explain the nature of the data, and by accident the numbers fitted in the textual description.
Observe that this also ties the description of a dataset to its size. That means the intent is that this dataset evolves very little.
https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz - number of rows and columns (very generic unit)
The portal allows exporting it as CSV, RDF, or XML. So the size here is not a metadata value but a service offering of the portal, in case it can offer the data directly. It is calculated dynamically, I assume (or on upload by the publisher).
That means you get some size indication for CSV but not for RDF. If I am a dataset publisher and I offer 2 distributions, and only for CSV do I have to provide a size and not for the RDF offering, what does that mean for my RDF users? Do I as publisher provide a lower-quality service or an equal-quality service?
The latter are important questions, as in the end publishers should be instructed to achieve a common metadata quality for all entities they share. If a publisher added a format indication for one distribution and not for another, this would usually be considered problematic. (This relates to the challenges I mentioned.)
In general, it should be clear what the following statements mean, without additional explanation.
<https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq> _:size "10,433" .
<https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq> _:size "62,598" .
<https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz> _:size "15.4K" .
PS: I randomly clicked around in data.europa.eu and I could not find any examples. Maybe bad luck, but that also indicates that size is not often provided. That is the reason I asked for example portals where size is an important and critical feature for the functioning of that data community. In the NASA data portal the size provisioning is ad hoc and probably depends on the dataset owner. I would like to see, for instance, data portals that offer different access patterns or payment requirements based on size, etc. At this moment the examples are only those cases where either a) a publisher did some editorial work or b) the data is available in a data warehouse that calculates some number. I really would like to discuss more inspiring cases than these, because those use cases will drive publishers to provide more precise and quality metadata.
But I see where you are heading: your request is to "officially" adopt dct:extent to document the size of a resource.
As I wrote before, adopting such an abstract, wide property is not the challenge. For dct:extent it is even implicitly the case, as I hope the DXWG first adopts terms from dcterms and, only when no fit-for-purpose term is found, from another namespace.
I suggest that any profile builder should apply that approach too.
The challenge is the request for harmonising the value space in some way.
As the examples illustrate, there is no commonality yet. Thus the value space stays open and the decisions are to be made by the implementing profile.
If adopting this reasoning as a usage note helps the community, I do not object to adding it to the DCAT specification. It will, however, not take away the work for any implementer of making its own profile rules. And I have the feeling that is what you are aiming for.
I have added the label "future work", as the DXWG voted for the CR publication, and the process does not allow including new features at this stage. Future DCAT standardization processes can consider this issue.
I could not find any information on how to express the number of records in a dataset (also known as its size). There was a deprecated property dcat:size, a subproperty of dcterms:extent, so my guess would be to just use dcterms:extent (with any kind of value: number, string, blank node, URL...) or a more specific property from another vocabulary (e.g. the statistics properties from the VoID vocabulary). The general size of a dataset in terms of conceptual entities (records, concepts, resources, objects...) is fundamental information, so DCAT should at least mention the topic, explain why there is no strictly defined property, and refer to dcterms:extent.
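For illustration, a minimal sketch of that guess with the simplest kind of value, a string combining number and unit (URI and count illustrative):

<http://example.org/dataset/1>
    rdf:type dcat:Dataset ;
    dcterms:extent "1200 records" .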