Sub-classes of StatisticalDataset

FranckCo commented 5 years ago

Decided during May 7 meeting: define what sub-classes of StatisticalDataset we want.

Example candidate: TimeSeries.

Linked to issue #6.

JALinnerud commented 3 years ago

Event History?

JALinnerud commented 3 years ago

Checking GSIM v1.2 Dataset Definition: An organized collection of data. Explanatory text: Examples of Data Sets could be observation registers, time series, longitudinal data, survey data, rectangular data sets, event-history data, tables, data tables, cubes, registers, hypercubes, and matrixes. A broader term for Data Set could be data. A narrower term for Data Set could be data element, data record, cell, field.

abrycsaba commented 3 years ago

There was a VIP project in the EU named ADMIN. This is what I found there on data classification: https://ec.europa.eu/eurostat/cros/content/statistical-data_en

In my opinion we should only take into consideration those statistical data classifications (e.g. from the point of view of aggregation or level of processing) which can later be used as parameters by an IT application handling those kinds of data. Such a parameter can trigger different kinds of tasks for different kinds of data sets in an IT application.

JALinnerud commented 3 years ago

I remember that pre-GSIM we had classifications, but while creating GSIM it was pointed out that our classifications are stricter than other classifications (their elements are mutually exclusive and complete), so we were persuaded to call the GSIM information object Statistical Classification. I am not convinced that our datasets are any different from anyone else's datasets. I do not see an advantage in creating a specialisation. Our GSIM Datasets do inherit from Identifiable Artefact, so maybe that is an essential difference? Maybe we could just use themes or domains to say that our datasets are within statistics? The same could be done for almost all GSIM information objects, so that we do not need to put 'Statistical' in front of them all. Or maybe we could use a namespace gsim: ?

zoltanvereczkei commented 3 years ago

What do our ModernStats models say about statistical dataset (or dataset in general)?

GSBPM

Does not mention data sets. Basically it doesn’t have to. It’s a process model.

GSIM

• Data Set: An organized collection of data. Examples of Data Sets could be observation registers, time series, longitudinal data, survey data, rectangular data sets, event-history data, tables, data tables, cubes, registers, hypercubes, and matrixes. A broader term for Data Set could be data. A narrower term for Data Set could be data element, data record, cell, and field.
• Unit Data Set: A collection of data that conforms to a known structure and describes aspects of one or more Units. Example: A synthetic unit record file is a collection of artificially constructed Unit Data Records, combined in a file to create a Unit Data Set. Synonyms: Micro data, unit data, synthetic unit record file.
• Dimensional Data Set: A collection of dimensional data that conforms to a known structure.
• Information Set: Organized collections of statistical content. Statistical organizations collect, process, analyse and disseminate Information Sets, which contain data (Data Sets), referential metadata (Referential Metadata Sets), or potentially other types of statistical content, which could be included in additional types of Information Set.

GAMSO

Does not mention datasets. It doesn’t have to as data sets are information objects.

CSDA

Despite the fact that we do not take CSDA into consideration there is a classification for data sets in the document, which is as follows:

• Explorative: Data that is obtained from outside sources, is usually “sampled” and is used to assess the nature, structure and quality (usability) of that data source. After the exploration, this data in most cases loses its value.
• Organizational: The true (data) assets of the organization, that are to be treated as such and must be protected and shared where possible. An important sub-type of “Organizational” is the Master Data such as statistical registers, back-bones of populations, collections of statistical units. For instance: Company register, People Register, Buildings register.
• Temporary, local: Data that is produced as an intermediate product in a statistical process and has no real value outside that process. This data usually loses its value after the process (cycle) is completed, but may have value for the next cycle as a reference. May be persisted within the process space.

Conclusion and suggestion:

Only those classifications should be taken on board which are relevant from the point of view of information management, meaning that the kinds of datasets they distinguish have to be handled in different ways (other processes, other methods, etc.). The sub-classes should also have a statistical perspective, to make it easy for the user (statistician) of the ontology to understand the sub-classes we define.

GSIM provides one kind of classification (according to structure) for data sets, which is:
• Unit data set
• Dimensional data set

This also corresponds to the breakdown into microdata and tabular (aggregated) data. We think that this breakdown provided by GSIM is a good basis. If we need a further breakdown, we first need to agree on its purpose.

• We can differentiate unit data sets by source (data collection, data transmission, other / unimode, multimode, etc.) or by the phases where the dataset is made available (corresponding to GSBPM Phases IV, V, VI, VII).
• We can classify dimensional data sets by the type of data included (Nominal, Ordinal, Discrete, Continuous), by domains, or by sensitivity (SDC perspective).

Maybe we can move forward with the currently available two sub-classes defined by GSIM. To be honest, this is more like a feeling than a well-based opinion…

FlavioRizzolo commented 3 years ago

Datasets can be classified in so many dimensions that it's really hard to come up with a comprehensive list. Summing up the postings above, some of them are:

scope: explorative, organizational, local (as per CSDA)
domain: social, economics, etc.
granularity: micro (unit), aggregate (dimensional)
sensitivity: privacy/confidentiality related (public vs. confidential? some scale from 1 to n?)

I'd add also

status: preliminary, edited/imputed, final, revised, etc.

I'm not sure how to classify nominal, ordinal, discrete, continuous because I don't know what the first two mean, and time series, longitudinal, event history either... Types of data?

ChLaaboudi commented 3 years ago

The Code Lists used in DCAT are available in EU Vocabularies. Relevant for us:

Theme (13 themes used for classifying datasets in EU and European open data portals)
Access right (Sensitivity)

A code list CL_CONF_Status is available in the SDMX Global Registry.

JALinnerud commented 3 years ago

Sometimes I struggle to find the human readable content under EU Vocabularies. A hint is to click on the blue button Browse content on the right hand side after you have chosen your vocabulary.
For dataset type that takes you to https://op.europa.eu/en/web/eu-vocabularies/concept-scheme/-/resource?uri=http://publications.europa.eu/resource/authority/dataset-type
For access rights that takes you to https://op.europa.eu/en/web/eu-vocabularies/concept-scheme/-/resource?uri=http://publications.europa.eu/resource/authority/access-right

JALinnerud commented 3 years ago

ONS has a data sensitivity model and a content sensitivity model. In Statistics Norway we have adopted 4 levels of Privacy from the ONS model: private, confidential, commercial, open. ONS also had a Sensitivity Assessment tool that we have translated to Norwegian and are implementing in our organisation as part of our GDPR compliance.

dgillman4909 commented 3 years ago

These are all interesting comments. I think Flavio is on the right track. He comments there are many dimensions by which to create subtypes of datasets. I propose we follow the definition of dataset from GSIM - organized collection of data - and build subtypes based on organization. Other criteria for identifying subtypes are not germane from that point of view, and I comment below on why I think we should not base subtypes on them.

In DDI-CDI, 4 basic structural types of organizing data sets have been defined: rectangular, event history, key-value pair, and dimensional. Several of the types could be used to structure the same data. There is not a canonical structure in all cases, though some data is much more amenable to one structure over the others.

The types are defined roughly as follows:

rectangular (or wide) - rows are units and columns are variables
event history (or tall or long) - rows are based on the value for each variable, one unit at a time - and this could be visualized as rows are variables and columns are units
dimensional - a pre-defined set of cells defined by the combination of categories, one from each of a set of dimensions (category sets), used to handle the value of some measure (variable) restricted to the cell
key-value - a set of values, each associated with some key

Dimensional data are usually associated with aggregates. Key-value data are often taken from scraping the web. Event history is used to describe events over some time period.
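To make the four structures concrete, here is a minimal sketch that arranges the same toy data in each of them. The data, the function names (`to_long`, `to_key_value`, `to_dimensional`) and the key format are illustrative assumptions, not taken from DDI-CDI.

```python
from itertools import product

# Hypothetical toy data in the rectangular (wide) layout:
# one row per unit, one column per variable.
rectangular = [
    {"unit": "u1", "sex": "F", "age": 34},
    {"unit": "u2", "sex": "M", "age": 51},
]

def to_long(rows, id_key="unit"):
    """Long (tall) layout: one (unit, variable, value) record per value."""
    return [
        (row[id_key], var, val)
        for row in rows
        for var, val in row.items()
        if var != id_key
    ]

def to_key_value(rows, id_key="unit"):
    """Key-value layout: each value stored under a single composite key."""
    return {
        f"{row[id_key]}.{var}": val
        for row in rows
        for var, val in row.items()
        if var != id_key
    }

def to_dimensional(rows, dimensions):
    """Dimensional layout: a pre-defined set of cells (one category per
    dimension), each holding a measure; here the measure is a unit count."""
    cells = {combo: 0 for combo in product(*dimensions.values())}
    for row in rows:
        cells[tuple(row[d] for d in dimensions)] += 1
    return cells

long_rows = to_long(rectangular)
key_values = to_key_value(rectangular)
cube = to_dimensional(rectangular, {"sex": ["F", "M"]})
```

The same two units end up in four different organizations, which is the point made above: the structure is a property of how the data are organized, not of the data themselves.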

The nominal, ordinal, interval, ratio are not used to differentiate datasets. Rather, they are families of datatypes used to describe variables. Nominal data are those conforming to a finite set of categories with no other conditions (sex categories). Ordinal data are those conforming to an ordered finite set of categories, but the difference between adjacent categories is not necessarily uniform (Likert scale measures of satisfaction). Interval data are numeric with no zero (absence of quantity) defined (Celsius temperature). Ratio data are numeric with a defined zero (Kelvin temperature). These apply to any kind of statistical data.

The distinction between aggregate and unit data is based on the definition of the variables in the dataset. A dataset can contain both unit and aggregate data.

Access restrictions on data (e.g., public, restricted, private) are assigned by the business and can change over the life-cycle of the dataset.

The domain for a dataset is defined by the subject field that the data apply to. However, some datasets are merged from others, so a merged set can have the combination of the subject fields of its constituents. There seems to be no restriction on the number of subject fields.

Mode of transmission is not definitional for a dataset, as a single dataset can be obtained multiple ways. The phases of GSBPM may not be useful, as a single dataset can pass through a phase without change. Further, the phases impose a usage criterion (data for collection; data for editing; etc.) that seems arbitrary and would be useless in another domain (outside statistics).

Similarly, the explorative, temporary, and organizational categorization is based on intent, rather than the data per se. Plus, the categorization could change without any change to the data. If we change the organizational structure described above (rectangular, etc.), then we should call that a new dataset.

FranckCo commented 3 years ago

Decided at the May 25 meeting:

create four sub-classes of coos:StatisticalDataset corresponding to the types listed by Dan above, say coos:RectangularDataset, coos:EventHistoryDataset, coos:DimensionalDataset and coos:KeyValueDataset.
all other categorizations should be rendered by properties with enumerated ranges (concept schemes)
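As a rough illustration of this decision outside RDF, the sketch below models structure as sub-classing and the other categorizations as properties with enumerated values. The Python class names mirror the coos names above; the property names and the members of `AccessRight` and `Status` are assumptions picked from values mentioned earlier in this thread, not agreed concept schemes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class AccessRight(Enum):
    """Hypothetical enumerated range (concept scheme) for access restrictions."""
    PUBLIC = "public"
    RESTRICTED = "restricted"
    PRIVATE = "private"

class Status(Enum):
    """Hypothetical enumerated range (concept scheme) for dataset status."""
    PRELIMINARY = "preliminary"
    FINAL = "final"
    REVISED = "revised"

@dataclass
class StatisticalDataset:
    title: str
    access_right: Optional[AccessRight] = None
    status: Optional[Status] = None

# Structure is expressed by sub-classing...
class RectangularDataset(StatisticalDataset): pass
class EventHistoryDataset(StatisticalDataset): pass
class DimensionalDataset(StatisticalDataset): pass
class KeyValueDataset(StatisticalDataset): pass

# ...while every other categorization is a property value that can change
# over the life-cycle of the dataset without changing its class.
ds = DimensionalDataset("Population by sex", AccessRight.RESTRICTED, Status.PRELIMINARY)
ds.status = Status.FINAL
ds.access_right = AccessRight.PUBLIC
```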

FranckCo commented 3 years ago

Remaining questions:

should we create a Metadataset sub-class of coos:StatisticalDataset?
Where would graph data (RDF, property graph) go in the typology chosen?

flo7894 commented 3 years ago

Current GSIM model distinguishes between Data Set and Referential Metadata Set, both being sub-classes of Information Set. A metadataset sub-class of coos:StatisticalDataset might not be fully consistent with GSIM then.

flo7894 commented 3 years ago

Considering rdf data as triples, I think they would be best rendered by the long format (event history). The subject of the triple would be the identifier component, the predicate would be the variable descriptor component, the object would be the variable value component. If needed the named graph of the triple could be stored as an attribute component.
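A minimal sketch of that mapping, using plain tuples rather than an RDF library. The `LongRecord` type, its field names and the example identifiers (`ex:ds1`, `ex:g1`, ...) are assumptions made for illustration only.

```python
from typing import NamedTuple, Optional

class LongRecord(NamedTuple):
    identifier: str               # subject of the triple
    variable: str                 # predicate of the triple
    value: str                    # object of the triple
    graph: Optional[str] = None   # named graph, kept as an attribute

def quads_to_long(quads):
    """Turn (subject, predicate, object, graph) tuples into long-format records."""
    return [LongRecord(s, p, o, g) for (s, p, o, g) in quads]

# Hypothetical example data.
quads = [
    ("ex:ds1", "rdf:type", "coos:StatisticalDataset", "ex:g1"),
    ("ex:ds1", "dcterms:title", "Population by sex", "ex:g1"),
    ("ex:ds2", "dcterms:source", "ex:ds1", None),
]
records = quads_to_long(quads)

# Records whose value is itself an identifier in the same set: these carry
# the graph's edges, which the long format does not make explicit by itself.
subjects = {r.identifier for r in records}
links = [r for r in records if r.value in subjects]
```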

In the currently available documents for CDI-DDI reviews, classes are defined as follows (see Part_2_DDI-CDI_Detailed_Model_PR_1.pdf):

Wide Data: Traditional rectangular unit record data sets. Each record has a unit identifier and a set of measures for the same unit.
Long Data: Each record has a unit identifier and a set of measures but there may be multiple records for any given unit. The structure is used for many different data types, for example event data and spell data.
Multi-Dimensional Data: Data in which observations are identified using a set of dimensions. Examples are multi-dimensional cubes and time series. (Note that support is provided for time-series-specific constructs to support some legacy systems which are not based around the manipulation of multi-dimensional data “cubes”.)
Key-Value Data: A set of measures, each paired with an identifier, suited to describing NoSQL and Big Data systems.

Do we agree to use those as definitions for coos classes?

I don't know if the names of the classes WideDataSet, LongDataSet will stay the same in the final specification of cdi-ddi or if they will be replaced by RectangularDataset, EventHistoryDataset. Either way shouldn't we name coos classes the same way cdi-ddi does ?

FlavioRizzolo commented 3 years ago

Note that in the DDI-CDI model the actual name of the third type of data is Dimensional. We use Multi-Dimensional only informally in the documentation because people relate to that.

How are we introducing these concepts in the ontology? As data or as datasets?

In either case, I think the definitions need some work, they kind of look like explanatory text to me for the most part. Perhaps we could start all with something like "organized collection of data in which..." and then provide the characterization. That would be in line with the way they are all defined in GSIM and DDI-CDI.

That brings us to the question, I think, of what to do with the definitions when the classes already exist and are defined in one of the base models we are integrating. Do we rewrite them here or use them as-is from the source?

dgillman4909 commented 3 years ago

Florian, et al,

First, I’m happy using DDI-CDI class names if that is the wish of the group. I doubt the names of the already defined structure types will change.

I put a comment in my entry in the Doodle poll, and it got lost. I will try to repeat it here:

The problem I see with creating a new metadata set type is that you can’t distinguish a set of metadata by the set’s structure. The types I proposed are all based on structure. Whether data are metadata is determined by role (the idea metadata are used to describe), not data set structure. However, we may want to add to the list of allowable structures to include some other ways metadata are organized: unstructured (for text files) and hierarchical (for graphs, XML, RDF, and JSON).

I hope that helps.

As I tried to make clear in my comment in GitHub, there is no one way to organize a data set. One could organize an RDF graph in the long format. The model for a triple translates into the long format pretty well. However, you’d need an obvious way to state that the subject of one triple was the object of another. I’m not sure relying on a URI is sufficient, because URI’s don’t have to be unique.

As for the definitions, I want to see “and described” added after “identified” in the definition of Multi-Dimensional Data. Otherwise, yes. The definitions are fine.

The idea the dimensions identify a cell comes from SDMX (I think), and it is a seductive one. However, ultimately, I think it overloads what the dimensions are doing for each cell. Fundamentally, they describe the cell. An n-cube is based on a unit of analysis, or unit type, and the universe for each cell is that unit type specialized by the dimensions assigned to that cell. In more mathematical terms, it restricts the domain of the measure represented in the n-cube to that cell.

Those are two slightly different interpretations, but neither invokes identification for a cell. In particular, the terminological approach doesn’t depend on the existence of a measure. I contend that it is useful to define an n-cube without knowing a measure in advance, because one can use the same dimensions to stratify many measures.
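A small sketch of that reading, under stated assumptions: the cells are built from the dimensions alone, and a measure is only supplied afterwards and applied to the units restricted to each cell. The dimension names, the example units and the helper names (`cells`, `in_cell`, `fill`) are hypothetical.

```python
from itertools import product

# The n-cube is defined by its dimensions only; no measure is needed yet.
dimensions = {"sex": ["F", "M"], "age_group": ["0-17", "18+"]}

def cells(dims):
    """Every cell is one combination of categories, one per dimension."""
    names = list(dims)
    return [dict(zip(names, combo)) for combo in product(*dims.values())]

def in_cell(unit, cell):
    """A unit belongs to the cell's universe if it matches every category."""
    return all(unit[d] == category for d, category in cell.items())

def fill(dims, units, measure):
    """Apply a measure to the units restricted to each cell."""
    return {
        tuple(cell.values()): measure([u for u in units if in_cell(u, cell)])
        for cell in cells(dims)
    }

# Hypothetical unit data.
units = [
    {"sex": "F", "age_group": "18+", "income": 30},
    {"sex": "F", "age_group": "18+", "income": 50},
    {"sex": "M", "age_group": "0-17", "income": 0},
]

# The same dimensions stratify two different measures.
counts = fill(dimensions, units, len)
total_income = fill(dimensions, units, lambda us: sum(u["income"] for u in us))
```

Here the dimensions describe the universe of each cell (the unit type specialized by the categories), and `fill` restricts the domain of whichever measure is passed in to that universe.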

Yours Dan

egreising commented 3 years ago

I apologize if this comment is out-of-time, but I have two questions after reading carefully all the comments:

I don't think that these four sub-classes of coos:StatisticalDataset corresponding to the types listed by Dan above, say coos:RectangularDataset, coos:EventHistoryDataset, coos:DimensionalDataset and coos:KeyValueDataset, will be enough. What happens with other formats that can be found in the statistical domain, like "Transposed", "Unstructured" or maybe "Blockchain" in the future? Is it possible to add sub-classes?
I think coos:StatisticalMetadataset should be a class at the same level as coos:StatisticalDataset.

dgillman4909 commented 3 years ago

Edgardo,

I thought I made clear we can add to my original list of 4. But, transposed is probably similar to or maybe the same as event history. We’d have to dig into some examples to see.

As for StatisticalMetadataSet, I argued against that in a subsequent comment or email, specifically because there is no structure implied by simply being metadata. Further, I noted a couple of possible new structures related to metadata to add, and one was unstructured and the other graph or hierarchical.

I don’t understand “blockchain” as a structure. It is entirely possible I don’t understand blockchain well enough to know. Can you explain what it is about this that requires a new structure and what that looks like?

Yours, Dan

egreising commented 3 years ago

Dan,

Thank you for your reply. Transposed is an old format that certain statistical data warehouses used to implement in order to make I/O very efficient by minimizing data transfer. The data is stored in multiple binary files, one for each variable, with all the values in the same order. The nth values of all the files together compose a unit record. The structure can be complemented with a B-tree structure for indexing each unit. An example of a statistical product using such a format is REDATAM.
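To illustrate the layout described here (and only that; this is not REDATAM's actual file format), the sketch below keeps one binary column per variable and rebuilds a unit record by reading the nth value of each column. In-memory byte strings stand in for the binary files, and the variable names and the fixed 4-byte integer encoding are assumptions.

```python
import struct

def write_column(values, fmt="<i"):
    """Pack all values of one variable, in unit order, into one byte string."""
    return b"".join(struct.pack(fmt, v) for v in values)

def read_value(column_bytes, n, fmt="<i"):
    """Read only the nth value of a column, without touching the rest."""
    size = struct.calcsize(fmt)
    return struct.unpack_from(fmt, column_bytes, n * size)[0]

# One "file" per variable; position n in every column belongs to unit n.
columns = {
    "age": write_column([34, 51, 7]),
    "household_size": write_column([2, 4, 3]),
}

def unit_record(cols, n):
    """Compose the unit record from the nth value of each variable."""
    return {name: read_value(data, n) for name, data in cols.items()}

second_unit = unit_record(columns, 1)
```

A query that needs a single variable only has to read that variable's column, which is where the I/O saving comes from.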

I don't know enough about "blockchain", just that it is a back-linked list of "transactions" with a header containing metadata. I don't know if it will ever be used for statistical processes, but it could be. As far as I know, there are many ways of implementing it, using databases or even flat files, which makes it not different from the "Rectangular" or "Dimensional" types. In my understanding, it is not the data support format that differentiates the types, but the way the information is organized, and blockchain is different from a rectangular or dimensional dataset.

Regarding StatisticalMetadataSet, I don't fully understand your point on "there is no structure implied by simply being metadata". If you use a DDI-C template for reference metadata, these metadata sets have a structure. Similarly, when you exchange reference metadata in SDMX there is always an MSD that defines the structure of the metadata set. And what is more important for me is that these structures are different from and independent of the data structures. That's why I think that StatisticalMetadataSet is a different class.

Best, Edgardo

dgillman4909 commented 3 years ago

Edgardo,

From your description of the transposed format, it is exactly what DDI-CDI means by event history data, and it is what I meant here.

Sounds like blockchain uses some format based on JSON, XML, RDF, or some other structure. We might not have to do anything. If there’s a blockchain standard that the industry has adopted with some new format, we might have to take a look.

Yes, the DDI’s (each member of the family) and SDMX specify formats for managing the metadata they define. But, those same structure types (hierarchy, graph?) could be used to organize data of many other kinds and applications. There is nothing inherent in the fact that the DDI’s and SDMX manage metadata that these structures exist. This is what I meant by “there is no structure implied by simply being metadata”. The fact some data are metadata does not, in itself, imply a structure.

Graph structures are typically used to organize metadata, and are used by the DDI’s and SDMX. But, with the metadata defined by the Dublin Core, the rectangular structure applies. Graph structures are increasingly being used to organize data in many domains, and rectangular structures have been used forever to organize some data. Therefore, a graph is not inherent to metadata.

If we define subtypes of a data set based on structure, a metadata set is not a structure type of its own.

Yours, Dan

FlavioRizzolo commented 3 years ago

Another side to this discussion is to carefully look at the DDI-CDI and GSIM models to do a mapping. CDI Wide Data Structure is not the same as GSIM Unit Data Structure: the former is uniform, in the sense that each row has the same components/columns, whereas the latter is heterogeneous, in the sense that each row might be associated with a different logical record. Just to keep in mind.

FranckCo commented 3 years ago

Ad hoc meeting was held on June 30th, Florian updated the ontology accordingly (see commit 217590e07e3944300bcfb141c01064c781266afc):

EventHistoryDataset renamed TransposedDataset
GraphDataset added (definition is missing)
no specific subtype for metadata: a property could be used
translations are missing
remaining question on mapping StatisticalDataset to GSIM (Information Set)

FranckCo commented 3 years ago

Regarding the property indicating that a dataset contains metadata, we could have a simple boolean "isMetadata" property or an object property like "metadataFor" whose domain could be the union of prov:Entity and prov:Activity (for process metadata). For metadata not attached to a particular process or entity (e.g. a statistical classification), the value of "metadataFor" could just be the "Official Statistics" individual.

FranckCo commented 3 years ago

Current state of things regarding the "products" domain: coos-prod-ds

FranckCo commented 3 years ago

Actually, limiting the range of the property might prevent some use cases (e.g. metadata on a prov:SoftwareAgent), so it is preferable to let the range open.

linked-statistics / COOS

Sub-classes of StatisticalDataset #15
