sdmx-twg / vtl

This repository is used for maintaining the SDMX-VTL specification

Consistency between Information Model and Language Fundamentals #34

Closed stratosn closed 11 months ago

stratosn commented 7 years ago
reporter issue reference | document (UM/RM/EBNF) | page | line
BDI-7 | UM/RM | 44 | All
BDI-1 | UM/RM | 44 | All
BDI-3 | UM/RM | 44 | All

Issue Description

(BDI-7) The specification lacks a clear explanation about the management of constants. Some operators (math operators, for example) contain an ad hoc strategy, but a general framework is missing. Moreover, the compatibility between the scalar expressions expressed in the core and the constants as a special case of datasets having one data point and one measure is not clarified. Design a coherent model to handle constants accordingly. Possibly they should be dealt with as first-class objects and not as special datasets. However, their correspondence in terms of datasets should be clarified.

(BDI-1) The language used in the first sections, until the “VTL Information Model” (included), and the language used in the following sections, from “Language Fundamentals” onward, need to be aligned, because different terms are used to express the same concept. For example (term used until the VTL IM <--> term used from Language Fundamentals):

  • Artefacts <--> Objects
  • Types of artefacts <--> Types of objects
  • Code Item <--> Scalar value
  • Set <--> Set
  • ??? (not existing in the IM) <--> Collection
  • Data Point <--> Record
  • Data Structure types <--> Product types
  • … and so on …

The former and the latter parts of the User Manual should be aligned, possibly by using the same terms when the concept they refer to is the same. Because the terms of the IM are aligned as much as possible with the GSIM terminology, it is suggested to change the terms of the latter part (from language fundamentals onwards) in order to cope with the GSIM terminology as well.

(BDI-3) To guarantee the property of “closure” of the language, the VTL should operate only on artefacts of its own IM. Instead, there are aspects of the language that are not part of the IM, for example: VTL basic data types (numeric, string, boolean …), Lists and Collections, … and possibly others still to be detected …

Proposed Solution

(BDI-7) Design a coherent model to handle constants accordingly. Possibly they should be dealt with as first-class objects and not as special datasets. However, their correspondence in terms of datasets should be clarified.

(BDI-1) The former and the latter parts of the User Manual should be aligned possibly by using the same terms when the concept they refer to is the same. Because the terms of the IM are aligned as much as possible with the GSIM terminology, it is suggested to change the terms of the latter part (from language fundamentals onwards) in order to cope with the GSIM terminology as well.

(BDI-3) The documentation should be inspected carefully to detect all of these cases. Once found, either the IM is upgraded by adding the missing artefacts or these artefacts are eliminated. If some Operator of the language uses these kinds of artefacts, they should be added to the IM, otherwise they can be eliminated. For example, the basic VTL data types can be considered as Value Domains that exist by default, even if no user defines them explicitly. This way the scalar values are Code Items of some Value Domains. As for Lists and Collections, it should be verified if they are used by some operators. If so, they should be added to the IM. This comment is related to the previous ones.

bellomarini commented 7 years ago

In addition to the solutions BDI-7, BDI-1, BDI-3,

I suggest the introduction of an autoboxing assumption such that the Datasets having only one Component (a Measure Component, in particular) can be considered as scalar values of the same type as the Measure Component itself.

Let us call such Datasets singletons. I also assume that a singleton can contain at most one Datapoint, which is reasonable, since it does not have identifier components.

Thus, everywhere a literal scalar value (or a variable parameter for it, or in general a component expression) is accepted in VTL syntax, a singleton data set variable parameter can be used as well.

Example:

V1 := D1[keep M1]
D2 := D3[calc M2 * V1 + 2 as M3]
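
The autoboxing proposal can be illustrated outside VTL with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than part of VTL or of any VTL engine: the Dataset class, the is_singleton, unbox and calc helpers, the sample data, the assumption that D1 has no identifier components, and the choice to treat an empty singleton as an error.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Scalar = Union[int, float, str, bool]


@dataclass
class Dataset:
    """Hypothetical stand-in for a VTL Data Set: component roles plus Data Points."""
    identifiers: List[str]
    measures: List[str]
    points: List[Dict[str, Scalar]]


def is_singleton(ds: Dataset) -> bool:
    # Singleton as proposed above: no identifier components, exactly one
    # Measure Component, and at most one Datapoint.
    return not ds.identifiers and len(ds.measures) == 1 and len(ds.points) <= 1


def unbox(value: Union[Dataset, Scalar]) -> Scalar:
    """Return a scalar, unwrapping a singleton Data Set where one is given."""
    if not isinstance(value, Dataset):
        return value
    if not is_singleton(value):
        raise TypeError("only a singleton Data Set can be used as a scalar")
    if not value.points:
        # Assumption of this sketch: an empty singleton has no usable value.
        raise ValueError("empty singleton: no value to unbox")
    return value.points[0][value.measures[0]]


def calc(ds: Dataset, new_measure: str,
         expr: Callable[[Dict[str, Scalar]], Scalar]) -> Dataset:
    """Add new_measure to every Datapoint by evaluating expr on that Datapoint."""
    points = [{**p, new_measure: expr(p)} for p in ds.points]
    return Dataset(ds.identifiers, ds.measures + [new_measure], points)


# V1 := D1[keep M1], with D1 assumed to have no identifier components
v1 = Dataset(identifiers=[], measures=["M1"], points=[{"M1": 10}])

# D2 := D3[calc M2 * V1 + 2 as M3]
d3 = Dataset(["ID"], ["M2"], [{"ID": "a", "M2": 1}, {"ID": "b", "M2": 3}])
d2 = calc(d3, "M3", lambda p: p["M2"] * unbox(v1) + 2)
```

In this sketch unbox is called explicitly; under the autoboxing assumption the language itself would apply it wherever a scalar is expected.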

VincDV commented 7 years ago

This comment does not refer to the whole Language Fundamentals section but only to the sub-section Objects and Types. Other comments about the other sections will follow.

Comparing the terminology used in Objects and Types and the Information Model, I would say that:

An “object” is an object of Transformations, in other words an identifiable artefact of the VTL IM that can be operand or result of a Transformation.

Not all the identifiable artefacts can be objects, for example some artefacts of the generic model of Transformations (e.g. the “Operators”, the “Transformations” themselves) are not objects because they cannot be operand or result of Transformations.

Any VTL object has a type.

The “Scalar” type is the generic type of the single values of the Value Domains (i.e. the Values/Code Items) and of the Value Domain Subsets.

Each data type/Value Domain identifies the values it may contain and the possible operations on them (like +, -, *, /, power … for numbers).

The VTL has 4 basic default scalar types/Value Domains: Number, Boolean, Date, String (I would avoid distinguishing Integer and Float in order to allow these two types to be combined by the same operators); this also means that the VTL assumes that these 4 basic Value Domains exist by default, even without defining them explicitly. Correspondingly, the VTL assumes that 4 basic Variables exist by default, one for each basic Value Domain, namely “NumVar”, “BoolVar”, “TimeVar”, “TextVar” (see also my comment on the general behaviour for the measures Component in the Item #6).

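For each basic Value Domain, proper scalar operations are possible. For example:

  • for the Number Value Domain: mathematical operations (sum, subtraction, product, division, power and so on),
  • for the Boolean Value Domain: logical operations (and, or, not …)
  • for the Date Value Domain: time operations (lag …)
  • for the String Value Domain: string operations (substring, concatenation …)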

Please note again that each Value Domain identifies the values it may contain and the possible operations on them (i.e. its algebraic structure). As a consequence, the numeric Code Items of some enumerated Value Domains, for example 004 – Afghanistan, 008 – Albania … (ISO3166-1 numeric coding for countries), do not belong to the default type Number, because it makes no sense to compose such numeric values with the usual mathematical operations.

In addition to the basic Value Domains, the VTL allows user defined scalar types/Value Domains (this would be the case of the ISO3166-1 above). User defined Value Domains do not inherit the algebraic structure (i.e. possible operations) from the basic Value Domains and may have either no algebraic structure (no operations possible) or some other algebraic structure (e.g. see the section “relations and operations between Code Items”, which defines scalar operations which compose the Code Items of some user defined Value Domain to obtain other Code Items of the same Value Domain). Each user defined Value Domain corresponds to a different (user defined) data type.
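
The idea that a Value Domain carries both its values and its algebraic structure can be sketched in Python as follows. The ValueDomain class, its contains and operations fields, and the two domain instances are hypothetical names chosen for illustration; only the two country codes quoted above are included.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ValueDomain:
    """Hypothetical model: a Value Domain names its values and its operations."""
    name: str
    contains: Callable[[object], bool]
    operations: Dict[str, Callable] = field(default_factory=dict)

    def apply(self, op: str, *args):
        # An operation is only available if the domain itself defines it.
        if op not in self.operations:
            raise TypeError(f"operation {op!r} is not defined on {self.name}")
        for a in args:
            if not self.contains(a):
                raise ValueError(f"{a!r} is not a value of {self.name}")
        return self.operations[op](*args)


# Basic domain: numbers with (some of) the usual mathematical operations.
NUMBER = ValueDomain(
    "Number",
    contains=lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    operations={"+": lambda a, b: a + b, "*": lambda a, b: a * b},
)

# User defined domain: numeric-looking codes, but no algebraic structure.
ISO3166_NUMERIC = ValueDomain(
    "ISO3166-1 numeric",
    contains=lambda v: v in {"004", "008"},  # only the two codes quoted above
)
```

With these definitions, NUMBER.apply("+", 4, 8) is accepted, while ISO3166_NUMERIC.apply("+", "004", "008") raises a TypeError, because the user defined domain does not inherit the operations of Number.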

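VTL allows defining Value Domain Subsets (or simply SETS) of the default Value Domains above, examples:

  • number [a:b] -- any number that falls between two constants a and b, both inclusive (where a<b).
  • number {x1, ..., xn} -- one of the numbers enumerated in {x1, ..., xn}
  • string [a:b] -- any string consisting of between a and b characters
  • string {s1, ..., sn} -- one of strings enumerated in {s1, ..., sn}; and so on …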

These sub-types/Sets obviously may inherit the algebraic structure (i.e. the possible operations) of the corresponding basic Value Domain.
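
Value Domain Subsets of this kind (a numeric interval such as number [a:b], an enumeration such as number {x1, ..., xn}, a string length range such as string [a:b], or an enumeration of strings) can be read as membership predicates over the basic domain. A minimal Python sketch, in which all function names are hypothetical:

```python
from typing import Callable, Iterable

Predicate = Callable[[object], bool]


def _is_number(v: object) -> bool:
    # bool is excluded so that True/False are not treated as numbers.
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def number_range(a: float, b: float) -> Predicate:
    """number [a:b]: any number between a and b, both inclusive (a < b)."""
    if not a < b:
        raise ValueError("number [a:b] requires a < b")
    return lambda v: _is_number(v) and a <= v <= b


def number_enum(values: Iterable[float]) -> Predicate:
    """number {x1, ..., xn}: one of the enumerated numbers."""
    allowed = frozenset(values)
    return lambda v: _is_number(v) and v in allowed


def string_length(a: int, b: int) -> Predicate:
    """string [a:b]: any string of between a and b characters."""
    return lambda v: isinstance(v, str) and a <= len(v) <= b


def string_enum(values: Iterable[str]) -> Predicate:
    """string {s1, ..., sn}: one of the enumerated strings."""
    allowed = frozenset(values)
    return lambda v: isinstance(v, str) and v in allowed
```

A value accepted by one of these predicates is still a value of the basic domain, which is why the subset can reuse that domain's operations.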

The representation format is a concept different from the type.

Each scalar type/Value Domain (both default and user defined ones) has a representation format (the SDMX facet).

The representation format of the basic data types/Value domains is the following:

[the pages 95-96 of the user manual should be moved into this section; for the representation of the date, the ISO format should be recalled, as decided in a previous meeting]

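The representation format of a user defined Value Domain derives from the representation format of one of the basic Value Domains, distinguishing the Numbers in two categories (Integers and Floating Numbers) and restricting the representation if needed, examples:

  • integer [a:b] -- any integer that falls between two integer constants a and b, both inclusive (where a<b).
  • integer {x1, ..., xn} -- one of integers enumerated in {x1, ..., xn}
  • string [a:b] -- any string consisting of between a and b characters
  • string {s1, ..., sn} -- one of strings enumerated in {s1, ..., sn}; in effect this type describes elements of a code list.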

The representation format of a Value Domain is the format in which its Values are represented and may not be the same as the data type of the Value Domain. For example, for the ISO3166-1 numeric coding for countries (004 – Afghanistan, 008 – Albania …), the data type would be a user defined data type corresponding to the specific ISO3166-1 Value Domain, while its representation would be integer [3]. The fact that the representation format is integer does not mean that numerical operations are possible.
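
The distinction between representation and type can be sketched in Python. The CountryCode class is hypothetical, and the sketch assumes that integer [3] means exactly three decimal digits, which the comment above does not spell out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CountryCode:
    """Value of a hypothetical user defined ISO3166-1 numeric Value Domain."""
    code: str

    # Representation check only; assumed reading of "integer [3]".
    _FORMAT = re.compile(r"\d{3}")

    def __post_init__(self) -> None:
        if not self._FORMAT.fullmatch(self.code):
            raise ValueError(f"{self.code!r} is not a three-digit code")


afghanistan = CountryCode("004")
albania = CountryCode("008")
```

The class validates the integer-like representation but deliberately defines no arithmetic, so an expression such as afghanistan + albania fails with a TypeError even though both codes look like numbers.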

The representation of a Value Domain is inherited by its (sub)Sets.

I wonder if the representation formats should be explained in the User Manual, being an implementation aspect more than a logical aspect. This is a consideration which applies in general: I think it would be much better to avoid in the User Manual technical information which requires programming skills, moving it to a specialized section.

I don't think the VTL needs the type “List”; it does not exist in the IM. I don't think the VTL needs the type “Collection” either; it does not exist in the IM as well.

The “Dataset type” corresponds to the Data Structure of the Data Sets; in fact, for Data Sets having the same Data Structure the same operations are allowed. I suggest renaming it “Data Structure Type”.

I would eliminate the “Record” type (which would correspond to the Data Point in the IM and would have the same data structure as the Data Set it belongs to). The Data Point is a particular case of Data Set, namely a subset containing just one Data Point, among a number of possible other subsets.

I would eliminate “Product Type”, which would correspond to a multidimensional Value Domain in the IM. I think that this detail should be avoided unless there is some concrete reason for which we need it.

I would eliminate the “Function” type, because a Function corresponds to an Operator in the IM and cannot be Transformed (operators cannot be operands or results of the VTL-ML Transformations).

dragan-ivanovic commented 7 years ago

Dear Vincenzo,

Thank you very much for your comments. I do not agree with all of the points you raised, but I believe that the design of the type system is an important issue that needs to be discussed very carefully from different angles.

My main point of disagreement is the following: the type system needs to assign a type to any kind of object that figures in the program. For instance, if we have core functions and if we can define new functions (in fact, the entire standard library should be constructed in that way), and if we can pass functions as parameters (as foreseen in many cases), then we simply must have a type for functions. In compiling / interpreting f(x), the compiler must know that an object associated with name f is a function (not a number, dataset or a validation rule), and it needs to know what kind of arguments that function takes and what kind of results it returns. If f expects a dataset and x is a number, the compiler uses this information to signal an error. That is nothing more nor less than having a function type in the type system. The same goes for the collections (lists and sets), products (tuples) and data structures.
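
The check described for f(x) can be sketched as a tiny type checker in Python. The Base and Func classes, the check_call function and the sample environment are hypothetical and are not taken from any VTL implementation.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Base:
    """A non-function type, identified by name (e.g. number, dataset)."""
    name: str


@dataclass(frozen=True)
class Func:
    """A function type: the type of the argument and the type of the result."""
    param: "Type"
    result: "Type"


Type = Union[Base, Func]

NUMBER = Base("number")
DATASET = Base("dataset")


def check_call(env: Dict[str, Type], f: str, x: str) -> Type:
    """Type-check the application f(x) and return the type of its result."""
    f_type, x_type = env[f], env[x]
    if not isinstance(f_type, Func):
        raise TypeError(f"{f} is not a function")
    if f_type.param != x_type:
        raise TypeError(f"{f} expects {f_type.param}, but {x} is {x_type}")
    return f_type.result


# f takes a dataset and returns a dataset; d is a dataset; n is a number.
env: Dict[str, Type] = {"f": Func(DATASET, DATASET), "d": DATASET, "n": NUMBER}
```

Here check_call(env, "f", "d") returns the dataset type, while check_call(env, "f", "n") signals an error. Because Func is itself a Type, a function that takes another function as a parameter can be described in the same way.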

I do agree that a function type is not something that should go into IM: as you say, it is not transformed by VTL programs. However, forcing the programmers to work only with things that represent inputs or outputs could make VTL unattractive in the face of competition. I believe that limiting the language only to the kinds of objects that exist in the IM would deprive it of the essential toolkit used by many standard programming techniques and would make it difficult to implement more sophisticated validations.

We should not forget that we are not only trying to patch up VTL 1.1 so that it makes sense, but we also need to sell it to the international statistical community as something arguably better than the proven off-the-shelf solutions they already use, such as R and SAS.

An analogy can be made with most Unix command-line utilities, which are filters that read something from the standard input, process it, and write the result to the standard output. That is the only thing the user can see. The Unix "file IM" is much simpler than the VTL IM: the only thing in it is the concept of a file as a stream of bytes. However, the "classical" Unix utilities are written in C and extensively use functions (C also has function types), data structures (equivalent to module types) and abstract data types such as collections and index trees to perform their tasks. Otherwise, utilities such as sort and word count would be very difficult, almost impossible, to write just based on working with individual bytes or byte arrays. One can, of course, think of a language where the only types are bytes and byte arrays (i.e., only those things that appear in the "file IM"), but nobody would like to use it if they could choose something else.
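
The analogy can be made concrete with a small Python filter (the utilities mentioned above are written in C; Python is used here only for brevity). The function name word_count_filter is hypothetical, and the filter counts word frequencies rather than reproducing any particular Unix tool. From the outside it only reads and writes bytes, while internally it relies on a function, a list and a hash-based collection.

```python
import io
from collections import Counter
from typing import BinaryIO


def word_count_filter(stdin: BinaryIO, stdout: BinaryIO) -> None:
    """Read bytes, count whitespace-separated words, write 'word<TAB>count' lines."""
    # Internal types (list of words, Counter) never appear in the byte-stream
    # interface that the user sees.
    counts = Counter(stdin.read().split())
    for word, n in sorted(counts.items()):
        stdout.write(word + b"\t" + str(n).encode("ascii") + b"\n")


# Usage with in-memory streams standing in for standard input and output.
out = io.BytesIO()
word_count_filter(io.BytesIO(b"b a b\n"), out)
result = out.getvalue()
```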

We can discuss this, of course, in more detail.

I have added today a document named "type-sys.pdf" in comments to the GitHub issue #283, with many more details and examples.

Best regards,

Dragan

VincDV commented 7 years ago

Dear Dragan,

Yes, I agree that we need to discuss these topics carefully and that there are many things to discuss. Moreover, I see your points.

As for the functions, I do not mean that they should not go into the IM. On the contrary, I think that they are already in the IM, only under a different name: in my understanding, a function is what in the IM is called Operator. Because "function" is a very generic term, also applicable to the Data Sets (the Data Set is a function having dependent and independent variables ...), I would prefer to use the term Operator (explaining better that this term means "transformation function" and describing the different kinds of operators, namely "core", "standard library" and "user defined" operators).

Moreover, considering your main point, I'm not against having the type "Operator" if we give a coherent definition of what a "VTL object" is in terms of the IM, in order to include Operators too.

If we need the type "Operator", I wonder if we need also the type "expression"/"Transformation". Do we?

More generally, I see that the artefacts of the IM are also types of the typing system, therefore in my opinion we should use the same names, to show that we are referring to the same things and to avoid misunderstandings.

On the other hand, it is very difficult for me to accept the idea that certain types of the typing system are not artefacts of the IM, because by definition a language is strictly connected to its IM and can manipulate only the IM artefacts (this is the basic property of closure). If a language can also manipulate something else, it means that the IM and the language are incoherent.

For IT procedural programming languages (C, R, SAS ...) this aspect of coherency does not appear because the IM of these languages is not described separately from their type system (as it is in the VTL), so that their type system simply is their IM. This aspect appears instead in SQL, which is based on the Relational IM and is declarative (like VTL), and coherently SQL does not use types that are not in the IM.

Moreover, I would not put the VTL in competition with the IT procedural programming languages, nor try to convince the international statistical community to replace the proven off-the-shelf solutions they already use with the VTL. The VTL was not born for this purpose and I don't think we have any hope of doing so, for many reasons. VTL and IT languages are on two different planes: the VTL is for statisticians, with the main goal of sharing validation and transformation rules by using a standard declarative language and documenting the relationships between operands and results, while the IT procedural languages are for IT people and for IT implementations and can even be used for implementing VTL rules, so the VTL is not aimed at replacing any existing IT language. The strength of the VTL is based on its IM, which is derived from statistical notions and oriented to statisticians, unlike any IT language. We do not necessarily need in VTL all the features (and the types) of a procedural IT language; rather, some features (and types) may even be counter-productive (like the non-declarative ones). Nevertheless, the features of the IT languages remain available for the IT implementations.

linardian commented 11 months ago

Vincenzo's last comment is taken as valid.