sdmx-twg / vtl

This repository is used for maintaining the SDMX-VTL specification

Addition of Data Points #272

Closed. stratosn closed this issue 1 year ago.

stratosn commented 7 years ago
| reporter issue reference | document (UM/RM/EBNF) | page | line |
| --- | --- | --- | --- |
| BDI-10 | RM | new | |
| BDI-11 | RM | new | |

Issue Description

(BDI-10) In VTL it is impossible to obtain a result Data Set by adding new Data Points to the extensional form of the operand Data Set. (BDI-11) In VTL it is impossible to obtain a result Data Set by changing some Data Points of the extensional form of the operand Data Set. In particular, it is impossible to change the measures in a simple way.

Proposed Solution

(BDI-10) Add an INSERT-like operation which produces a result Data Set by copying the Data Points of the input Data Set and adding some new Data Points. Following the general rules already described, it should be possible to specify the new Data Points in many ways: in-line, by defining them through the VTL-ML, or by accessing externally stored Data Points/Data Sets.

(BDI-11) Add an UPDATE-WHERE-like operation which produces a result Data Set by copying the Data Points of the input Data Set and updating some of them. It should also be possible to replace Data Points, by identifying the Data Points to be replaced through their Identifiers and by specifying the replacing Data Points (with the same Identifiers). Following the same general rules, it should be possible to specify the Data Points (to be replaced or replacing) in many ways: in-line, by defining them through the VTL-ML, or by accessing externally stored Data Points/Data Sets.

bellomarini commented 7 years ago

Proposal:

An operator to create a Dataset composed of a single data point.

Creating a one-data-point Data Set:

DS1 := sDataset(value1 in ValueDomainSubset, ..., valueN in ValueDomainSubset)

Example:

DS1 := sDataset(1 in Companies, 2 in Countries, 23 in Numeric)

Use the union to apply the insert

DS3 := union(DS1, DS2) 

Observe that with this approach the absence of duplicates is guaranteed (or solved within the union) and existing Datasets are never modified, but new ones are created.
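The behaviour described above (duplicates are resolved within the union, and existing Data Sets are never modified) can be sketched with a minimal, non-normative Python model, where a Data Set is a dict from identifier tuples to a measure value; the `s_dataset` helper and the first-operand-wins clash rule are illustrative assumptions, not part of the proposal:

```python
def s_dataset(*points):
    # Build a data set from literal data points; each point is ((ids...), measure).
    return dict(points)

def union(ds1, ds2):
    # The result contains every data point of both operands; on an identifier
    # clash the data point of the first operand wins (one possible rule).
    result = dict(ds2)   # copy: the existing data set is never modified
    result.update(ds1)   # ds1 takes precedence on duplicate identifiers
    return result

DS1 = s_dataset((("company_1", "country_2"), 23))   # one-data-point set
DS2 = {("company_9", "country_2"): 7}
DS3 = union(DS1, DS2)                               # insert via union
```

Because dict keys are the identifiers, the absence of duplicates in the result is guaranteed by construction, and `DS2` is left untouched.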

An operator, UPDATE, to virtually modify the content of a Dataset. In the syntax below, the data points of ds that match those of ds2 (they must have the same dimensions) have their measures modified by copying those of ds2. The data points of ds that do not match any data point of ds2 are not modified. In general ds and ds2 are expressions, so the modification can even be calculated.

ds3 := update(ds, ds2)

Examples:

The following example updates the content of ds by adding 1 to it and stores the result into ds3. In this case, all the Data Points of ds match those of ds+1, so all of them will be updated.

ds3 := update(ds, ds+1)

The following example updates only the Data Points where the component x evaluates to 1, adding the content of ds2 to them.

ds3 := update(ds, ds[filter x=1]+ds2)
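A hedged sketch of the proposed update semantics, again modelling a Data Set as a Python dict from identifier tuples to a measure (an illustrative simplification, not the spec): matching data points of ds take their measures from ds2, non-matching ones are left untouched, and data points of ds2 without a match in ds are ignored (update never inserts).

```python
def update(ds, ds2):
    # For each data point of ds: if ds2 has a point with the same
    # identifiers, copy its measure; otherwise keep the point unchanged.
    return {key: ds2.get(key, measure) for key, measure in ds.items()}

ds  = {("A",): 1, ("B",): 2}
ds2 = {("A",): 10, ("C",): 30}   # ("C",) has no match in ds, so it is ignored
ds3 = update(ds, ds2)            # ("A",) updated, ("B",) untouched

# The "ds + 1" example: every point of ds matches, so all are updated.
ds4 = update(ds, {k: v + 1 for k, v in ds.items()})
```
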
capacma commented 7 years ago

In addition to the update operator proposed above, I propose to have an update_cube operator as described below.

Like the update proposed above, update_cube returns a new dataset, without duplicates (by definition). The update_cube is useful when the user wants to calculate a subset of the dataset (specified by one or more dimension values) without specifying whether to execute an SQL insert, delete or update operation. The user just tells VTL to "replace" the data contained in a subset of the dataset with an expression.

Note: in the examples below, I am using the brackets [ ] for the subscript and curly braces { } for the clause.

The syntax is:

update_cube DatasetName [ Subscript ] = Expression { , DatasetName [ Subscript ] = Expression } *

Example 1:

ds3 := update_cube ds [ ref_area = BLEU ] = ds [ ref_area = BE ] + ds [ ref_area = LU ]

the behaviour of the above statement is equivalent to the following VTL 1.1 code:

ds1 := ds [ ref_area = BE ] + ds [ ref_area = LU ]
ds2 := ds1 { identifier ref_area = BLEU }
ds3 := union ( ds2 , ds { filter not ( ref_area = BLEU ) } )

The SQL code of the above example looks like:

select "BLEU" as ref_area, ... from ( select ... from ds1 join ds2 ... )
union
select * from ds where ref_area <> "BLEU"
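The replace-a-slice semantics of the VTL 1.1 expansion above can be sketched in Python (a non-normative model: data sets are dicts keyed by (ref_area, time) tuples, and the `update_cube` helper below is specialised to this one example):

```python
def update_cube(ds, target, left, right):
    # Recompute the target slice as left-slice + right-slice, matching
    # the slices on the remaining dimension (here the time period).
    summed = {
        (target, t): v + ds[(right, t)]
        for (area, t), v in ds.items()
        if area == left and (right, t) in ds
    }
    # filter not ( ref_area = target ): drop any stale target slice ...
    kept = {k: v for k, v in ds.items() if k[0] != target}
    # ... then union; no duplicates are possible by construction.
    return {**kept, **summed}

ds = {("BE", 2000): 1, ("LU", 2000): 2, ("BLEU", 2000): 99}
ds3 = update_cube(ds, "BLEU", "BE", "LU")
# the stale ("BLEU", 2000) value 99 is replaced by 1 + 2
```

The user never states whether this is an insert, delete or update of the BLEU slice; the slice is simply replaced, which is the point of the proposal.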

Example 2:

ds3 := update_cube ds [ ref_area = BLEU, time = 2000 ] = ds [ ref_area = BE, time = 2000 ] + ds [ ref_area = LU, time = 2000 ]

the behaviour of this statement is as follows:

ds1 := ds [ ref_area = BE, time = 2000 ] + ds [ ref_area = LU, time = 2000 ]
ds2 := ds1 { identifier ref_area = BLEU, time = 2000 }
ds3 := union ( ds2, ds { filter not ( ref_area = BLEU and time = 2000 ) } )

According to the syntax proposed above, it is possible to specify multiple updates in the update_cube operator. In this way we can write more compact code, and we can avoid creating a new (virtual) dataset for each calculation. The name of the "input" dataset (ds in the example below) is always the same, which simplifies the readability of the program. Note: the order of evaluation of the different updates is not specified explicitly; VTL can choose any order, provided that each calculation uses the data resulting from the previous calculations.

Example with multiple updates:

ds := get ( "some dataset" )
ds1 := update_cube ds [ ref_area = BLEU ] = ds [ ref_area = BE ] + ds [ ref_area = LU ] ,
                   ds [ time_period = 2000 ] = ds [ time_period = 2000S1 ] + ds [ time_period = 2000S2 ]
put ( ds1, "some dataset" )

is equivalent to:

ds := get ( "some dataset" )
ds1 := update_cube ds [ ref_area = BLEU ] = ds [ ref_area = BE ] + ds [ ref_area = LU ]
ds2 := update_cube ds1 [ time_period = 2000 ] = ds1 [ time_period = 2000S1 ] + ds1 [ time_period = 2000S2 ]
put ( ds2, "some dataset" )
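The "each calculation uses the data resulting from the previous calculations" rule amounts to folding the update specifications over the dataset, which the equivalence above can be reduced to. A minimal sketch (the datasets and update functions here are illustrative, not from the proposal):

```python
def apply_updates(ds, updates):
    # Apply each update specification to the result of the previous one,
    # as the chained update_cube statements above do.
    for upd in updates:
        ds = upd(ds)
    return ds

double_a = lambda d: {**d, "A": d["A"] * 2}
add_b    = lambda d: {**d, "B": d["A"] + 1}   # sees the already-updated "A"

ds1 = apply_updates({"A": 1}, [double_a, add_b])
```

Note that because `add_b` reads the value written by `double_a`, any evaluation order VTL chooses must preserve this data dependency, exactly as the note above requires.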

Note: we may also combine update and update_cube in a single syntax.

bellomarini commented 7 years ago

This issue is considered together with two others (#275, #335), which are closed.

egreising commented 7 years ago

I don't see any opposition between @bellomarini 's and @capacma 's proposals. @bellomarini 's is about inserting and updating the information in a dataset, while @capacma 's has to do with the "addressing" of the components in a dataset. Literally, what @capacma proposes is to make the "filter" clause implicit in this type of command, allowing components to be addressed by just specifying the keys. The last example in @bellomarini 's proposal ("updates only the Datapoints where the component x evaluates to 1 and adds the content of ds2 to it"):

ds3 := update(ds, ds[filter x=1]+ds2)

would be expressed as

ds3 := update(ds, ds[x=1]+ds2)

following @capacma 's idea.

I don't see the need to call these functions by a different name; they can just be "update", with the explained "freedom" of leaving the "filter" clause implicit. If agreed, this addressing syntax should be extended to any operator that can refer to a component or set of components in a dataset.

capacma commented 7 years ago

As mentioned by @egreising, the syntax already exists at page 121 of the RM (subscript). There is a slight difference between a subscript and a filter, but it does not change the idea behind update and filtering (a subscript specifies a single value for a dimension and eliminates that dimension from the resulting dataset). The main difference between the two proposals is the following: in the @capacma proposal, the insert and update operations are carried out without specifying explicitly whether to execute an insert of new data points or an update of existing data points; the data points contained in the subscript are just "replaced".