sdmx-twg / vtl

This repository is used for maintaining the SDMX-VTL specification

New update statement. #275

Closed stratosn closed 1 year ago

stratosn commented 7 years ago
reporter issue reference document (UM/RM/EBNF) page line
MC-51 RM new update

Issue Description

The update statement is proposed to replace the current put operator that is used to update a persistent dataset.

Proposed Solution

New update statement:

update ds1 with ds2

where ds1 is an updatable dataset of the form: dataset_name { [ subscript ] } { { filter_clause } }

Example 1: the following statement deletes the data points in ds with ref_area "BLEU" and inserts the data points resulting from the expression ds [ ref_area = "BE" ] + ds [ ref_area = "LU" ]:

update ds [ ref_area = "BLEU" ] with ds [ ref_area = "BE" ] + ds [ ref_area = "LU" ]

Example 2: to delete all data points where ref_area = "BLEU":

update ds [ ref_area = "BLEU" ] with empty

The keyword empty is a special keyword that instructs VTL to replace the data with an empty set of data (i.e. simply delete them).

Example 3: to delete all data points from ds:

update ds with empty
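One possible reading of the three examples, sketched in Python rather than VTL. Everything here is an assumption made for illustration: a dataset is modelled as a mapping from (ref_area, time_period) to obs_value, a subscript is assumed to drop the fixed ref_area dimension, and the helper names subscript, add, update and EMPTY are hypothetical. The helper returns a new mapping instead of changing ds in place, only to keep the sketch easy to check.

```python
from typing import Dict, Optional, Tuple

Dataset = Dict[Tuple[str, str], float]  # (ref_area, time_period) -> obs_value
Slice = Dict[str, float]                # time_period -> obs_value

EMPTY: Slice = {}  # stands in for the proposed `empty` keyword


def subscript(ds: Dataset, ref_area: str) -> Slice:
    """ds [ ref_area = X ]: keep one ref_area and drop that dimension (assumed)."""
    return {period: value for (area, period), value in ds.items() if area == ref_area}


def add(left: Slice, right: Slice) -> Slice:
    """Add two slices on the periods they have in common."""
    return {period: left[period] + right[period] for period in left.keys() & right.keys()}


def update(ds: Dataset, ref_area: Optional[str], replacement: Slice) -> Dataset:
    """Delete the targeted data points, then insert the replacement ones.

    ref_area=None targets the whole dataset (Example 3).
    """
    kept = {} if ref_area is None else {k: v for k, v in ds.items() if k[0] != ref_area}
    if ref_area is not None:
        kept.update({(ref_area, period): value for period, value in replacement.items()})
    return kept


ds: Dataset = {
    ("BE", "2016"): 10.0,
    ("LU", "2016"): 2.0,
    ("BLEU", "2016"): 99.0,
    ("BLEU", "2015"): 5.0,
}

example1 = update(ds, "BLEU", add(subscript(ds, "BE"), subscript(ds, "LU")))
example2 = update(ds, "BLEU", EMPTY)
example3 = update(ds, None, EMPTY)
```

Under this reading, Example 1 leaves exactly one "BLEU" data point per period computed from "BE" and "LU", because the old "BLEU" data points are deleted before the new ones are inserted.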


See also alternative at #335

bellomarini commented 7 years ago

See also an alternative approach #272

capacma commented 7 years ago

A generic use case for an update statement: suppose the user wants to derive an indicator (e.g. indic = BOP_GDP, the total BOP value of a country divided by the GDP of the country) based on existing data (e.g. ds_bop and ds_gdp). A hierarchical rule is not appropriate for computing the indicator because the derivation involves two datasets, which makes it too complex: create the hierarchical ruleset, include a join, etc.

An efficient way to derive the indicator is to use an update statement (as described above) that relies on the multidimensional model offered by VTL. With the update statement the indicator is derived in a relatively simple way, assuming that ds_bop and ds_gdp have some dimensions in common (e.g. ref_area and time_period):

ds_bop [ indic = BOP_GDP ] = ds_bop [ indic = BOP_TOTAL ] / ds_gdp { filter obs_value <> 0 }

Of course we should also consider the alternatives described in #335 and #272.
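To make the intended derivation concrete, here is a minimal Python sketch of the calculation only (it is not VTL and says nothing about how the result would be stored). The layout is assumed: ds_bop is keyed by (indic, ref_area, time_period), ds_gdp by (ref_area, time_period), and the function name derive_bop_gdp and the sample values are hypothetical.

```python
from typing import Dict, Tuple

BopDataset = Dict[Tuple[str, str, str], float]  # (indic, ref_area, time_period) -> obs_value
GdpDataset = Dict[Tuple[str, str], float]       # (ref_area, time_period) -> obs_value


def derive_bop_gdp(ds_bop: BopDataset, ds_gdp: GdpDataset) -> BopDataset:
    """Compute BOP_TOTAL / GDP on the shared dimensions, skipping zero GDP values."""
    # Counterpart of `ds_gdp { filter obs_value <> 0 }`
    nonzero_gdp = {key: value for key, value in ds_gdp.items() if value != 0}
    return {
        ("BOP_GDP", area, period): value / nonzero_gdp[(area, period)]
        for (indic, area, period), value in ds_bop.items()
        if indic == "BOP_TOTAL" and (area, period) in nonzero_gdp
    }


# Illustrative values only
ds_bop: BopDataset = {
    ("BOP_TOTAL", "BE", "2016"): 50.0,
    ("BOP_TOTAL", "LU", "2016"): 8.0,
}
ds_gdp: GdpDataset = {
    ("BE", "2016"): 400.0,
    ("LU", "2016"): 0.0,  # filtered out, so no indicator is derived for LU
}

indicator = derive_bop_gdp(ds_bop, ds_gdp)
```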

vignola commented 7 years ago

Only a few comments:

1) We should be careful about the duplication of data points. In this case:

update ds [ ref_area = "BLEU" ] with ds [ ref_area = "BE" ] + ds [ ref_area = "LU" ]

if I have understood the operator correctly, we will have a duplication of records.

2) I don't see the use of update in the rule:

ds_bop [ indic = BOP_GDP ] = ds_bop [ indic = BOP_TOTAL ] / ds_gdp { filter obs_value <> 0 }

3) I would avoid introducing another keyword, "empty":

update ds [ ref_area = "BLEU" ] with empty

This could simply be done with a filter, or using the proposed operator "remove":

remove (ref_area = "BLEU") from ds

VincDV commented 7 years ago

I'm very sorry, but the proposed solution is not possible in VTL. VTL is a functional language dealing with immutable objects, as described in the VTL language fundamentals and in the VTL Information Model; therefore Data Sets are not updatable at all. Moreover, VTL is acyclic and cannot allow updates, because they generate cycles and consequently unpredictable results. Therefore a fundamental VTL constraint is that a Data Set can be made persistent just once.

In other words, taking one generic expression, when the Data Points of an input Data Set change (e.g. because new Data Points relevant to the input Data Set are collected), the VTL expressions must be re-executed in order to produce a new version of the result Data Set. In turn, if the result Data Set is an input to other expressions, these must be re-executed as well.

The VTL mechanism is the same as that of spreadsheets: there are no Operators to update the value of a cell, just Operators to calculate the cell (and each cell may have no more than one calculation formula).

It is up to the spreadsheet implementation to understand when the calculation of a cell needs to be executed (or re-executed), just as it is up to the VTL implementation to understand when the calculation of a Data Set needs to be executed (or re-executed).

Moreover, considering that VTL deals with objects more complex than a single cell (i.e. Data Sets), it may happen that only a small part of the input Data Set changes (for example, a new reference date is collected while all the previous ones remain unchanged). Even in this case, it is up to the VTL implementation to assess which part of the input Data Set has changed and consequently which part of the output Data Set needs to be re-calculated. The updates obviously happen, but at the IT implementation level, not at the VTL level.
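The spreadsheet analogy can be sketched in Python as follows. This is only an illustration of the idea, not how any VTL engine is implemented: the names DEFINITIONS, affected and execute and the three sample Data Sets are hypothetical, each result has exactly one defining formula, and the standard-library graphlib module (Python 3.9+) is used to order the calculations. TopologicalSorter raises CycleError if the definitions contain a cycle, which mirrors the acyclicity constraint.

```python
from graphlib import TopologicalSorter

# One formula per result Data Set: name -> (input names, calculation)
DEFINITIONS = {
    "ds_total": (("ds_be", "ds_lu"),
                 lambda be, lu: {t: be[t] + lu[t] for t in be.keys() & lu.keys()}),
    "ds_share": (("ds_be", "ds_total"),
                 lambda be, total: {t: be[t] / total[t] for t in be.keys() & total.keys()}),
    "ds_lu_copy": (("ds_lu",), lambda lu: dict(lu)),
}


def affected(changed):
    """Return the results that must be (re-)executed, inputs before dependents."""
    graph = {name: set(inputs) for name, (inputs, _) in DEFINITIONS.items()}
    stale = set(changed)
    order = []
    for name in TopologicalSorter(graph).static_order():
        if name in DEFINITIONS and stale & graph[name]:
            stale.add(name)
            order.append(name)
    return order


def execute(env, names):
    """Calculate the named results into a new environment; env is left untouched."""
    result = dict(env)
    for name in names:
        inputs, formula = DEFINITIONS[name]
        result[name] = formula(*(result[i] for i in inputs))
    return result


# First run: every result is calculated from the collected inputs.
collected = {"ds_be": {"2016": 10.0}, "ds_lu": {"2016": 2.0}}
v1 = execute(collected, affected(collected.keys()))

# ds_be is collected again: only ds_total and ds_share are re-executed.
v2 = execute({**v1, "ds_be": {"2016": 18.0}}, affected({"ds_be"}))
```

Here the user only writes the formulas; deciding what to re-execute is left to the machinery around them, and ds_lu_copy is not recalculated when only ds_be changes.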

The advantage of the VTL approach is that users, in defining the calculation expressions, need only define the algorithm for producing the result and do not need to take care of all the processes that govern the changes. This is also in line with user-orientation, which is a main principle of VTL.

A different approach would obviously have been possible (even if I don't think it would have been convenient at all), but it would have led to a very different kind of language, because being functional, being acyclic and dealing with immutable objects are really the foundations of VTL.

Everything said here also applies to issue #335.

What is possible in VTL is only to simulate an update, by calculating a new Data Set that contains a modified version of another one. This is described in #272.

In conclusion, I propose closing issues #275 (this one) and #335 because they are not feasible, and continuing the discussion in issue #272.

capacma commented 7 years ago

I agree with the comment by @VincDV: close issues #275 and #335 and continue the discussion in #272.