Division operator error

amattioc commented 7 months ago

Reference Manual

As per the RM, the division should always return a number (3183). Is this correct?

Anyway, in the examples, divisions of two integers seem to return integers (and the result is placed in the same integer measure column: example 1, Me_1 and example 2, Me_1).

antonio-olleros commented 7 months ago

I think that the examples may be misleading because the divisions presented could yield an integer. I think it would be clearer if at least one result had decimals.

In my opinion, the result always has to be a number, and it should be placed in the same column (it is not an operator with changing data type behaviour).

amattioc commented 7 months ago

Hi Antonio,

I agree that the division should always return a number and, if so, when it divides two integers the result cannot be placed in the same column (as examples 1 and 2 seem to suggest). Me_1 is an integer in both ds1 and ds2, but the result of ds1/ds2 transforms Me_1 into a number, and for this reason it should not be placed in the same column (or the data type of the column should be changed). Do you agree?
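
The point can be illustrated with a minimal Python sketch. This is not VTL and is not taken from the RM: a dataset is modelled as a hypothetical dict of measure name to (data type, value), and `divide` is an illustrative helper. It assumes, as discussed above, that division always yields a number.

```python
# Hypothetical model: each measure is a (declared_type, value) pair.
def divide(ds_left, ds_right):
    """Divide same-named measures; the result type is always 'number'."""
    result = {}
    for name, (_, left_value) in ds_left.items():
        _, right_value = ds_right[name]
        # Python's / always produces a float, mirroring the assumed rule
        # that division returns a number even for two integers.
        result[name] = ("number", left_value / right_value)
    return result

ds1 = {"Me_1": ("integer", 7)}
ds2 = {"Me_1": ("integer", 2)}
ds_r = divide(ds1, ds2)
print(ds_r)  # {'Me_1': ('number', 3.5)}
```

Me_1 enters as an integer and leaves as a number, which is exactly the type change that the same-column placement in the examples hides.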

antonio-olleros commented 7 months ago

Well, I agree in theory, but I think we don't want that!! Note that this would imply that divisions could only be performed on monomeasure datasets, as always happens when we have a change in data type (because if we change the data type, we have to change the variable (to num_var, for instance), and we can do that only for one variable).

But it is true that, at the moment, the model says that a variable name can only have one data type. So I think we have two options:

We say that the data type includes its supertypes, so in this case it would be possible to change from integer to number.
We drop that constraint from the model, so that a variable (name) can have different data types.

I strongly favour the second one, as I said during the Madrid meeting. I think that this is a very strong constraint, for abstract theoretical reasons, with:

Bad practical consequences (e.g., operators changing the data types can only take monomeasure datasets as input).
Very difficult to enforce in practice (is anybody checking that, if I calc a component in a dataset, the same name is never used with a different type in other datasets? I guess not, and I would not want to do it; that would be a serious limitation for users).
No real benefit (I cannot find any example).
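
The difference between the current constraint and the first option can be sketched in Python. This is only an illustration under stated assumptions: the one-entry `SUPERTYPE` table, the `registry` dict and both function names are hypothetical and do not come from the VTL model.

```python
# Assumed, simplified hierarchy: integer is a subtype of number.
SUPERTYPE = {"integer": "number"}

def is_same_or_supertype(declared, candidate):
    """True if candidate equals declared or is one of its supertypes."""
    current = declared
    while current is not None:
        if current == candidate:
            return True
        current = SUPERTYPE.get(current)
    return False

def check_variable(registry, name, new_type, allow_supertypes):
    """Return True if `name` may carry `new_type` given its earlier use."""
    declared = registry.get(name)
    if declared is None:
        return True  # first use of the name, nothing to conflict with
    if allow_supertypes:
        return is_same_or_supertype(declared, new_type)
    return declared == new_type

registry = {"Me_1": "integer"}
strict = check_variable(registry, "Me_1", "number", allow_supertypes=False)
relaxed = check_variable(registry, "Me_1", "number", allow_supertypes=True)
print(strict, relaxed)  # False True
```

Under the second option there would be no check at all, so any reuse of a name with a different data type would be accepted.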

egreising commented 7 months ago

Dear Antonio and All, I think you are right from a pragmatic point of view, but playing the Devil's advocate, what if you change the data type from double to date? That would mean a change in the semantics of the variable, and this is very bad from the data modelling perspective. I think the user should be aware of this fact and, if you are supposed to do an operation that requires a double to store the result, then the data type should be double from the beginning, not integer. If somebody defined this measure as integer, it was not foreseen to store decimals. On the other hand, a number with decimals should not give an error when stored in an integer. It should truncate or round the number to an integer. ROUND and TRUNCATE should be parameters of the operation, with one of them chosen as a default.
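
A minimal Python sketch of that last suggestion, purely illustrative: `store_as_integer` and its `mode` parameter are hypothetical names, and picking "round" as the default is an arbitrary assumption, since the comment does not say which one should be the default.

```python
import math

def store_as_integer(value, mode="round"):
    """Convert a number to an integer using the chosen mode."""
    if mode == "round":
        # Python's built-in round() rounds halves to the nearest even integer.
        return round(value)
    if mode == "truncate":
        # math.trunc drops the fractional part, moving toward zero.
        return math.trunc(value)
    raise ValueError(f"unknown mode: {mode!r}")

print(store_as_integer(3.7))               # 4
print(store_as_integer(3.7, "truncate"))   # 3
print(store_as_integer(-3.7, "truncate"))  # -3
```

The exact rounding rule for halves would also need to be specified; the behaviour shown here is simply what Python's `round` does.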

antonio-olleros commented 7 months ago

Hi Edgardo, I think we should not mix the dictionaries/modelling part (to be left to SDMX or others) and the VTL. I completely understand, and agree with, the use case for dictionaries and data modelling, and I agree that it would be bad from a data modelling perspective to have something like that.

But the issue here is different, because VTL is not meant to serve for modelling or as a data dictionary, but to validate and transform datasets.

For instance, the VTL has left out of its scope (thank God!) the management of Agencies/Owners. Now, it is perfectly natural to use two datasets coming from two different agencies in one Transformation Scheme (e.g., a BIS dataset in dollars, and ECB's EXR dataset to convert to EUR). How can we ensure that the modellers at BIS and ECB have been consistent with each other and have not used the same name for different things? We can't. And if we take 100% seriously the statement for VTL that one variable can only take one data type, then if both BIS and ECB have used the variable VAR_EXAMPLE, one as integer and the other as string, we would not be able to do anything with VTL for these two datasets together...

Another thing: most of the SDMX implementations I have seen in my life do not provide a representation at Concept level, which in practice means that you can have any different representation/data type for the same variable in different Data Flows... So, if we implement that seriously in VTL (please, no!), I think we would even be going against SDMX!

amattioc commented 7 months ago

Hi Antonio, I agree that VTL should not be involved in modelling, but we always have to keep an eye on the scenarios where it will be used. SDMX is not the only one, but it is an important one. The correct balance between robustness and flexibility is key for the success of this technology in the real world.

I was wondering if, for cases like the division, we could imagine having different constraints for persistent and non-persistent assignments. I can imagine that only the persistent ones would have possible conflicts with the "modelling world" (e.g. in SDMX with the fromVTL mappings). Non-persistent assignments are probably related to the internals of a transformation and could have a more relaxed type management (e.g. including supertypes, treating codelists as strings if needed).
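
As a rough sketch of what such a split could look like, here is an illustrative Python fragment. Everything in it is an assumption made for the example: the `assign` helper, the `TypeConflict` exception, the one-entry `SUPERTYPE` table and the rule that only a direct supertype is tolerated for non-persistent results.

```python
# Assumed, simplified hierarchy: integer is a subtype of number.
SUPERTYPE = {"integer": "number"}

class TypeConflict(Exception):
    """Raised when an assignment would change a variable's declared type."""

def assign(dictionary, name, new_type, persistent):
    """Return the type accepted for `name`, or raise TypeConflict."""
    declared = dictionary.get(name)
    if declared is None or declared == new_type:
        return new_type
    # Relaxed rule, applied only to intermediate (non-persistent) results:
    # widening to the direct supertype is tolerated.
    if not persistent and SUPERTYPE.get(declared) == new_type:
        return new_type
    raise TypeConflict(f"{name}: declared {declared}, got {new_type}")

dictionary = {"Me_1": "integer"}
print(assign(dictionary, "Me_1", "number", persistent=False))  # number
try:
    assign(dictionary, "Me_1", "number", persistent=True)
except TypeConflict as err:
    print("rejected:", err)
```

With this split, a division inside a transformation would pass, while persisting its result under the original integer variable would be refused.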

antonio-olleros commented 6 months ago

Hi!

I think that would be difficult to implement and understand. Also, you may want to create a TS that takes your dataset as input but generates a dataset from another institution, so the same name can be used with different meanings in input and output.

I think that the constraint of the names has no utility outside the modelling part, which I think should be addressed outside VTL. If you agree with that, then I think it is clear that we should drop it, because it only creates problems and does not add any advantage!

egreising commented 6 months ago

Hi All! I have some remarks on the past two or three comments.

I am not against mixing data from different agencies, or different data sources in general. It is up to the users to know the data they are using and to keep things coherent. This is not a "naming" issue, because having the same name for different concepts is as bad as having different names for the same concept, and both cases are totally outside the scope of VTL, and are independent of the type of data you are dealing with, not exclusive to SDMX.
That said, if you have two variables in two different data sources with the same name, you will have to differentiate them (disambiguate), either by changing the "internal" names in your script, or by qualifying them with the agency id, which would be the normal way in SDMX, since the ID of a concept in SDMX is the triplet Agency+Name+Version. But the ability to mix data sources is not under discussion, IMHO.
We should differentiate "dealing with the information" from "breaking the information integrity", and this is something that does depend on the type of data you are managing. In SDMX the compliance with the IM constraints is very strict, and it prevents any change in the artefacts' structural metadata. Any modification means creating a new artefact by modifying its version number, which is part of the ID. So, if you are going to change the data type of an input concept, the output artefact must be a new one, with a different Agency, name and/or version. Otherwise, the SDMX Registry should not allow persisting the output dataset.
I like Attilio's proposal about differentiating persistent and non persistent assignments, allowing any "manipulation" internally, but restricting what you are able to persist. Nevertheless, as I mentioned above, if the output is SDMX, the Registry should put the limits, anyway.
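
The qualification by Agency+Name+Version mentioned above can be sketched in Python as follows. The `ConceptId` class and the version strings are hypothetical and only illustrate the idea of a triplet identifier; they are not SDMX API objects. VAR_EXAMPLE, BIS and ECB are taken from the earlier comment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConceptId:
    """Illustrative triplet identifier; frozen so it can be used as a dict key."""
    agency: str
    name: str
    version: str

# The same name from two agencies yields two distinct identifiers,
# so each can carry its own data type without conflict.
types = {
    ConceptId("BIS", "VAR_EXAMPLE", "1.0"): "integer",
    ConceptId("ECB", "VAR_EXAMPLE", "1.0"): "string",
}
print(len(types))  # 2

# Changing the data type would mean a new artefact, e.g. a new version.
old = ConceptId("BIS", "VAR_EXAMPLE", "1.0")
new = ConceptId("BIS", "VAR_EXAMPLE", "1.1")
print(old == new)  # False
```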

linardian commented 6 months ago

Good morning. This is a very interesting discussion, and I think it is worth putting the general topic on the agenda of the next meeting in Salamanca. Very briefly, this is my opinion, shared also with Attilio:

VTL is going toward an "enterprise" use and not only as a "calculator"; this is very good news (see also ECB's adoption for the CDM project) and that is why I have reserved a session of the next meeting for this aspect of VTL implementation;
consequently the "border" of a VTL transformation (input and output persistent cubes) has to be considered "embedded" in an enterprise statistical world (SDMX-based, DDI-based, Matrix-based) with an active statistical dictionary;
the distinction between "non-persistent" and "persistent" assignments makes a very big difference; so some assumptions may be "relaxed" for intermediary results, but have to (must) be enforced for the final outputs;
the type-changing operators, in my humble opinion, should have a "different facet" in some cases with respect to persistency; this is especially true for multi-measure cubes (see also previous discussions about the check operator);
for the moment this complete revision cannot be done for the 2.1 release, so it will surely be a very important topic to be analysed and consolidated for the next release(s);
in the meantime, my proposal is to change the examples for the division operator, using just mono-measure cubes for the first two and a multi-measure cube for the third example; attached is my proposal for the new 2.1 version of the RM.

Sorry for the long comment, but I hope it will be of some help.

Division examples.docx

amattioc commented 6 months ago

I opened a new discussion (#409) to track this. In the meantime, for v2.1, I fixed the examples according to the current behaviour. This error seems to be related to most arithmetic operators that modify the type of the result with respect to the input (#406, #407).

sdmx-twg / vtl

Division operator error #404