Some times it is necessary to denormalise many to many relationships. This is especially relevant for cases of micro data that are aggregated into multidimensional data. At the moment it is not possible (or way too difficult) to achieve this with VTL.
For instance, consider AnaCredit, that has information on Loans and Legal persons (entities), having a many to many relationship. Therefore, there are three datasets as follows:
Instrument:
instrumentId
nominalAmount
1
200
2
300
3
100
Entities
entityId
sector
1
S11
2
S2
3
S121
InstrumentEntities
instrumentId
entityId
role
liabilityAmount
1
3
cred
1
1
deb
2
3
cred
2
2
cred
2
1
deb
3
3
cred
3
1
deb
50
3
2
deb
50
If we want to calculate an aggregate by sectors, for instance, we would need to join the data to get a table like the following:
instrumentId
creditorId
debtorId
nominalAmountInstrument
sectorCreditor
sectorDebtor
liabilityAmount
1
3
1
200
S121
S11
2
3
1
300
S121
S11
2
2
1
300
S2
S11
3
3
1
100
S121
S11
50
3
3
2
100
S121
S2
50
And, to get there, we need the following steps:
Create two datasets, CRED and DEB, by subspacing the relationships table by its role component and renaming. CRED has as identifiers instrumentId and creditorId, while DEB has as identifiers InstrumentId and debtorId.
Finally, CRED and DEB datasets should be joined. But that kind of join is currently not accepted, because it is not the case that the identifiers of one dataset are a subset of the other
Proposed Solution
We assume that the reason for not allowing a join when the identifiers of one dataset are not a subset of the identifiers of the other is because this could imply null values for identifiers (in the example above, because a loan does not have a debtor).
It should be noted that with this type of join we would never get duplicate combinations of identifiers.
Then, a possible solution would be the possibility to add a new option for joins that allows joins when the identifiers of a dataset are not a subset of a dataset of the other, but makes it mandatory to use an nvl clause in the using for the non common identifiers.
The example above would be written:
CRED := EntityInstrument
[sub role = 'cred']
[rename entityId to creditorId];
DEB := EntityInstrument
[sub role = 'deb']
[rename entityId to debtorId];
result <-
inner_join(
CRED,
DEB
using InstrumentId, nvl(creditorId, "N/A"), nvl(debtorId, "N/A")
);
Issue Description
Some times it is necessary to denormalise many to many relationships. This is especially relevant for cases of micro data that are aggregated into multidimensional data. At the moment it is not possible (or way too difficult) to achieve this with VTL.
For instance, consider AnaCredit, that has information on Loans and Legal persons (entities), having a many to many relationship. Therefore, there are three datasets as follows:
And, to get there, we need the following steps:
Proposed Solution
We assume that the reason for not allowing a join when the identifiers of one dataset are not a subset of the identifiers of the other is because this could imply null values for identifiers (in the example above, because a loan does not have a debtor). It should be noted that with this type of join we would never get duplicate combinations of identifiers.
Then, a possible solution would be the possibility to add a new option for joins that allows joins when the identifiers of a dataset are not a subset of a dataset of the other, but makes it mandatory to use an nvl clause in the using for the non common identifiers.
The example above would be written: