Parent-child detection metrics in the case of multiple parents configuration

Problem Description

The current version of the parent-child detection metric works well when applied to denormalized data with a linear parent-child relationship scheme. However, we think the evaluation of the denormalized data can be improved when a child table has multiple parents. To illustrate this case, we will use the biodegradability dataset as an example. The bond table has two parent tables (the atom table, referenced twice through two foreign keys). The current version of parent-child detection repeats the denormalization for each parent table separately, independently of the other parent. That is, the parent-child detection will:

1. Denormalize the bond table using the atom_id foreign key as the join field, then compute the first detection metric score (s1) on the resulting table.
2. Denormalize the original bond table using only the second foreign key, atom_id2, then compute the second detection metric score (s2) on the resulting table.
3. Compute the mean of s1 and s2.

This method successfully evaluates the relationship between each parent table and the child table separately, but we can identify two drawbacks:

1. At each iteration, the evaluated table still includes the foreign key of the other parent table.
2. The method doesn't take into account the indirect relationship that the two parent tables may have via the child table.

For this reason, we think that denormalizing the parent and child tables into a single table is more relevant. For example, for the database above, the denormalized table would be constructed in a single step, giving the following table, which also makes it possible to evaluate the indirect relationship between the parents:

type_bond	type_atom	type_atom2
1	c	h
2	o	n
2	n	o
1	n	c
7	c	c
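
To make the difference concrete, here is a minimal pandas sketch of the two denormalization strategies. The atom ids, the toy rows, and the helper name `join_parent` are made up for illustration; this is not how SDMetrics implements the metric internally, only a picture of the tables each approach would score.

```python
import pandas as pd

# Toy stand-ins for the atom and bond tables (ids and rows are illustrative).
atom = pd.DataFrame({
    "atom_id": ["a1", "a2", "a3", "a4"],
    "type": ["c", "h", "o", "n"],
})
bond = pd.DataFrame({
    "atom_id": ["a1", "a3", "a4", "a4", "a1"],
    "atom_id2": ["a2", "a4", "a3", "a1", "a1"],
    "type_bond": [1, 2, 2, 1, 7],
})


def join_parent(child, parent, fk, suffix):
    """Left-join one copy of the parent onto the child through foreign key `fk`."""
    renamed = parent.rename(columns={"atom_id": fk, "type": f"type_{suffix}"})
    return child.merge(renamed, on=fk, how="left")


# Current approach: one denormalized table per foreign key, scored separately.
# Note that each table still carries the other foreign key as a column.
per_parent_1 = join_parent(bond, atom, "atom_id", "atom")
per_parent_2 = join_parent(bond, atom, "atom_id2", "atom2")

# Proposed approach: join both parents in one pass, then drop the key columns.
single_step = join_parent(per_parent_1, atom, "atom_id2", "atom2").drop(
    columns=["atom_id", "atom_id2"]
)
print(single_step)
```

With these toy rows, `single_step` has the columns type_bond, type_atom and type_atom2, matching the table above, while `per_parent_1` still contains atom_id2.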

Hi @mohamedgy, thanks for filing this feature request. Definitely seems like we could refine the parent-child detection metrics a bit more. A few of my own thoughts:

Scope: I believe the current metric(s) were only scoped for a single parent-child relationship due to potential performance and accuracy issues. The biodegradability dataset is small, but I imagine that if both parents had hundreds of columns, the resulting denormalized table would have many columns, which isn't always the best for computation or predictive accuracy.

Schema: Seems like it's not just a multi-parent scenario that may run into this problem. If you have a schema of higher depth, such as grandparent --> parent --> child, then denormalizing all 3 tables may also provide some unique insights (e.g. correlations between grandparent and child). But this goes back to the scoping problem above.

Let's keep this issue open as we figure out how best to support these tradeoffs. It may involve new metrics, or parameters where users can control the denormalization.

Workaround

At least for now, there is a workaround where you can denormalize the tables yourself before applying the detection metrics.
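
As a rough sketch of that workaround, the helper below joins every parent onto the child table in one pass with pandas. The function name `denormalize`, its `parents` argument layout, and the toy tables are assumptions made for this example, not part of SDMetrics.

```python
import pandas as pd


def denormalize(child, parents, drop_keys=True):
    """Join every parent onto the child table in a single pass.

    `parents` is a list of (parent_df, parent_key, child_fk, suffix) tuples
    (a hypothetical layout chosen for this sketch). Non-key parent columns
    get `suffix` appended so repeated parents do not collide.
    """
    flat = child.copy()
    fks = []
    for parent, parent_key, child_fk, suffix in parents:
        renamed = parent.rename(
            columns={c: f"{c}{suffix}" for c in parent.columns if c != parent_key}
        ).rename(columns={parent_key: child_fk})
        flat = flat.merge(renamed, on=child_fk, how="left")
        fks.append(child_fk)
    # Optionally drop the foreign keys so they are not scored as features.
    return flat.drop(columns=list(dict.fromkeys(fks))) if drop_keys else flat


# Toy real tables (illustrative values only).
real_atom = pd.DataFrame({"atom_id": [1, 2, 3], "type": ["c", "h", "o"]})
real_bond = pd.DataFrame(
    {"atom_id": [1, 3], "atom_id2": [2, 1], "type_bond": [1, 2]}
)

real_flat = denormalize(
    real_bond,
    [
        (real_atom, "atom_id", "atom_id", "_atom"),
        (real_atom, "atom_id", "atom_id2", "_atom2"),
    ],
)
print(real_flat)
```

The same call would be applied to the synthetic atom and bond tables, and the two flat tables could then be passed to a single-table detection metric (for example `LogisticDetection` from `sdmetrics.single_table`, assuming the installed version accepts plain DataFrames for the real and synthetic data).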

sdv-dev / SDMetrics

Parent-child detection metrics in the case of multiple parents configuration #290
