sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License
212 stars 45 forks

Parent-child detection metrics in the case of multiple parents configuration #290

Open mohamedgy opened 1 year ago

mohamedgy commented 1 year ago

Problem Description

The current version of the parent-child detection metric works well when applied to denormalized data with a linear parent-child relationship scheme. However, we think the evaluation of the denormalized data can be improved for the case where a child table has multiple parents. To illustrate this case, we will use the biodegradability dataset as an example. The bond table has two parent tables (the atom table, referenced twice). The current version of parent-child detection denormalizes the child against each parent table separately, independently of the other parents. That is, the parent-child detection will:

type_bond  type_atom  type_atom2
1          c          h
2          o          n
2          n          o
1          n          c
7          c          c

npatki commented 1 year ago

Hi @mohamedgy, thanks for filing this feature request. It definitely seems like we could refine the parent-child detection metrics a bit more. A few of my own thoughts:

Scope: I believe the current metric(s) were only scoped for a single parent-child relationship due to potential performance and accuracy issues. The biodegradability dataset is small, but I imagine that if both parents had hundreds of columns, the resulting denormalized table would have many columns -- which isn't always the best for computation or predictive accuracy.

Schema: It seems like the multi-parent scenario is not the only one that may run into this problem. If you have a schema of greater depth, such as grandparent --> parent --> child, then denormalizing all 3 tables may also provide some unique insights (correlations between grandparent and child). But this goes back to the scoping problem above.

Let's keep this issue open as we figure out how best to support these tradeoffs. It may involve new metrics, or parameters where users can control the denormalization.

Workaround

At least for now, there is a workaround where you can denormalize the tables yourself before applying the detection metrics.
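
A minimal sketch of that workaround for the two-parent bond/atom case, using pandas. The table contents, the key column names (bond_id, atom_id, atom_id2) and the helper denormalize_bond are assumptions made for illustration; they are not taken from the real dataset or from SDMetrics.

```python
import pandas as pd

# Toy stand-ins for the biodegradability tables. Column names and values
# are assumptions for illustration only.
atom = pd.DataFrame({
    "atom_id": ["a1", "a2", "a3", "a4"],
    "type": ["c", "h", "o", "n"],
})
bond = pd.DataFrame({
    "bond_id": ["b1", "b2", "b3", "b4", "b5"],
    "atom_id": ["a1", "a3", "a4", "a4", "a1"],
    "atom_id2": ["a2", "a4", "a3", "a1", "a1"],
    "type": [1, 2, 2, 1, 7],
})


def denormalize_bond(bond: pd.DataFrame, atom: pd.DataFrame) -> pd.DataFrame:
    """Join the child table to the same parent once per foreign key (hypothetical helper)."""
    # First parent reference: bond.atom_id -> atom.atom_id
    first = atom.rename(columns={"type": "type_atom"})
    # Second parent reference: bond.atom_id2 -> atom.atom_id
    second = atom.rename(columns={"atom_id": "atom_id2", "type": "type_atom2"})
    flat = (
        bond.rename(columns={"type": "type_bond"})
        .merge(first, on="atom_id", how="left")
        .merge(second, on="atom_id2", how="left")
    )
    # Drop identifier columns so a detection classifier cannot key on them.
    return flat.drop(columns=["bond_id", "atom_id", "atom_id2"])


real_flat = denormalize_bond(bond, atom)
print(real_flat)
```

The same helper would be applied to the synthetic bond and atom tables, and the two flat tables could then be passed to one of the single-table detection metrics (for example LogisticDetection in sdmetrics.single_table). Please check the SDMetrics docs for the exact arguments expected by your version.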