sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License
210 stars 45 forks source link

Create single table column pair trends property #368

Closed R-Palazzo closed 1 year ago

R-Palazzo commented 1 year ago

Resolve #356.

I made few changes outside of the Issue. Mainly:

npatki commented 1 year ago

@R-Palazzo, @amontanez24

Changed bins definition when discretizing continuous columns to include more values. Bins are only defined on real_data, so discretization produces NaN if synthetic data have different values (higher or lower).

This is the expected behavior:

  1. Generate the bins based on the real data. For example, assume there are 10 bins that are labeled 0, 1, 2, .. 9
  2. Adjust the lowest and highest bin edges to be -/+ infinity. That is, bin 0 should be [-inf, <value>] and bin 9 shoudl be [<value>, +inf]
  3. Now use the bin edges to discretize the synthetic data. There shouldn't be any nans due to the step before.
    • If a synthetic value is below the min of real data, it will be assigned to bin 0
    • If a synthetic value is above the max of real data, it will be assigned to bin 9

@R-Palazzo if there are any other changes you're making outside the issue, please Slack me just in case.

R-Palazzo commented 1 year ago

@R-Palazzo, @amontanez24

Changed bins definition when discretizing continuous columns to include more values. Bins are only defined on real_data, so discretization produces NaN if synthetic data have different values (higher or lower).

This is the expected behavior:

  1. Generate the bins based on the real data. For example, assume there are 10 bins that are labeled 0, 1, 2, .. 9
  2. Adjust the lowest and highest bin edges to be -/+ infinity. That is, bin 0 should be [-inf, <value>] and bin 9 shoudl be [<value>, +inf]
  3. Now use the bin edges to discretize the synthetic data. There shouldn't be any nans due to the step before.

    • If a synthetic value is below the min of real data, it will be assigned to bin 0
    • If a synthetic value is above the max of real data, it will be assigned to bin 9

@R-Palazzo if there are any other changes you're making outside the issue, please Slack me just in case.

Yes thanks for your message @npatki. I didn't specify in my message that I wanted to discuss this and would talk about it during the Eng. meeting. It's fine only considering the real_data, but because we drop the NaN, I thought we would lose some information.

npatki commented 1 year ago

I think with the above expected algorithm, we wouldn't lose any synthetic data values. Those that go above the real min/max would still be assigned to a bin. We can discuss more soon.

R-Palazzo commented 1 year ago

I think with the above expected algorithm, we wouldn't lose any synthetic data values. Those that go above the real min/max would still be assigned to a bin. We can discuss more soon.

Yes I agree, I like this option. I'm coding it ;)

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 87.63% and project coverage change: +0.62 :tada:

Comparison is base (492a42c) 76.13% compared to head (582456b) 76.75%.

:exclamation: Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #368 +/- ## ========================================== + Coverage 76.13% 76.75% +0.62% ========================================== Files 80 81 +1 Lines 3155 3339 +184 ========================================== + Hits 2402 2563 +161 - Misses 753 776 +23 ``` | [Impacted Files](https://app.codecov.io/gh/sdv-dev/SDMetrics/pull/368?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None) | Coverage Δ | | |---|---|---| | [...rts/single\_table/\_properties/column\_pair\_trends.py](https://app.codecov.io/gh/sdv-dev/SDMetrics/pull/368?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None#diff-c2RtZXRyaWNzL3JlcG9ydHMvc2luZ2xlX3RhYmxlL19wcm9wZXJ0aWVzL2NvbHVtbl9wYWlyX3RyZW5kcy5weQ==) | `87.43% <87.43%> (ø)` | | | [...column\_pairs/statistical/correlation\_similarity.py](https://app.codecov.io/gh/sdv-dev/SDMetrics/pull/368?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None#diff-c2RtZXRyaWNzL2NvbHVtbl9wYWlycy9zdGF0aXN0aWNhbC9jb3JyZWxhdGlvbl9zaW1pbGFyaXR5LnB5) | `88.70% <100.00%> (ø)` | | | [...trics/reports/single\_table/\_properties/\_\_init\_\_.py](https://app.codecov.io/gh/sdv-dev/SDMetrics/pull/368?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None#diff-c2RtZXRyaWNzL3JlcG9ydHMvc2luZ2xlX3RhYmxlL19wcm9wZXJ0aWVzL19faW5pdF9fLnB5) | `100.00% <100.00%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.