Closed R-Palazzo closed 1 year ago
@R-Palazzo, @amontanez24
Changed bins definition when discretizing continuous columns to include more values. Bins are only defined on real_data, so discretization produces NaN if synthetic data have different values (higher or lower).
This is the expected behavior:
- Generate the bins based on the real data. For example, assume there are 10 bins that are labeled `0, 1, 2, .. 9`
- Adjust the lowest and highest bin edges to be -/+ infinity. That is, bin `0` should be `[-inf, <value>]` and bin `9` should be `[<value>, +inf]`
- Now use the bin edges to discretize the synthetic data. There shouldn't be any NaNs due to the step before.
  - If a synthetic value is below the min of real data, it will be assigned to bin `0`
  - If a synthetic value is above the max of real data, it will be assigned to bin `9`
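
A rough sketch of the steps above, for illustration only. This is not the code in this PR: the function name `discretize_with_open_edges`, the equal-width binning via `np.histogram_bin_edges`, and the use of `pd.cut` are all assumptions.

```python
import numpy as np
import pandas as pd


def discretize_with_open_edges(real, synthetic, num_bins=10):
    """Bin both columns using edges derived from the real data only.

    Hypothetical helper, not part of the library.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)

    # Step 1: compute equal-width bin edges from the real data (NaNs ignored).
    edges = np.histogram_bin_edges(real[~np.isnan(real)], bins=num_bins)

    # Step 2: open up the outermost edges so out-of-range values still land in a bin.
    edges[0], edges[-1] = -np.inf, np.inf

    # Step 3: discretize; labels=False returns the integer bin index (0 .. num_bins - 1).
    real_bins = pd.cut(real, bins=edges, labels=False)
    synthetic_bins = pd.cut(synthetic, bins=edges, labels=False)
    return real_bins, synthetic_bins


# Synthetic values below the real min and above the real max still get a bin.
real = np.linspace(0.0, 10.0, 101)
synthetic = np.array([-5.0, 0.5, 4.5, 9.5, 50.0])
_, synthetic_bins = discretize_with_open_edges(real, synthetic)
print(synthetic_bins)
```

With the outer edges left at the real min/max instead, `pd.cut` would return NaN for `-5.0` and `50.0`, which is the behavior this change is meant to avoid. Values that are already NaN in the input remain NaN.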
@R-Palazzo if there are any other changes you're making outside the issue, please Slack me just in case.
Yes, thanks for your message @npatki. I didn't specify in my message that I wanted to discuss this and would bring it up during the Eng. meeting. Considering only the real_data is fine, but because we drop the NaN values, I thought we would lose some information.
I think with the above expected algorithm, we wouldn't lose any synthetic data values. Those that fall below the real min or above the real max would still be assigned to a bin. We can discuss more soon.
Yes I agree, I like this option. I'm coding it ;)
Patch coverage: 87.63% and project coverage change: +0.62 :tada:
Comparison is base (492a42c) 76.13% compared to head (582456b) 76.75%.
Resolve #356.
I made a few changes outside of the issue. Mainly:
`ColumnPairTrend` metrics, not over all the dataset.