Problem Description
As a developer, I would find it easier to make changes to the QualityReport if our code abstractions matched our logical and user-facing abstractions. Today it is hard to add new properties, modify aggregation in existing ones, or handle errors, because the results are collected in a way that doesn't match the desired output.
SDMetrics has a concept of properties. A property is a collection of similar metrics that tells us about one aspect of the synthetic data (e.g., column pair trends). The problem is that our code doesn't have this concept. Instead, the metrics are all collected and then converted into properties for the user-facing output. This makes metric collection inefficient (the same columns/tables are looped over multiple times) and makes the code harder to read.
To solve these problems we propose adding a new module called _properties to the reports/single_table folder and creating a BaseSingleTableProperty class.
Expected behavior
Attributes
metrics: A list of metrics that make up the property. (This may be unnecessary)
_details: A dataframe containing the details of each score and column/table involved. This will be used to compute averages, create graphs and return the details at a higher level.
Abstract methods
get_score(real_data, synthetic_data, metadata, progress_bar) - Returns a float that is the average score of all the individual metric scores computed.
get_visualization() - Returns a plotly.graph_objects._figure.Figure object.
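A minimal sketch of what the base class described above could look like, assuming it is built on Python's abc module. Only the class name, the two attributes and the two method signatures come from this proposal; the empty defaults and the docstrings are assumptions for illustration.

```python
from abc import ABC, abstractmethod

import pandas as pd


class BaseSingleTableProperty(ABC):
    """Sketch of the proposed base class for single table properties."""

    # Metrics that make up the property (the proposal notes this may be unnecessary).
    metrics = []

    def __init__(self):
        # Assumption: start with an empty dataframe; each subclass decides
        # which columns it stores when it computes its scores.
        self._details = pd.DataFrame()

    @abstractmethod
    def get_score(self, real_data, synthetic_data, metadata, progress_bar):
        """Return the average of all individual metric scores as a float."""

    @abstractmethod
    def get_visualization(self):
        """Return a plotly figure for this property."""
```

Because both methods are abstract, the base class cannot be instantiated directly; each concrete property would have to implement both.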
Additional context
Put this base class in its own file. Make sure to name the module with an underscore since it is not intended to be public.
The proposal is to store the details as a dataframe. Currently, all of this information is stored on the QualityReport class in a dict called _metric_results. The benefit of storing it as a dataframe is that this is how it is returned to the user in QualityReport.get_details. The issue is that all of the utility functions for plotting the metrics are designed to take in the dict. We might want to investigate whether it is worth changing the data structure we use to return the results, and ultimately whether we should update these plot functions.
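To make the trade-off concrete, here is a rough sketch of flattening a dict of results into a details dataframe. The layout of the dict, the column names and the to_details helper are all hypothetical; the real structure of _metric_results is not described in this issue.

```python
import pandas as pd

# Hypothetical layout: metric name -> list of per-column results.
# This is NOT the actual structure of QualityReport._metric_results.
metric_results = {
    'KSComplement': [
        {'column': 'age', 'score': 0.9},
        {'column': 'income', 'score': 0.7},
    ],
    'TVComplement': [
        {'column': 'gender', 'score': 0.8},
    ],
}


def to_details(results):
    """Flatten the nested dict into one row per (column, metric) score."""
    rows = [
        {'Column': item['column'], 'Metric': metric, 'Score': item['score']}
        for metric, items in results.items()
        for item in items
    ]
    return pd.DataFrame(rows, columns=['Column', 'Metric', 'Score'])


details = to_details(metric_results)
# With a dataframe, the property score is a single aggregation.
average_score = details['Score'].mean()
```

With the details already in this form, returning them to the user and averaging them need no extra conversion, while the existing plot utilities would still need the dict or an update.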
We will pass the progress bar created in the reports down to the get_score method so that the property can update it as scores are computed.
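A small sketch of how a progress bar passed in from the report might be advanced while scores are computed. CountingProgressBar and score_columns are hypothetical names used only for illustration; the stand-in bar just exposes an update() method, similar in spirit to tqdm's.

```python
class CountingProgressBar:
    """Hypothetical stand-in for the report's progress bar."""

    def __init__(self, total):
        self.total = total
        self.n = 0

    def update(self, n=1):
        # Advance the bar by n steps.
        self.n += n


def score_columns(columns, score_fn, progress_bar=None):
    """Score each column, advancing the progress bar once per column."""
    scores = []
    for column in columns:
        scores.append(score_fn(column))
        if progress_bar is not None:
            progress_bar.update()

    # Average of the individual scores; NaN when there is nothing to score.
    return sum(scores) / len(scores) if scores else float('nan')


progress_bar = CountingProgressBar(total=3)
average = score_columns(['a', 'b', 'c'], lambda column: 1.0, progress_bar)
```

Keeping the bar optional in the helper is a design choice of this sketch, so a property could also be scored on its own outside a report.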