Closed csala closed 2 years ago
When scoring the quality of a transformer on a dataset, we have been using the coefficient of determination for predicting each of the numeric columns in that dataset. I am unsure of how to compile this into a table like the one described above, because there are multiple scores for each dataset. Averaging them doesn't really make sense, since many of them might be close to 0 or even negative. Taking the max also doesn't make sense since one transformer might be better at predicting column A while another transformer of the same data type might be better at predicting column B and those scores might not be that close.
It is also worth noting that we tried predicting all the numeric columns together to see if that would yield one score per dataset, but it ended up just yielding bad scores for everything.
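To make the aggregation problem concrete, here is a minimal sketch with made-up numbers (not results from any real dataset). It assumes one coefficient-of-determination score per numeric column; the column names and predictions are hypothetical, and `r2_score` is a small local helper rather than the function actually used in the quality tests.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical target values, shared by three toy "columns" for simplicity.
y_true = [1.0, 2.0, 3.0, 4.0]

# One score per numeric column, as in the per-column setup described above.
scores = {
    'col_a': r2_score(y_true, [1.0, 2.0, 3.0, 4.0]),  # perfect prediction
    'col_b': r2_score(y_true, [2.5, 2.5, 2.5, 2.5]),  # always predicts the mean
    'col_c': r2_score(y_true, [4.0, 3.0, 2.0, 1.0]),  # worse than the mean
}

# A single negative score drags the average below zero,
# hiding the column that was predicted perfectly.
average = sum(scores.values()) / len(scores)

print(scores)
print(average)
```

Because the score has no lower bound, one badly predicted column can dominate the mean, which is why a plain average per dataset is hard to interpret here.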
A function should be implemented to automatically validate the data quality of any new Transformer by running the quality tests mentioned in #252 and reporting the results.
Function Name and Module
The function should be implemented inside `tests/contributing.py` and should be called `validate_transformer_quality`.
Inputs
The function should accept a single input:
`transformer` (class or str): Transformer class or full Python name of the class (e.g. `DatetimeTransformer` or `"rdt.transformers.time.DatetimeTransformer"`)
Outputs
The function should return a `pandas.DataFrame` that contains information about the results obtained by the Transformer. The dataframe contains one row per dataset, and the following columns (column names here):
Output DataFrame Example
Behavior
This function runs all the quality tests using the Transformer on all the real-world datasets that contain the Transformer's data type. It produces a report based on how well the correlations are preserved and how well a synthetic data generator (a `copulas.GaussianMultivariate`?) performs when trained on the data produced by this Transformer, and it compares these results to the quality of the other transformers of the same Data Type.
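Since the input may be either a class or a full Python name, the function would presumably first normalise it to a class before running any tests. The following is only a sketch of that step using the standard library; the helper name `resolve_transformer` is hypothetical and not part of RDT, and the example resolves a standard-library class so that it runs without RDT installed.

```python
import importlib
import inspect


def resolve_transformer(transformer):
    """Return a class, given either the class itself or its full dotted name.

    Hypothetical helper: illustrates the "class or str" input only.
    """
    if inspect.isclass(transformer):
        return transformer

    if not isinstance(transformer, str) or '.' not in transformer:
        raise ValueError(
            'Expected a class or a full Python name like "package.module.ClassName"'
        )

    # Split "package.module.ClassName" into module path and class name.
    module_name, _, class_name = transformer.rpartition('.')
    module = importlib.import_module(module_name)
    resolved = getattr(module, class_name)

    if not inspect.isclass(resolved):
        raise TypeError(f'{transformer!r} does not refer to a class')

    return resolved


# Both call styles resolve to a class object.
print(resolve_transformer('fractions.Fraction'))
print(resolve_transformer(dict))
```

With RDT installed, the same call would be made with `"rdt.transformers.time.DatetimeTransformer"` or with the class itself.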
Prints to console
The function prints the following information in the console:
Usage Example