[Data] Utility Increase Skewness

christianadriano commented 4 years ago

The plot [1] shows that our data has only positive skewness; in other words, the mean and the median are larger than the mode (the most frequent value) [2].
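To make the mean/median/mode relationship concrete, here is a minimal sketch that computes the moment-based skewness coefficient for a small right-tailed sample. The `utility_increase` values and the helper name `sample_skewness` are made up for illustration; they are not taken from the notebook in [1].

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (biased, population moments).

    Assumes the values are not all identical (otherwise m2 is zero).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical utility-increase values with a long right tail.
utility_increase = [1, 1, 1, 2, 2, 3, 4, 6, 10, 20]

skew = sample_skewness(utility_increase)
mode = statistics.mode(utility_increase)
median = statistics.median(utility_increase)
mean = statistics.mean(utility_increase)

print(f"skewness={skew:.3f} mode={mode} median={median} mean={mean}")
```

For this sample the skewness is positive and the mode lies below the median, which lies below the mean, matching the picture in [2].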

@brrrachel, here are a few questions for reflection:

1. What are the implications for the risk of overshooting or undershooting the estimates of the utility increase value? For instance, is data skewness a larger threat of adding bias or of adding variance to our algorithms?
2. Should we also have negatively skewed data?
3. Or neither: should we work with data that has very little skewness (a value close to zero)?

[1] https://github.com/hpi-sam/rl-4-self-repair/blob/master/data_analysis.ipynb
[2] https://en.wikipedia.org/wiki/File:Relationship_between_mean_and_median_under_different_skewness.png

brrrachel commented 4 years ago

I would assume that working with positively skewed data increases the probability of drawing a value that is lower than the average.
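A quick way to check this assumption is to simulate it. The sketch below uses a log-normal sample as a stand-in for positively skewed data; the distribution and its parameters are my assumption, not our actual utility data, and the result holds for typical right-skewed distributions rather than for every possible one.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Assumption: a log-normal sample stands in for positively skewed data.
sample = [random.lognormvariate(0.0, 1.0) for _ in range(10_000)]

mean = sum(sample) / len(sample)

# Share of draws that fall below the sample mean.
fraction_below_mean = sum(v < mean for v in sample) / len(sample)

print(f"fraction of values below the mean: {fraction_below_mean:.2f}")
```

With a symmetric distribution this fraction would be close to 0.5; here it comes out clearly above one half, because the long right tail pulls the mean above most of the mass.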

As already mentioned in the lecture, I did some research on possible approaches for transforming the positively skewed data:

Cube root transformation (converting x to x^(1/3))
Square root transformation (converting x to x^(1/2); only for non-negative values)
Logarithm transformation (log of x to base 10, base e, or base 2; only for positive values)
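The three transformations above can be sketched as follows. This is only an illustration: the function names, the `sample_skewness` helper, and the `utility_increase` values are hypothetical and are not the ones used in the DataHandler.

```python
import math


def cube_root(x):
    # Defined for negative values too: take the root of the magnitude, keep the sign.
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def square_root(x):
    if x < 0:
        raise ValueError("square root transformation needs non-negative values")
    return math.sqrt(x)


def log_transform(x, base=math.e):
    if x <= 0:
        raise ValueError("logarithm transformation needs positive values")
    return math.log(x, base)


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (biased, population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical, strictly positive utility-increase values with a long right tail.
utility_increase = [1, 1, 1, 2, 2, 3, 4, 6, 10, 20]

skews = {
    "raw": sample_skewness(utility_increase),
    "sqrt": sample_skewness([square_root(v) for v in utility_increase]),
    "cbrt": sample_skewness([cube_root(v) for v in utility_increase]),
    "log": sample_skewness([log_transform(v) for v in utility_increase]),
}

for name, value in skews.items():
    print(f"{name:>4}: skewness = {value:.3f}")
```

On this sample every transformation lowers the skewness compared to the raw values, with the logarithm compressing the right tail the most. The base of the logarithm only rescales the data, so it does not change the skewness.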

Sona mentioned something about discarding data (outliers). Would that be another approach?
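For comparison, here is a minimal sketch of what discarding outliers could look like. It uses the common 1.5 x IQR rule (Tukey's fences), which is my choice for illustration and not necessarily what Sona had in mind; the `utility_increase` values are again hypothetical.

```python
import statistics

# Hypothetical utility-increase values with one extreme value in the right tail.
utility_increase = [1, 1, 1, 2, 2, 3, 4, 6, 10, 20]

# Quartiles of the sample (statistics.quantiles needs Python 3.8+).
q1, _q2, q3 = statistics.quantiles(utility_increase, n=4)
iqr = q3 - q1

# Tukey's fences: anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

kept = [v for v in utility_increase if lower_fence <= v <= upper_fence]
dropped = [v for v in utility_increase if v < lower_fence or v > upper_fence]

print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"kept: {kept}")
print(f"dropped: {dropped}")
```

Unlike the transformations, this approach throws information away, so whether it is acceptable depends on whether the large utility increases are noise or real cases the agent has to learn from.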

brrrachel commented 4 years ago

I implemented the transformation functions and plots to compare the results. I also added these transformation functions to the DataHandler so that they can be tested in the environment, @MrBanhBao.
