Open DarkMenacer opened 8 months ago
To imitate the example above, we created a dataset for our specific case:
```
id,name,state,10-14,15-19,latitude,longitude
1,Mumbai,MAHARASHTRA,0,1,0.333909,1.272195
2,Pune,MAHARASHTRA,0,1,0.322582,1.289061
3,Nashik,MAHARASHTRA,0,1,0.348765,1.287309
4,Ahmadnagar,MAHARASHTRA,0,1,0.411741,1.273569
5,Kolhapur,MAHARASHTRA,1,0,0.291182,1.295745
6,Solapur,MAHARASHTRA,1,0,0.308269,1.325013
7,Belgaum,KARNATAKA,1,0,0.276900,1.300422
8,Dharwad,KARNATAKA,1,0,0.269694,1.309014
```
The dataset has two protected groups: the 10-14 age group and the 15-19 age group. Clearly, the first four data points belong to the second protected group and the last four to the first. Also, according to their latitudes and longitudes, they are located quite similarly to the setup in the comment above.
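To make "balanced" concrete, here is a hedged sketch of one standard way to score it (the Chierichetti-style balance measure; the repo may use a different fairness objective internally). The `balance` helper and the toy labelings below are illustrations, not taken from the repo:

```python
import numpy as np

def balance(labels, groups):
    """Balance for two protected groups: the minimum over clusters of
    min(#group0 / #group1, #group1 / #group0).
    1.0 means every cluster contains both groups in equal numbers;
    0.0 means some cluster contains only one protected group."""
    labels, groups = np.asarray(labels), np.asarray(groups)
    vals = []
    for c in np.unique(labels):
        g = groups[labels == c]
        n0, n1 = int(np.sum(g == 0)), int(np.sum(g == 1))
        if n0 == 0 or n1 == 0:
            return 0.0
        vals.append(min(n0 / n1, n1 / n0))
    return min(vals)

# Mirrors the dataset above: first four points in the 15-19 group (1),
# last four in the 10-14 group (0).
groups = [1, 1, 1, 1, 0, 0, 0, 0]
print(balance([0, 0, 0, 0, 1, 1, 1, 1], groups))  # 0.0: each cluster is one group
print(balance([0, 1, 0, 1, 0, 1, 0, 1], groups))  # 1.0: perfectly balanced
```

A clustering that simply splits the points by protected group scores 0.0, which is what "not balanced" means in the screenshots below.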
Different colors represent different clusters and different shapes represent different protected groups. Cluster centers are represented by circles of that color.
This image (from 'Making sense of the output') clearly shows that the output is not balanced. Surprising!
What could be the issue?
After trying things like normalizing the dataset before passing it to the algorithm, and changing the fairness variable in the code to False, the output did not change. But then we thought of changing the value of lambda, making it a single value instead of a range (lambda_tune).
Output for lambda = 1
This is an absolutely balanced output (perfect representation of both protected groups in both clusters)! Seems like a victory! But why does setting lambda = 1 suddenly give such a convincing result? After more trials, lambda = 1, 2, 3, ..., 15, 16 all gave the same result, but lambda >= 17 started giving unfair results. Why? Also, this time changing the fairness variable in the code to False did give a different (unfair) result.
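A minimal sketch of why lambda behaves like a tradeoff knob with a threshold (this is a toy model, not the repo's actual objective): suppose each point is assigned to the cluster minimizing distance plus lambda times a fairness penalty. The numbers below are made up to illustrate the flip:

```python
import numpy as np

# Hypothetical per-cluster costs for a single point: cluster 0 is
# spatially closer, but assigning the point there worsens group balance.
distance = np.array([1.0, 4.0])  # cluster 0 is nearer
penalty  = np.array([2.0, 0.0])  # but cluster 0 hurts fairness

for lam in (0.0, 1.0, 2.0):
    cost = distance + lam * penalty
    print(lam, int(np.argmin(cost)))  # assignment flips from 0 to 1 at lam = 2
```

With lambda small, the distance term dominates and the assignment ignores fairness; past a threshold, the penalty term dominates and the assignment flips. The observed band of lambdas giving identical results, followed by a change at lambda >= 17, is consistent with this kind of thresholded tradeoff, though the exact values depend on the repo's objective.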
Description
The algorithm, on line 71 of test_fair_clustering.py, has a variable called 'fairness'. The code claims that setting this value to False should produce unfair clustering results. It also has a trade-off controller lambda, which should produce unfair results when set to 0.
However, when testing on intuitive datasets, the output still appears to come from a fair algorithm rather than an unfair one.
To do
Understand how the variable 'fairness' affects the code and how to produce unfair results when it is set to False (and likewise when the trade-off controller lambda is set to 0).
Example
After setting 'fairness' to False, an intuitive dataset like
Protected group 1: [1, 1] [2, 1] [1.5, 2] [2.5, 2]
Protected group 2: [5, 1] [6, 1] [5.5, 2] [6.5, 2]
produces the following output:
Whereas it should produce:
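For reference, the expected unfair output on this example can be reproduced with plain k-means (no fairness term at all). A minimal Lloyd's-iteration sketch in NumPy, seeding one center in each spatial blob (the seeding and iteration count are illustrative choices, not the repo's initialization):

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [1.5, 2], [2.5, 2],       # protected group 1
              [5, 1], [6, 1], [5.5, 2], [6.5, 2]], float)  # protected group 2

centers = X[[0, 4]].copy()               # one seed per spatial blob
for _ in range(10):                      # Lloyd's iterations
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # [0 0 0 0 1 1 1 1]: each cluster contains a single protected group
```

Since the two groups sit in well-separated blobs, a purely distance-based objective splits them apart, which is exactly the unfair result one would expect when 'fairness' is off. That the repo still returns a balanced clustering in this setting is what makes the behavior of the flag suspicious.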