Open DarkMenacer opened 8 months ago
To imitate the example above, we created a dataset for our specific case:
```
id,name,state,10-14,15-19,latitude,longitude
1,Mumbai,MAHARASHTRA,0,1,0.333909,1.272195
2,Pune,MAHARASHTRA,0,1,0.322582,1.289061
3,Nashik,MAHARASHTRA,0,1,0.348765,1.287309
4,Ahmadnagar,MAHARASHTRA,0,1,0.411741,1.273569
5,Kolhapur,MAHARASHTRA,1,0,0.291182,1.295745
6,Solapur,MAHARASHTRA,1,0,0.308269,1.325013
7,Belgaum,KARNATAKA,1,0,0.276900,1.300422
8,Dharwad,KARNATAKA,1,0,0.269694,1.309014
```
The dataset has two protected groups: the 10-14 age group and the 15-19 age group. Clearly, the first four data points belong to the second protected group and the last four to the first. Also, according to their latitudes and longitudes, they are located quite similarly to the setup in the comment above.
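To make "balanced" concrete, here is a hedged sketch of one standard way to score it (the Chierichetti-style balance measure; the repo may use a different fairness objective internally). The `balance` helper and the toy labelings below are illustrations, not taken from the repo:

```python
import numpy as np

def balance(labels, groups):
    """Balance for two protected groups: the minimum over clusters of
    min(#group0 / #group1, #group1 / #group0).
    1.0 means every cluster contains both groups in equal numbers;
    0.0 means some cluster contains only one protected group."""
    labels, groups = np.asarray(labels), np.asarray(groups)
    vals = []
    for c in np.unique(labels):
        g = groups[labels == c]
        n0, n1 = int(np.sum(g == 0)), int(np.sum(g == 1))
        if n0 == 0 or n1 == 0:
            return 0.0
        vals.append(min(n0 / n1, n1 / n0))
    return min(vals)

# Mirrors the dataset above: first four points in the 15-19 group (1),
# last four in the 10-14 group (0).
groups = [1, 1, 1, 1, 0, 0, 0, 0]
print(balance([0, 0, 0, 0, 1, 1, 1, 1], groups))  # 0.0: each cluster is one group
print(balance([0, 1, 0, 1, 0, 1, 0, 1], groups))  # 1.0: perfectly balanced
```

A clustering that simply splits the points by protected group scores 0.0, which is what "not balanced" means in the screenshots below.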
Different colors represent different clusters and different shapes represent different protected groups. Cluster centers are represented by circles of that color.
This image (from 'Making sense of the output') clearly shows that the output is not balanced. Surprising!
What could be the issue?
After trying things like normalizing the dataset before passing it to the algorithm, and changing the fairness variable in the code to False, the output did not change. But then we thought of changing the value of lambda, making it a single value instead of a range (lambda_tune).
Output for lambda = 1
This is an absolutely balanced output (perfect representation of both protected groups in both clusters)! Seems like a victory! But why does setting lambda = 1 suddenly give such a convincing result? After more trials, lambda = 1, 2, 3, ..., 15, 16 all gave the same result, but lambda >= 17 started giving unfair results. Why? Also, this time changing the fairness variable in the code to False did give a different (unfair) result.
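A minimal sketch of why lambda behaves like a tradeoff knob with a threshold (this is a toy model, not the repo's actual objective): suppose each point is assigned to the cluster minimizing distance plus lambda times a fairness penalty. The numbers below are made up to illustrate the flip:

```python
import numpy as np

# Hypothetical per-cluster costs for a single point: cluster 0 is
# spatially closer, but assigning the point there worsens group balance.
distance = np.array([1.0, 4.0])  # cluster 0 is nearer
penalty  = np.array([2.0, 0.0])  # but cluster 0 hurts fairness

for lam in (0.0, 1.0, 2.0):
    cost = distance + lam * penalty
    print(lam, int(np.argmin(cost)))  # assignment flips from 0 to 1 at lam = 2
```

With lambda small, the distance term dominates and the assignment ignores fairness; past a threshold, the penalty term dominates and the assignment flips. The observed band of lambdas giving identical results, followed by a change at lambda >= 17, is consistent with this kind of thresholded tradeoff, though the exact values depend on the repo's objective.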
Description
The algorithm, on line 71 of test_fair_clustering.py, has a variable called 'fairness'. The code claims that setting this value to False should produce unfair clustering results. It also has a trade-off controller lambda, which should produce unfair results when set to 0.
However, when testing on intuitive datasets, the output still appears to come from a fair algorithm rather than an unfair one.
To do
Understand how the variable 'fairness' affects the code and how to produce unfair results when it is set to False (and likewise when the trade-off controller lambda is set to 0).
Example
After setting 'fairness' to False, an intuitive dataset like
Protected group 1: [1, 1] [2, 1] [1.5, 2] [2.5, 2]
Protected group 2: [5, 1] [6, 1] [5.5, 2] [6.5, 2]
produces the following output:
Whereas it should produce:
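For reference, the expected unfair output on this example can be reproduced with plain k-means (no fairness term at all). A minimal Lloyd's-iteration sketch in NumPy, seeding one center in each spatial blob (the seeding and iteration count are illustrative choices, not the repo's initialization):

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [1.5, 2], [2.5, 2],       # protected group 1
              [5, 1], [6, 1], [5.5, 2], [6.5, 2]], float)  # protected group 2

centers = X[[0, 4]].copy()               # one seed per spatial blob
for _ in range(10):                      # Lloyd's iterations
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # [0 0 0 0 1 1 1 1]: each cluster contains a single protected group
```

Since the two groups sit in well-separated blobs, a purely distance-based objective splits them apart, which is exactly the unfair result one would expect when 'fairness' is off. That the repo still returns a balanced clustering in this setting is what makes the behavior of the flag suspicious.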