Open StaffanBetner opened 1 year ago
Based on how Jaccard is defined I would code them as zeros. I presume, however, that you actually want to distinguish them from votes against -- which raises questions about what metric you should be using. It might actually make some sense to code votes for as 1, votes against as -1, and abstentions and absences as 0 and use cosine distance?
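To make the suggestion above concrete, here is a minimal sketch of the signed encoding with cosine distance. The vote matrix and the helper name `cosine_distance_matrix` are made up for illustration, and giving all-zero rows a norm of 1 is an assumption, not something settled in this thread.

```python
import numpy as np

# Hypothetical encoding: 1 = vote for, -1 = vote against,
# 0 = abstention or absence (contributes nothing to the dot product).
votes = np.array([
    [ 1, -1,  1,  0],   # member A
    [ 1, -1,  0,  1],   # member B
    [-1,  1, -1,  0],   # member C
], dtype=float)


def cosine_distance_matrix(x):
    """Pairwise cosine distance between the rows of x."""
    norms = np.linalg.norm(x, axis=1)
    # Rows with no recorded votes have zero norm; use 1 instead so the
    # division is defined (their similarity to every row is then 0).
    safe = np.where(norms == 0, 1.0, norms)
    similarity = (x @ x.T) / np.outer(safe, safe)
    return 1.0 - similarity


dist = cosine_distance_matrix(votes)
print(np.round(dist, 3))
```

Members who always vote together end up at distance 0 and members who always oppose each other at distance 2, while zeros only shrink the overlap. As far as I know, umap also accepts `metric="cosine"` directly, so the encoded matrix could be passed in without precomputing anything.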
I ended up precomputing a distance matrix and passing that in, so that only actual voting decisions are compared. Absences are excluded because they are not meaningful in a Swedish context, whereas abstentions are kept because they are intentional. Here is my R code, which may benefit someone else (I am using umap through reticulate):
# vectors a and b should be equal length, e.g. the full voting record of an individual
# this calculates the distances pairwise
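jaccard_dist <- function(a, b) {
  if (length(a) != length(b)) {
    stop("Unequal lengths")
  }
  # positions where both votes are present and identical (NA comparisons are dropped)
  intersection <- sum(a == b, na.rm = TRUE)
  # non-missing votes in a plus non-missing votes in b, minus the matches
  union <- length(na.omit(a)) + length(na.omit(b)) - intersection
  output <- 1 - (intersection / union)
  return(output)
}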
# to create a distance matrix
dist_output <- usedist::dist_make(dat_mat, jaccard_dist)
And here is ChatGPT's translation into Python 😀
import numpy as np
from scipy.spatial.distance import pdist, squareform
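def jaccard_dist(a, b):
    # a and b are 1-D float arrays of equal length, with NaN marking a missing vote
    if len(a) != len(b):
        raise ValueError("Unequal lengths")
    # count positions where both votes are present and identical
    intersection = np.sum(np.equal(a, b), where=~np.isnan(a) & ~np.isnan(b))
    union = (len(a) - np.isnan(a).sum()) + (len(b) - np.isnan(b).sum()) - intersection
    output = 1 - (intersection / union)
    return output


def dist_make(dat_mat, distance_function):
    # pdist calls distance_function on every pair of rows; squareform expands to a full matrix
    dist_output = squareform(pdist(dat_mat, metric=distance_function))
    return dist_output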
# To create a distance matrix
# Replace 'data_matrix' with the actual data matrix you are working with.
dist_output = dist_make(data_matrix, jaccard_dist)
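For larger roll call matrices, calling a Python function on every pair of rows can be slow. The following is a sketch of an all-pairs variant that should give the same numbers as `jaccard_dist` above. The function name `masked_match_distance_matrix` and the toy `records` array are hypothetical, and assigning distance 1 to pairs with no observed votes at all is an assumption of this sketch.

```python
import numpy as np


def masked_match_distance_matrix(x):
    """All-pairs distance for a 2-D float array where NaN marks a missing vote."""
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    n = x.shape[0]

    # Count, for every pair of rows, the positions where both votes are
    # present and identical. NaN never equals a value, so missing entries
    # are ignored automatically.
    matches = np.zeros((n, n))
    for value in np.unique(x[observed]):
        hits = (x == value).astype(float)
        matches += hits @ hits.T

    # Same "union" as in jaccard_dist: non-missing votes in each row, minus matches.
    counts = observed.sum(axis=1).astype(float)
    union = counts[:, None] + counts[None, :] - matches

    # Assumption: pairs without any observed votes get ratio 0, i.e. distance 1.
    ratio = np.divide(matches, union, out=np.zeros_like(matches), where=union > 0)
    dist = 1.0 - ratio
    np.fill_diagonal(dist, 0.0)
    return dist


# Toy voting records: 1 = yes, 0 = no, NaN = missing.
records = np.array([
    [1.0, 0.0, np.nan, 1.0],
    [1.0, 1.0, 1.0, np.nan],
    [1.0, 0.0, np.nan, 1.0],
])
D = masked_match_distance_matrix(records)
print(D)
```

For the first two rows there is one matching vote and three observed votes each, so the distance is 1 - 1/(3 + 3 - 1) = 0.8. As far as I know, a square matrix like `D` can then be given to umap with `metric="precomputed"`, which is the route described above.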
My case is roll call data in which members of parliament can be absent (in a non-meaningful way) or hold their seat only temporarily, so those votes are encoded as missing values. However, I get errors due to the missing data. Since the Jaccard metric itself is agnostic to the amount of available information, is there any way to handle this?