lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

How should missing values be handled for Jaccard metric? #986

Open StaffanBetner opened 1 year ago

StaffanBetner commented 1 year ago

My case is roll call data in which members of parliament can be away (in a non-meaningful way) or hold their seat only temporarily, so those votes are encoded as missing values. However, I get errors because of the missing data. Since the Jaccard metric itself is agnostic to the amount of available information, is there any way to handle this?

lmcinnes commented 1 year ago

Based on how Jaccard is defined I would code them as zeros. I presume, however, that you actually want to distinguish them from votes against -- which raises questions about what metric you should be using. It might actually make some sense to code votes for as 1, votes against as -1, and abstentions and absences as 0 and use cosine distance?
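A minimal sketch of that coding, assuming the raw votes arrive as string labels. The labels `"yes"`, `"no"`, `"abstain"`, `"absent"` and the helper names `encode_votes` and `cosine_distance` are hypothetical and only serve the illustration. The cosine distance is computed by hand with NumPy so the effect of the zeros is visible.

```python
import numpy as np

# Hypothetical vote labels; adapt to however the roll-call data is stored.
VOTE_CODES = {"yes": 1.0, "no": -1.0, "abstain": 0.0, "absent": 0.0}

def encode_votes(records):
    """Map a 2-D list of vote labels to a float array of +1 / -1 / 0."""
    return np.array([[VOTE_CODES[v] for v in row] for row in records])

def cosine_distance(a, b):
    """1 - cosine similarity; returns nan if either vector is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        return float("nan")
    return 1.0 - float(np.dot(a, b)) / norm

records = [
    ["yes", "no", "yes", "absent"],   # member 0
    ["yes", "no", "absent", "yes"],   # member 1: agrees with member 0 where both voted
    ["no", "yes", "no", "abstain"],   # member 2: opposes member 0 on every recorded vote
]
X = encode_votes(records)
d01 = cosine_distance(X[0], X[1])
d02 = cosine_distance(X[0], X[2])
```

With this coding a zero contributes nothing to the dot product, so absences and abstentions neither add agreement nor disagreement, although they are also not distinguished from each other. An array like `X` could be passed to UMAP with `metric="cosine"` instead of computing distances by hand.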

StaffanBetner commented 1 year ago

I ended up precalculating a distance matrix and providing that, so that I only compare actual voting decisions, i.e. not absences, which are not meaningful in a Swedish context (in contrast to abstentions, which have an intentional meaning). Here is my R code, which may benefit someone else (I am using umap through reticulate):

# Vectors a and b should be of equal length, e.g. the full voting records of two individuals.
# This calculates the distance for one pair.
jaccard_dist <- function(a, b) {
  if (length(a) != length(b)) {
    stop("Unequal lengths")
  }
  # `==` is already vectorized; comparisons involving NA yield NA and are dropped by na.rm
  intersection <- sum(a == b, na.rm = TRUE)
  union <- length(na.omit(a)) + length(na.omit(b)) - intersection
  1 - (intersection / union)
}

# To create a distance matrix
dist_output <- usedist::dist_make(dat_mat, jaccard_dist)
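Written out as a formula, the function above computes the following, where the symbols $m$, $n_a$ and $n_b$ are introduced here only for illustration:

```latex
% m   = number of positions where both votes are recorded and equal
% n_a = number of recorded (non-missing) votes in a
% n_b = number of recorded (non-missing) votes in b
d(a, b) = 1 - \frac{m}{n_a + n_b - m}
```

Note that $m$ counts agreement on any value, so two recorded votes against also count as a match. This differs from the usual binary Jaccard index, which ignores positions where both entries are zero.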

And here is ChatGPT's translation into Python 😀

import numpy as np
from scipy.spatial.distance import pdist, squareform

def jaccard_dist(a, b):
    if len(a) != len(b):
        raise ValueError("Unequal lengths")

    intersection = np.sum(np.equal(a, b), where=~np.isnan(a) & ~np.isnan(b))
    union = (len(a) - np.isnan(a).sum()) + (len(b) - np.isnan(b).sum()) - intersection
    output = 1 - (intersection / union)
    return output

def dist_make(dat_mat, distance_function):
    dist_output = squareform(pdist(dat_mat, metric=distance_function))
    return dist_output

# To create a distance matrix
# Replace 'data_matrix' with the actual data matrix you are working with.
dist_output = dist_make(data_matrix, jaccard_dist)
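As a hedged sketch of the remaining steps, the same distance can be computed for all pairs at once with NumPy broadcasting and then handed to UMAP as a precomputed matrix. The function names `masked_agreement_distances` and `embed_precomputed` are hypothetical. The sketch assumes missing votes are stored as NaN in a float array and that umap-learn is installed for the embedding step.

```python
import numpy as np

def masked_agreement_distances(X):
    """All-pairs version of jaccard_dist above; X is 2-D with NaN as missing."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)            # True where a vote was recorded
    counts = observed.sum(axis=1)      # recorded votes per row
    # equal[i, j, t] is True where rows i and j hold the same value at position t.
    # NaN never compares equal, so missing entries drop out automatically.
    # This builds an (n, n, k) boolean array, so memory grows quickly with n.
    equal = X[:, None, :] == X[None, :, :]
    intersection = equal.sum(axis=2)
    union = counts[:, None] + counts[None, :] - intersection
    with np.errstate(divide="ignore", invalid="ignore"):
        # NaN results where both rows are entirely missing (0 / 0)
        return 1.0 - intersection / union

def embed_precomputed(D, **umap_kwargs):
    """Embed a square distance matrix D with UMAP (requires umap-learn)."""
    import umap  # imported lazily so the distance code runs without it
    return umap.UMAP(metric="precomputed", **umap_kwargs).fit_transform(D)

nan = np.nan
X = np.array([
    [1, 0, 1, nan],
    [1, 0, nan, 1],
    [0, 1, 0, 0],
])
D = masked_agreement_distances(X)
```

Rows 0 and 1 agree on two positions and have three recorded votes each, so their distance is 1 - 2 / (3 + 3 - 2) = 0.5. Any NaN entries in `D` (rows with no recorded votes at all) would presumably need to be dropped or filled before calling `embed_precomputed(D)`.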