phurwicz / hover

:speedboat: Label data at scale. Fun and precision included.
https://phurwicz.github.io/hover
MIT License

Use of semi-supervised fit step for Umap and ivis #61

Closed: clemgaut closed this issue 1 year ago

clemgaut commented 1 year ago

Hello,

Thank you so much for providing hover as an open source tool!

I was wondering if it would be possible to have the option to make a semi-supervised fit with umap or ivis. Indeed, from what I understand, both umap and ivis are fit in an unsupervised way: only the embedding information is used in the fit step. For data belonging to the train and dev sets (public sets from hover implementation), hover knows what class they belong to. As both umap and ivis support providing labels during the fit step (-1 if no label is known, to do semi-supervised fit), I was wondering if you considered adding the class information in the fit step?
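The label convention mentioned above can be sketched with NumPy alone. The snippet below builds a label array in which -1 marks unlabeled points; `features`, `lookup`, and `build_semisupervised_labels` are hypothetical names for illustration and are not part of hover. The fit call itself is only shown as a comment, under the assumption that umap-learn takes such an array through the `y` argument of `fit`.

```python
import numpy as np

# Hypothetical data: raw features and a partial feature -> integer label lookup.
features = ["good movie", "bad movie", "unseen text", "great film"]
lookup = {"good movie": 1, "bad movie": 0, "great film": 1}

def build_semisupervised_labels(features, lookup):
    """Return an integer label array with -1 marking unlabeled points."""
    return np.array([lookup.get(f, -1) for f in features], dtype=int)

y = build_semisupervised_labels(features, lookup)
print(y)

# Assumed usage (not executed here): pass the labels alongside the embeddings,
# e.g. umap.UMAP().fit(embeddings, y=y), where -1 is treated as "no label".
```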

phurwicz commented 1 year ago

Hi @clemgaut, thanks for using hover!

Semi-supervised fit should be doable right now. The trick would be to bake label information into the vectors that you pass to dimensionality reduction.

For example, let's say you are using unsupervised fit with some vectorizer function:

def vectorizer_unsupervised(feature):
    vector = some_pretrained_model.predict(feature)
    return vector

You can pre-compute a feature -> label lookup using your labeled data and do

import numpy as np

# assuming a dictionary called "lookup" that maps labeled data to integer labels
# also assuming that you already know the number of classes for your classifier
# if not, just set NUM_CLASSES to be large enough to cover known classes
NUM_CLASSES = 3

def one_hot_encoding(feature, num_classes):
    '''
    One-hot vector for labeled data. For unlabeled data, return a zero-valued vector.
    '''
    arr = np.zeros(num_classes)
    label = lookup.get(feature, -1)
    if label >= 0:
        arr[label] = 1.0
    return arr

def vectorizer_semisupervised(feature):
    '''
    Vectorizer to pass to dimensionality reduction.
    '''
    vec1 = vectorizer_unsupervised(feature)
    vec2 = one_hot_encoding(feature, NUM_CLASSES)
    # np.concatenate expects a sequence of arrays as its first argument
    return np.concatenate([vec1, vec2])

So umap or ivis will just work the same way as unsupervised, but you've baked label information into the vectors.
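To make the effect concrete, here is a minimal runnable sketch of the same idea. The `vectorizer_unsupervised` below is a made-up stand-in for a pretrained model (it only uses simple text statistics), and the `lookup` contents are hypothetical.

```python
import numpy as np

NUM_CLASSES = 3
# Hypothetical lookup from labeled features to integer labels.
lookup = {"labeled example": 2}

def vectorizer_unsupervised(feature):
    # Stand-in for a pretrained model: character count and space count.
    return np.array([len(feature), feature.count(" ")], dtype=float)

def one_hot_encoding(feature, num_classes):
    # One-hot vector for labeled data, all zeros for unlabeled data.
    arr = np.zeros(num_classes)
    label = lookup.get(feature, -1)
    if label >= 0:
        arr[label] = 1.0
    return arr

def vectorizer_semisupervised(feature):
    # Append the label block to the unsupervised embedding.
    return np.concatenate([
        vectorizer_unsupervised(feature),
        one_hot_encoding(feature, NUM_CLASSES),
    ])

labeled_vec = vectorizer_semisupervised("labeled example")
unlabeled_vec = vectorizer_semisupervised("something else")
print(labeled_vec, unlabeled_vec)
```

Both vectors have the same length (embedding size plus `NUM_CLASSES`), and only the labeled one has a nonzero entry in the trailing block. How strongly the labels influence the layout likely depends on the scale of that block relative to the embedding, so multiplying the one-hot part by a weight may be worth trying; that is a suggestion, not something tested here.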

clemgaut commented 1 year ago

Thank you for your answer. I will also look into using the predefined functions of umap and ivis for unsupervised learning. I might do a PR if I get something working eventually.