keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

skipgrams negative sample should only sample from out of window words #356

Closed goodcheer closed 2 years ago

goodcheer commented 2 years ago

When doing negative sampling, the indices should be sampled from outside the current window, by definition.

However, in tf.keras.preprocessing.sequence.skipgrams, when a negative couple [center word index, context word index] is sampled, the context word index is drawn from the whole index range, including the indices of context words that lie inside the window. (line 225)

https://github.com/keras-team/keras-preprocessing/blob/4538765fd369def80f81ad977bcf8e40e58c2f82/keras_preprocessing/sequence.py#L219-L230

As a result, a couple [center word index, within-window context word index] can appear with two opposing labels (0: negative, 1: positive).
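To make the overlap concrete, here is a minimal pure-Python sketch that does not call the library. It assumes positives are all (center, context) pairs inside the window and that negative contexts are drawn uniformly from 1 to vocabulary_size - 1, which is my reading of the linked lines rather than a verbatim copy. The helper names `positive_pairs` and `possible_negative_pairs` are hypothetical.

```python
def positive_pairs(sequence, window_size):
    """Collect every (center, context) pair that falls inside the window."""
    pairs = set()
    for i, center in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.add((center, sequence[j]))
    return pairs


def possible_negative_pairs(sequence, vocabulary_size):
    """Enumerate every pair an unrestricted sampler could emit.

    Assumption: the center comes from the sequence and the context is
    any index in 1..vocabulary_size-1, with no check against the window.
    """
    return {(c, w) for c in set(sequence) for w in range(1, vocabulary_size)}


seq = [1, 2, 3, 4]
positives = positive_pairs(seq, window_size=3)
candidates = possible_negative_pairs(seq, vocabulary_size=5)
collisions = positives & candidates

# 12 positives, 16 candidate negatives, 12 of which are also positives.
print(len(positives), len(candidates), len(collisions))  # 12 16 12
```

Under these assumptions, three out of four candidate negatives for this sequence coincide with a positive couple, so conflicting labels are very likely for small vocabularies.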

I was able to verify this issue with the following simple code.

from tensorflow.keras.preprocessing.sequence import skipgrams

seq = [1, 2, 3, 4]

# sgns: positives plus negative samples; sgps: positives only.
sgns = skipgrams(seq, 5, window_size=3, negative_samples=1)
sgps = skipgrams(seq, 5, window_size=3, negative_samples=0)

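def find_mislabeled(sg_arr):
    """Return the couples that appear with more than one distinct label."""
    labels_by_couple = {}
    for couple, label in zip(*sg_arr):
        # Lists are not hashable, so use a tuple as the dictionary key.
        labels_by_couple.setdefault(tuple(couple), set()).add(label)
    return {k: v for k, v in labels_by_couple.items() if len(v) > 1}

# Negatives are drawn at random, so the second result is very likely
# (but not strictly guaranteed) to be False on any given run.
print(len(find_mislabeled(sgps)) == 0)  # True
print(len(find_mislabeled(sgns)) == 0)  # False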
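One possible way to avoid the conflict is to reject any negative candidate that already occurs as a positive couple. The sketch below is a standalone illustration, not a patch to the library: `window_pairs`, `sample_negatives`, `max_tries`, and `seed` are hypothetical names, and treating index 0 as padding is an assumption. It filters by couple value rather than by position, so a word that is in the window at one position is excluded as a negative for that center everywhere.

```python
import random


def window_pairs(sequence, window_size):
    """Return (center, context) pairs inside the window, skipping index 0."""
    pairs = []
    for i, center in enumerate(sequence):
        if not center:
            continue
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j]:
                pairs.append((center, sequence[j]))
    return pairs


def sample_negatives(sequence, vocabulary_size, window_size,
                     negative_samples=1.0, max_tries=20, seed=None):
    """Draw negatives by rejection sampling so none matches a positive couple."""
    rng = random.Random(seed)
    positives = window_pairs(sequence, window_size)
    positive_set = set(positives)
    centers = [center for center, _ in positives]
    negatives = []
    for i in range(int(len(positives) * negative_samples)):
        center = centers[i % len(centers)]
        # Bounded retries: give up on this draw if every attempt collides,
        # which can happen when nearly all candidates are positives.
        for _ in range(max_tries):
            candidate = (center, rng.randint(1, vocabulary_size - 1))
            if candidate not in positive_set:
                negatives.append(candidate)
                break
    return positives, negatives


pos, neg = sample_negatives([1, 2, 3, 4, 5, 6], vocabulary_size=10,
                            window_size=1, seed=0)
print(set(pos).isdisjoint(neg))  # True
```

The bounded retry means fewer negatives than requested may be returned when the vocabulary is small relative to the window, which seems preferable to looping forever or emitting a mislabeled couple.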
goodcheer commented 2 years ago

Posted this issue to the maintained Keras repository.