Code does not match explanation

hoosierEE commented 1 year ago

The word2vec tutorial at first gives one definition of negative sampling:

A negative sample is defined as a (target_word, context_word) pair such that the context_word does not appear in the window_size neighborhood of the target_word. For the example sentence, these are a few potential negative samples (when window_size is 2)

However, the implementation uses a second definition:

To produce additional skip-gram pairs that would serve as negative samples for training, you need to sample random words from the vocabulary.

There are several places where this second definition is used. First in the "small" example:

# Get target and context words for one positive skip-gram.
target_word, context_word = positive_skip_grams[0]

# Set the number of negative samples per positive context.
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

It's used again in the Summary diagram, and later in the definition for generate_training_data:

    # Iterate over each positive skip-gram pair to produce training examples
    # with a positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

With a large enough vocabulary, random sampling is unlikely to pick words near target_word purely by chance, and as a result the model "works". However, if you test with a small example, you can see that this form of sampling excludes at most the context_word, not the other words in the window.
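To put rough numbers on the small-versus-large case, here is a minimal sketch of the log-uniform (Zipfian) base distribution that the sampler's documentation describes, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). The helper names and the window ids are hypothetical, and the figure is a per-draw probability that ignores the effect of unique=True.

```python
import math

def log_uniform_prob(class_id: int, range_max: int) -> float:
    """Probability of one draw landing on class_id under the log-uniform distribution."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

def window_mass(window_ids, range_max: int) -> float:
    """Probability that a single draw lands on any id in window_ids."""
    return sum(log_uniform_prob(i, range_max) for i in set(window_ids))

# Hypothetical ids for the words surrounding the target (assumption, not from the tutorial).
window_ids = [1, 2, 4]

small = window_mass(window_ids, range_max=8)
large = window_mass(window_ids, range_max=4096)
print(f"vocab_size=8: {small:.3f}   vocab_size=4096: {large:.3f}")
```

The mass on the window shrinks as range_max grows, although low ids keep a noticeable share because the distribution favors them.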

My understanding is that for a context window of [the wide road shimmered] with the target word road, the positive (+) and negative (-) examples should be like this:

[the wide road shimmered] in the hot sun
 +++ ++++      +++++++++  -- --- --- ---

Positive samples for road come from [the, wide, shimmered], and negative samples for the pair (road, shimmered) come from [in, the, hot, sun].
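The split described above can be sketched in plain Python. This is only an illustration of the first definition, not code from the tutorial; split_by_window and its arguments are hypothetical names, and the window bounds are passed explicitly so that the result matches the bracketed example.

```python
def split_by_window(tokens, start, end, target_index):
    """Return (positives, negatives) for the window tokens[start:end] around target_index."""
    positives = [tokens[i] for i in range(start, end) if i != target_index]
    negatives = tokens[:start] + tokens[end:]
    return positives, negatives

tokens = "the wide road shimmered in the hot sun".split()

# Position-based split: window is [the wide road shimmered], target is "road".
positives, negatives = split_by_window(tokens, start=0, end=4, target_index=2)
print(positives)  # ['the', 'wide', 'shimmered']
print(negatives)  # ['in', 'the', 'hot', 'sun']

# "the" occurs both inside and outside the window; filtering by word drops it.
positive_words = set(positives)
word_level_negatives = [w for w in negatives if w not in positive_words]
print(word_level_negatives)  # ['in', 'hot', 'sun']
```

The last step shows a wrinkle of the first definition: a word that appears both inside and outside the window is a negative by position but not by vocabulary id, which is the level the sampler works at.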

Either the text's definition of negative sampling should be changed, or the code should be changed to discard positive samples from negative_sampling_candidates.

cantonios commented 1 year ago

Agreed, do you want to adjust the code and create a PR to exclude all context words for the target word?

hoosierEE commented 1 year ago

I'll give it a try and let you know with a PR.

hoosierEE commented 1 year ago

I don't usually work with notebooks, so please excuse the noisy diff. It looks like there was a bunch of HTML escaping in the original that wasn't present in the .ipynb downloaded from Colab.

I saw an improvement in accuracy for the same number of epochs (92% versus 89%), but generate_training_data runs more slowly (about 2 minutes versus under 1 minute on Colab). This is the important part of the diff:

+    # Generate positive context windows for each target word in the sequence.
+    window = defaultdict(list)
+    for i in range(window_size, len(sequence)-window_size):
+      window[sequence[i]].append(sequence[i-window_size:1+i+window_size])

    # Iterate over each positive skip-gram pair to produce training examples
    # with a positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

+      # Discard iteration if negative samples overlap with positive context.
+      for target in window[target_word]:
+        if not any(t in target for t in negative_sampling_candidates):
+          break  # All candidates are true negatives: use this skip_gram.
+      else:
+        continue # Discard this skip_gram
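For comparison, here is a minimal standalone sketch of a stricter variant, written in plain Python rather than TensorFlow. It is not the code from the PR: the function names and example ids are hypothetical, it draws uniformly instead of using the log-uniform sampler, it includes positions at the edges of the sequence, and it treats a word as positive if it falls in any window of the target, whereas the loop above keeps a skip-gram as soon as one window is free of candidates.

```python
import random
from collections import defaultdict

def build_context_sets(sequence, window_size):
    """Map each target id to every id seen within window_size of any of its occurrences."""
    contexts = defaultdict(set)
    for i, target in enumerate(sequence):
        lo = max(0, i - window_size)
        hi = min(len(sequence), i + window_size + 1)
        contexts[target].update(sequence[lo:hi])
    return contexts

def sample_negatives(target, contexts, vocab_size, num_ns, rng):
    """Draw num_ns distinct ids that never share a window with target (uniform draw)."""
    allowed = [i for i in range(vocab_size) if i not in contexts[target]]
    if len(allowed) < num_ns:
        return None  # Too few true negatives; the caller may skip this pair.
    return rng.sample(allowed, num_ns)

# Hypothetical ids for "the wide road shimmered in the hot sun"; 0 is left unused.
sequence = [1, 2, 3, 4, 5, 1, 6, 7]
contexts = build_context_sets(sequence, window_size=2)

rng = random.Random(42)
negatives = sample_negatives(target=3, contexts=contexts, vocab_size=8, num_ns=2, rng=rng)
print(negatives)
```

Precomputing one set per target trades memory for a single membership test per candidate, which may help with the slowdown, though that has not been measured here.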

No changes to the diagrams, and I left the prose unchanged except for a small correction to the Negative sampling for one skip-gram section:

-  You can call the function on one skip-grams's target word and pass the context word as true class to exclude it from being sampled.
+  You can pass words from the positive class but this does not exclude them from the results. For large vocabularies, this is not a problem because the chance of drawing one of the positive classes is small. However for small data you may see overlap between negative and positive samples. Later we will add code to exclude positive samples for slightly improved accuracy at the cost of longer runtime.

simonwardjones commented 2 months ago

Hi, thanks for reporting this. I just wanted to add that this still seems to be an issue in the TensorFlow docs.

tensorflow / text

Code does not match explanation #1228