tensorflow / text

Making text a first-class citizen in TensorFlow.
https://www.tensorflow.org/beta/tutorials/tensorflow_text/intro
Apache License 2.0

negative sampling excludes positive class #1229

Closed hoosierEE closed 4 months ago

hoosierEE commented 8 months ago

Addressing issue #1228

review-notebook-app[bot] commented 8 months ago

Check out this pull request on ReviewNB.

cantonios commented 8 months ago

Also, I just reviewed the documentation for tf.random.log_uniform_candidate_sampler. It explicitly states that it does not reject any accidental positive hits, and then links to the Candidate Sampling Algorithms Reference. In that reference, the "Negative Sampling" row says that it considers negative training classes to be the full set S_i, which does include positive samples. This is opposed to "Sampled Logistic", which considers the set (S_i - T_i). So it may be intentional that there could be accidental hits.
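To make the distinction concrete, here is a plain-Python sketch (not the TF implementation, and using a uniform rather than log-uniform distribution for simplicity) of the two rows in the Candidate Sampling Algorithms reference: "Negative Sampling" draws candidates from the full class set S_i, so accidental positive hits are possible, while "Sampled Logistic" draws from S_i - T_i, rejecting any candidate that is also a true class. The function name and parameters are illustrative, not TF API:

```python
import random

def sample_candidates(vocab_size, num_sampled, true_classes,
                      reject_positives, seed=0):
    """Draw num_sampled candidate class IDs from [0, vocab_size).

    reject_positives=False mimics "Negative Sampling" (candidates from
    the full set S_i; accidental hits kept); reject_positives=True
    mimics "Sampled Logistic" (candidates from S_i - T_i).
    """
    rng = random.Random(seed)
    true_set = set(true_classes)
    candidates = []
    while len(candidates) < num_sampled:
        # Uniform for simplicity; tf.random.log_uniform_candidate_sampler
        # uses a log-uniform (Zipfian) distribution over class IDs.
        c = rng.randrange(vocab_size)
        if reject_positives and c in true_set:
            continue  # "Sampled Logistic": drop the accidental hit
        candidates.append(c)  # "Negative Sampling": hits are kept
    return candidates
```

With `reject_positives=True` the result is guaranteed to be disjoint from the true classes; with `reject_positives=False` overlaps can and do occur, which matches the documented behaviour of `tf.random.log_uniform_candidate_sampler`.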

hoosierEE commented 8 months ago

Your feedback was very helpful, thanks! Building a set from the positive_skip_grams ended up being much faster. On average I see around 10% of negative samples discarded because they overlap with the positive context, resulting in about 1 percentage point improvement in training accuracy at 20 epochs.
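A minimal sketch of that fix, with hypothetical names (the actual notebook code may differ): build a lookup set of context IDs from `positive_skip_grams` for the target word, then drop any sampled negative that collides with it.

```python
def filter_negative_samples(positive_skip_grams, target, negatives):
    """Discard negatives that appear as a positive context for `target`.

    positive_skip_grams: iterable of (target, context) ID pairs.
    negatives: candidate negative class IDs for this target.
    """
    # Set membership makes the overlap check O(1) per candidate.
    positive_context = {context for t, context in positive_skip_grams
                        if t == target}
    return [n for n in negatives if n not in positive_context]
```

For example, `filter_negative_samples([(5, 7), (5, 9), (2, 4)], 5, [7, 8, 9, 3])` drops the accidental hits 7 and 9, returning `[8, 3]`.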