When running semantic text segmentation, I found that if an input utterance line consists entirely of stop words (e.g. "Bye. Uh huh. Yeah."), SemanticTextSegmentation._get_similarity fails with ValueError: Input contains NaN.
Adding a check for NaN in both embeddings seems to solve the problem:
    def _get_similarity(self, text1, text2):
        sentence_1 = [i.text.strip()
                      for i in nlp(text1).sents if len(i.text.split(' ')) > 1]
        sentence_2 = [i.text.strip()
                      for i in nlp(text2).sents if len(i.text.split(' ')) > 2]
        embeding_1 = model.encode(sentence_1)
        embeding_2 = model.encode(sentence_2)
        embeding_1 = np.mean(embeding_1, axis=0).reshape(1, -1)
        embeding_2 = np.mean(embeding_2, axis=0).reshape(1, -1)
        # New check: NaN likely appears when no sentence passed the length
        # filter above, since the mean of an empty array is NaN.
        # Returning 1 treats the two texts as maximally similar.
        if np.any(np.isnan(embeding_1)) or np.any(np.isnan(embeding_2)):
            return 1
        sim = cosine_similarity(embeding_1, embeding_2)
        return sim
I would like someone else to review this, because returning 1 assumes that stop-word-only utterances belong to the same segment as their neighbors, and I don't want to make that assumption on my own.
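For discussion, here is a minimal sketch of an alternative that avoids hard-coding the similarity: detect the empty case before averaging and let the caller decide what an "empty" utterance means. The helper names mean_embedding and cosine_or_default are hypothetical and not part of the library, and the cosine is computed with plain NumPy here so the snippet runs on its own.

```python
import numpy as np


def mean_embedding(embeddings):
    """Return the mean vector with shape (1, dim), or None if there are no values.

    Hypothetical helper: `embeddings` stands in for the output of model.encode().
    """
    embeddings = np.atleast_2d(np.asarray(embeddings, dtype=float))
    if embeddings.size == 0:
        # Nothing to average; np.mean would produce NaN here.
        return None
    return embeddings.mean(axis=0).reshape(1, -1)


def cosine_or_default(emb_1, emb_2, default=None):
    """Cosine similarity of two (1, dim) vectors, or `default` if undefined."""
    if emb_1 is None or emb_2 is None:
        return default
    a, b = emb_1.ravel(), emb_2.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        # A zero vector has no direction, so the cosine is undefined.
        return default
    return float(np.dot(a, b) / denom)


# Usage: an utterance with no usable sentences yields None instead of NaN.
empty = mean_embedding([])
real = mean_embedding([[1.0, 0.0], [0.0, 1.0]])
print(cosine_or_default(empty, real))               # caller sees "undefined"
print(cosine_or_default(empty, real, default=1.0))  # same behavior as "return 1"
```

With default=1.0 this behaves like the check above, while default=None leaves the decision about merging stop-word-only utterances to the segmentation logic.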