openai / generating-reviews-discovering-sentiment

Code for "Learning to Generate Reviews and Discovering Sentiment"
https://arxiv.org/abs/1704.01444
MIT License
1.51k stars 379 forks

Opinion2vec? #61

Open youssefavx opened 4 years ago

youssefavx commented 4 years ago

I want to use this model to convert an opinion to a vector somehow (or at least arrive at an approximation of that).

E.g. "I'm an atheist" and "I believe in god" should be polar opposites. "I hate potatoes" and "I love potatoes" should be polar opposites.

Let's take "I hate potatoes" and "I love potatoes" for now since that seems easier.

If I do:

from encoder import Model

model = Model()

text = ['I hate potatoes', 'I love potaotes']
transformed = model.transform(text)

print('Vectors:', transformed)

print('Sentiment:', transformed[:, 2388])

I get:

Vectors: [[-0.21318194  0.2849596   0.00824564 ...  0.6167125   0.06999006
   0.00869564]
 [-0.10017553 -0.01693083 -0.00667866 ...  0.657396   -0.00200943
   0.00810469]]

Sentiment: [-0.16220786  0.305778  ]

I assume the first output is the feature vector for each sentence and the second is the value of the sentiment unit (index 2388) for each sentence.

I assume this model somehow encodes some meaning about the text, because when I use my numba cosine similarity function on the 2 vectors:

from numba import jit
import numpy as np

@jit(nopython=True)
def cosine_similarity_numba(u:np.ndarray, v:np.ndarray):
    assert(u.shape[0] == v.shape[0])
    uv = 0
    uu = 0
    vv = 0
    for i in range(u.shape[0]):
        uv += u[i]*v[i]
        uu += u[i]*u[i]
        vv += v[i]*v[i]
    cos_theta = 1
    if uu!=0 and vv!=0:
        cos_theta = uv/np.sqrt(uu*vv)
    return cos_theta

# This is a cosine similarity (1 = same direction), not a distance
similarity = cosine_similarity_numba(transformed[0], transformed[1])

print(similarity)

I get:

0.8250403321263183

And the function returns 1 when I set both sentences to "I love potatoes".
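
For reference, the same computation can be written without numba as a vectorized NumPy function. This is only a sketch: `cosine_similarity_np` is a hypothetical name, and it keeps the convention of the function above of returning 1 when either vector is all zeros.

```python
import numpy as np

def cosine_similarity_np(u, v):
    """Cosine similarity of two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    assert u.shape == v.shape
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        # Same fallback as the numba version: zero vectors count as identical
        return 1.0
    return float(np.dot(u, v) / denom)
```

It should be usable in the same way, e.g. `cosine_similarity_np(transformed[0], transformed[1])`, though the result may differ in the last few digits because of floating-point accumulation order.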

However, is it possible to combine the sentiment and the semantic similarity into a single measure?

That is, given "I love potatoes" and "I hate potatoes", can I measure their semantic distance together with their sentiment distance?

So that I end up with -1, or at least something close to it?

Perhaps it could be something like: the greater the semantic similarity (cosine similarity), the greater the effect the sentiment difference has on the distance between those 2 points.
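
One rough way to get a signed score is to keep the cosine similarity as the magnitude and flip its sign when the two sentiment values disagree. The sketch below is an assumption, not something this repo provides: `signed_similarity` is a hypothetical helper, and it only looks at the sign of the sentiment unit, so a sentiment value close to zero can flip the result on noise alone.

```python
def signed_similarity(semantic_similarity, sentiment_a, sentiment_b):
    """Cosine similarity whose sign reflects sentiment agreement (hypothetical helper)."""
    agreement = sentiment_a * sentiment_b
    if agreement == 0:
        # At least one sentence has exactly neutral sentiment: no sign information
        return 0.0
    sign = 1.0 if agreement > 0 else -1.0
    return sign * semantic_similarity

# Numbers taken from the output above for 'I hate potatoes' / 'I love potaotes'
score = signed_similarity(0.8250403321263183, -0.16220786, 0.305778)
print(score)
```

With the values above the two sentiment scores have opposite signs, so the score comes out at about -0.825 instead of 0.825.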

youssefavx commented 4 years ago

Here is what I have done so far:

from encoder import Model
import operator
from functools import reduce
import numpy as np
from numba import jit

@jit(nopython=True)
def cosine_similarity_numba(u:np.ndarray, v:np.ndarray):
    assert(u.shape[0] == v.shape[0])
    uv = 0
    uu = 0
    vv = 0
    for i in range(u.shape[0]):
        uv += u[i]*v[i]
        uu += u[i]*u[i]
        vv += v[i]*v[i]
    cos_theta = 1
    if uu!=0 and vv!=0:
        cos_theta = uv/np.sqrt(uu*vv)
    return cos_theta

model = Model()

text = ['I cant stand god', 'I love god']

vectors = model.transform(text)
sentiment_scores = vectors[:, 2388]

semantic_similarity = cosine_similarity_numba(vectors[0], vectors[1])

# Subtract the second sentiment score from the first (there are only 2) to get the 'sentiment distance'
sentiment_distance = reduce(operator.sub, sentiment_scores)

print('Semantic similarity? :', semantic_similarity)

# Use the semantic similarity as a 'weight': the closer the two sentences are in meaning, the more seriously the sentiment distance (the difference between the 2 sentiment scores) should be taken
print('Distance (incorporating sentiment):', sentiment_distance * semantic_similarity)

Result:

2.089 seconds to transform 2 examples
Semantic similarity? : 0.9349196788842933
Distance (incorporating sentiment): -0.5018387177574432

While this gives me a single number that incorporates both, I would prefer a vector that represents both the sentiment and the semantic content if possible, since that would let me calculate this kind of inferred 'opinion distance' against many other vectors.

youssefavx commented 4 years ago

Okay, when I use SentenceBERT instead of the vector from the sentiment neuron, I seem to get much higher accuracy. I think a combination of these 2 models might get me an approximation of what I'm looking for.

The big question for me is: how do I get a single vector that combines these 2 things (a vector representation from SentenceBERT and a sentiment representation from the sentiment neuron) such that the resulting viewpoint similarity behaves like the score above?

This would be useful because I would then be able to do more things like clustering and so on.
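
One possible approach, sketched under assumptions rather than taken from either library: L2-normalize the semantic embedding and append the sentiment value as one extra, weighted dimension. Cosine similarity between two such vectors then drops when the sentiments disagree, and the vectors can be fed to ordinary clustering code. `opinion_vector` and `sentiment_weight` are hypothetical names, the three-dimensional arrays stand in for real SentenceBERT embeddings, and the sentiment values stand in for unit 2388 of `model.transform`.

```python
import numpy as np

def opinion_vector(semantic_vec, sentiment, sentiment_weight=3.0):
    """Unit-length semantic embedding plus one weighted sentiment dimension (hypothetical helper)."""
    semantic_vec = np.asarray(semantic_vec, dtype=np.float64)
    norm = np.linalg.norm(semantic_vec)
    unit = semantic_vec / norm if norm > 0 else semantic_vec
    # sentiment_weight is an assumed tuning knob: larger values let sentiment dominate
    return np.concatenate([unit, [sentiment_weight * sentiment]])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder embeddings and sentiment values, not real model output
love_vec = np.array([0.9, 0.1, 0.4])
hate_vec = np.array([0.8, 0.2, 0.5])

plain = cosine(love_vec, hate_vec)
combined = cosine(opinion_vector(love_vec, 0.31), opinion_vector(hate_vec, -0.16))
strong = cosine(opinion_vector(love_vec, 0.31, sentiment_weight=10.0),
                opinion_vector(hate_vec, -0.16, sentiment_weight=10.0))

print('Semantic only:', plain)
print('With sentiment (weight 3):', combined)
print('With sentiment (weight 10):', strong)
```

The weight controls the trade-off: with a small weight the score stays close to the plain semantic similarity, while a large enough weight pushes sentences with opposite sentiment toward negative similarity. A suitable value would have to be found by experiment on real data.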