Understanding most_predictive_term_in_doc()

iit-cs579 / main

CS579: Online Social Network Analysis at the Illinois Institute of Technology


Understanding most_predictive_term_in_doc() #115

Closed morlowsk closed 8 years ago

morlowsk commented 8 years ago

I am having trouble coding this function because I don't understand what would signify a positive term or a negative term in a document. I know the clf has a coef_ field that gives me the coefficients for all the terms in our vocabulary, but I don't know whether positive coefficients signify positive sentiment and negative coefficients signify negative sentiment.

aronwc commented 8 years ago

positive coefficients -> positive sentiment
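To illustrate why the sign matters: in a binary logistic regression the predicted probability of class 1 is the sigmoid of the intercept plus the weighted sum of the features, so a term with a positive coefficient raises that probability and a term with a negative coefficient lowers it. The sketch below uses plain Python with made-up coefficients; it assumes label 1 means positive sentiment, as the class numbering later in this thread suggests, and `prob_class_1` is a hypothetical helper, not part of the assignment.

```python
import math


def prob_class_1(coef, intercept, x):
    """P(y = 1 | x) under a binary logistic regression model."""
    score = intercept + sum(w * v for w, v in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients for a three-term vocabulary (not from the course data).
coef = [1.5, -2.0, 0.0]
intercept = 0.0

baseline = prob_class_1(coef, intercept, [0, 0, 0])       # no terms present
with_pos_term = prob_class_1(coef, intercept, [1, 0, 0])  # term with a positive coefficient
with_neg_term = prob_class_1(coef, intercept, [0, 1, 0])  # term with a negative coefficient

print(baseline, with_pos_term, with_neg_term)
```

With scikit-learn's LogisticRegression on a two-class problem, clf.coef_ has shape (1, n_features) and a positive decision score predicts clf.classes_[1], so it may be worth checking clf.classes_ to confirm which label that is.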

On Fri, Nov 6, 2015 at 2:43 PM, morlowsk notifications@github.com wrote:

I am having trouble coding this function because I don't understand what would signify a positive term or a negative term in a document. I know the clf has a coef_ field that gives me the coefficients for all the terms in our vocabulary, but I don't know whether positive coefficients signify positive sentiment and negative coefficients signify negative sentiment.


morlowsk commented 8 years ago

Hmm, well I don't understand why I would get this result.

for document data/test/pos/10055_10.txt, the term most predictive of class 0 is can (index=652)
for document data/test/pos/10055_10.txt, the term most predictive of class 1 is best (index=492)

bpraveen92 commented 8 years ago

Try sorting them properly according to class label.

morlowsk commented 8 years ago

How do you sort terms by class label if they only have real-valued coefficients? At the moment, I am just taking the maximum of the coefficients for the terms in the document if class_idx = 1, and the minimum otherwise. I don't understand how it would work for one label but not for the other.

bpraveen92 commented 8 years ago

ya that's what I meant.

aronwc commented 8 years ago

I don't think the term "can" appears in document data/test/pos/10055_10.txt, so perhaps something is wrong with how you're mapping words to indices?

Perhaps clearing memory and running from scratch will resolve some problems.
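One way to test the word-to-index mapping is to invert the vocabulary and confirm two things: that the reported index really maps back to the reported term, and that the term occurs in the document text. The sketch below is only an illustration. The vocabulary dict is hypothetical but has the same shape as CountVectorizer's vocabulary_ attribute (term -> column index); the regex tokenizer and the helper name `check_reported_term` are assumptions, and the assignment's own tokenizer may differ.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters (an assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


# Hypothetical vocabulary: term -> column index, as in CountVectorizer.vocabulary_.
vocabulary = {"best": 0, "can": 1, "film": 2, "worst": 3}

# Invert it so a column index can be turned back into its term.
index_to_term = {idx: term for term, idx in vocabulary.items()}


def check_reported_term(text, index, term):
    """Return a list of problems with a reported (term, index) pair."""
    problems = []
    if index_to_term.get(index) != term:
        problems.append(f"index {index} does not map to {term!r}")
    if term not in tokenize(text):
        problems.append(f"term {term!r} does not appear in the document")
    return problems


doc = "The best film I have seen this year."
print(check_reported_term(doc, 0, "best"))
print(check_reported_term(doc, 1, "can"))
```

If the second kind of problem shows up, the coefficients are probably being compared across the whole vocabulary rather than only across the columns that are nonzero for that document.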

morlowsk commented 8 years ago

Yeah, wiping memory clean and running it again didn't help. Where do we map words to indices again? Things were just fine until this function.

AndrewLu1992 commented 8 years ago

Hi, the coefficients in clf.coef_ (an attribute, not a method call) correspond position by position to the terms learned in do_vectorize(). At this stage each entry of a document's feature vector is a binary value, so rule out all terms that do not appear in the given document. Hope this helps.

hparik11 commented 8 years ago

Hi,

I still don't understand the most_predictive_term_in_doc() function. Could you please elaborate?

AndrewLu1992 commented 8 years ago

The regression model learned by clf has one coefficient per word in the whole vocabulary, but a specific document contains only a small subset of that vocabulary. You need to find the most predictive word within this small subset.
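A minimal sketch of that idea, in plain Python: restrict the search to the column indices that are present in the document, then take the largest coefficient for class 1 or the smallest for class 0. The function name, its signature, and the example data are all hypothetical and are not the assignment's actual interface; the sketch also assumes class 1 is the positive class.

```python
def most_predictive_term_sketch(coef, doc_indices, index_to_term, class_idx):
    """Return (term, index) of the most predictive term among those in one document.

    coef          -- one coefficient per vocabulary term (e.g. clf.coef_[0])
    doc_indices   -- column indices of the terms present in the document
                     (for a SciPy CSR row X[i], X[i].nonzero()[1] gives these)
    index_to_term -- sequence or dict mapping a column index to its term
    class_idx     -- 1 for the positive class, 0 for the negative class
    """
    indices = list(doc_indices)
    if not indices:
        raise ValueError("document contains no vocabulary terms")
    # Largest coefficient pushes hardest toward class 1, smallest toward class 0.
    pick = max if class_idx == 1 else min
    best = pick(indices, key=lambda i: coef[i])
    return index_to_term[best], best


# Made-up vocabulary and coefficients, for illustration only.
terms = ["best", "worst", "film", "can", "great"]
coef = [0.8, -1.2, 0.3, -0.1, 2.0]

# This document contains only "best", "worst" and "film".
doc_indices = [0, 1, 2]

print(most_predictive_term_sketch(coef, doc_indices, terms, 1))
print(most_predictive_term_sketch(coef, doc_indices, terms, 0))
```

Note that "great" has the largest coefficient overall but is never returned here, because it is not one of the document's terms.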

hparik11 commented 8 years ago

Oh okay. Thank you.