ms8r / mpqa

Processing the MPQA Corpus

n-gram subjectivity clues #2

Closed ms8r closed 7 years ago

ms8r commented 8 years ago

Extend list of subjectivity clues from single words to bi-grams

ms8r commented 8 years ago

@shiAs wrote in #1 :

In fact, the phrases I am looking for are just those phrases in the training examples (MPQA corpus), like the following document that I picked from the dataset...

OK - unless I'm overlooking something, that leaves the question of how we'll identify and extract those phrases from the corpus and how we'll assign prior polarity. Over the weekend I'll have a look at whether we can use the expressive subjectivity annotations in the corpus that are limited to short spans (bi-grams).

Overall, I wonder how much you'll gain from this, though. It seems additional "signal" would only come from sentences in which either no single-word subjectivity clue is present, or in which such a clue is independent of the polarity expressed in the bi-gram. In any case, it's worth a try ;-)

sasaadi commented 8 years ago

Well, I think it is exactly as you said: sometimes the polarity of a single word, taken independently of the sentence it appears in, will differ from its polarity when it appears in a bi-gram, tri-gram, etc.

Thanks Shima

sasaadi commented 8 years ago

In fact, my problem is extracting the annotated data from the MPQA corpus (expressive subjectivity and direct-subjective annotations), no matter whether they are single words or bi-grams...

ms8r commented 8 years ago

Hi @shiAs,

OK - check out the branch annotations in this repo, which has the files mpqa_annot.xlsx and mpqa_annot.json.gz. The content is the same in either format: one record for each subjectivity expression flagged in the corpus annotations (independent of the single-token subjectivity clues list). Each record includes the original text snippet the annotation refers to (field ref), plus polarity, intensity, etc. from the corresponding annotation. It also includes a field with the length of the text snippet. You can filter/sort by these and home in on bi-grams, tri-grams, or whatever else. On a cursory scan, only a relatively small subset of these would probably make useful stand-alone subjectivity clues, but there are some nice subtleties in there (e.g. "once prosperous", which as a phrase almost certainly implies negative sentiment).
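A minimal sketch of how the length information could be used to pull out n-grams. This assumes the file holds a JSON array of records; the sample records below are made up for illustration, and any field name other than `ref` should be checked against the actual keys in `mpqa_annot.json.gz`:

```python
import gzip
import json

def load_annotations(path="mpqa_annot.json.gz"):
    """Load the annotation records; assumes the file is a JSON array."""
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        return json.load(fp)

def filter_ngrams(records, n=2):
    """Keep records whose `ref` snippet has exactly n whitespace-separated tokens."""
    return [r for r in records if len(r["ref"].split()) == n]

# Toy records in the assumed shape (the real file has more fields):
sample = [
    {"ref": "once prosperous", "polarity": "negative"},
    {"ref": "good", "polarity": "positive"},
    {"ref": "a complete waste of time", "polarity": "negative"},
]
bigrams = filter_ngrams(sample, n=2)
```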

Let me know if this gets you closer to what you need.

sasaadi commented 8 years ago

Well, that's great. It is exactly what I was looking for. But, as you mentioned, there is no prior polarity for these expressions! I am thinking about how to use them in a useful way.

Thank you very much for the help.


ms8r commented 8 years ago

@shiAs wrote:

But, as you mentioned, there is no prior polarity for these expressions! I am thinking about how to use them in a useful way.

In bi-gram_no_stop.json (branch annotations) you'll find the unique list of all bi-grams (extracted from the full set of annotated phrases) that do not overlap with any of the stop words in short_stop.json. These are just under 2,500 items -- if you dump them into a spreadsheet and manually annotate the ones that express subjectivity, you could probably do it in less than two hours.
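For reference, the pruning described above could be reproduced along these lines. This is a sketch under the assumption that both files hold plain JSON arrays of strings (the actual layout of bi-gram_no_stop.json and short_stop.json may differ); the toy inputs below are illustrative:

```python
def prune_bigrams(bigrams, stop_words):
    """Drop any bi-gram that shares a token with the stop word list;
    return the surviving bi-grams as a sorted, de-duplicated list."""
    stop = {w.lower() for w in stop_words}
    return sorted({bg for bg in bigrams
                   if not (set(bg.lower().split()) & stop)})

# With a toy stop list, only stop-word-free bi-grams survive:
kept = prune_bigrams(
    ["once prosperous", "at work", "the best", "bitterly disappointed"],
    ["at", "the"],
)
```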

Longer phrases (>= 3 tokens) seem less useful as clues, so your biggest gain would probably come from focusing on bi-grams.

If you do end up annotating these, please share them back so we can include them with the subjectivity clues.

sasaadi commented 8 years ago

That's OK. I will do it as soon as possible and share it with you. Just to confirm: do you mean I should do the annotations based on my own judgment of the subjectivity of the phrases?


ms8r commented 8 years ago

@shiAs wrote:

Just to confirm: do you mean I should do the annotations based on my own judgment of the subjectivity of the phrases?

Yes. Even if you just give each a simple + or -, you'll probably get something quite useful (I'd guess that only about a quarter of the phrases carry a stand-alone polarity and need to be tagged). As far as I can tell, the "official" MPQA annotations were done by students, and if you look through them you'll find quite a few questionable ones...

sasaadi commented 8 years ago

Hey Markus,

I just realized something! When we work with phrase-level sentiment analysis, I think it is not very wise to remove stop words before assigning prior polarity -- it differs from single-word sentiment analysis. Consider "don't": "don't be happy/worry!" vs. "be happy/worry!". Or "so": "not so perfect!" vs. "not perfect!".

Does that make sense to you?


ms8r commented 8 years ago

I agree in principle, but I think you have to strike a balance here. If you do no stop word pruning at all, you end up with about 10x as many candidates for phrase-level subjectivity clues, and a cursory scan shows that most of them do not indicate any subjectivity expression ("at work"). Hence I iterated: starting with an empty stop word list, I successively added words that seemed to cause a high number of "false positives". By doing so I have probably also suppressed a few valid phrase-level subjectivity clues. In the end it is a kind of pragmatic manual tuning to filter a signal out of the noise -- assuming you're not in the mood to manually tag 25,000+ phrases :-)

The stop word list I ended up using is actually quite compact, and much leaner than the usual "standard" stop word lists (which often have >100 entries):

short_stop.json:
---------------
["to", "the", "a", "am", "is", "in", "to", "and", "had", "do", "was", 
"are", "could", "will", "so", "of", "don't", "that", "would", "as", "can", 
"no", "at", "not", "has", "be", "should", "his", "her", "he", "him", "from", 
"with", "for", "by", "or", "where", "with", "this", "were"] 

I've deliberately put the negations (no, not, don't) on the stop list, as you will need to assess these through a different mechanism in the text you're analyzing anyway. Similarly, I have included "so", which -- if used as an amplifier -- normally requires the token it amplifies to be a valid subjectivity clue by itself ("so good"). There are exceptions in colloquial language ("so eighties"), but I doubt you would find many of those in the MPQA corpus in the first place.
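To illustrate the trade-off, a small sketch of the pruning effect (`STOP` abbreviates the short_stop.json list above; the candidate phrases are invented):

```python
# With negations and "so" on the stop list, phrase clues like "not perfect"
# or "so good" are filtered out of the candidates, while stop-word-free
# phrases survive. STOP is an abbreviated stand-in for short_stop.json.
STOP = {"not", "no", "don't", "so", "the", "at"}

def survives(bigram):
    """True if neither token of the bi-gram is on the stop list."""
    return not (set(bigram.lower().split()) & STOP)

candidates = ["once prosperous", "not perfect", "so good", "don't worry"]
kept = [p for p in candidates if survives(p)]
# Negated/amplified phrases are dropped here and must instead be handled
# by a separate negation/amplifier mechanism downstream.
```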

If you need the full list (without any stop word pruning), let me know and I can re-run the extraction without the stop words. Be prepared, though, to plough through a long list of "I am", "should we", ... :-)