Extend list of subjectivity clues from single words to bi-grams
Closed: ms8r closed this issue 7 years ago
@shiAs wrote in #1:
In fact, the phrases I am looking for are just the phrases in the training examples (MPQA corpus), like the following document that I picked from the dataset...
OK - unless I'm overlooking something, that leaves the question of how we'll identify and extract those phrases from the corpus and how we'll assign prior polarity. I'll have a look over the weekend at whether we can use the expressive subjectivity annotations in the corpus that are limited to short spans (bi-grams).
Overall I wonder how much you'll gain by this, though. It seems additional "signal" would only come from sentences in which either no other single-word subjectivity clue is present, or in which such a single-word subjectivity clue is independent of the polarity expressed in the bi-gram. In any case, it's worth a try ;-)
Well, I think it is exactly as you said: sometimes the polarity of a single word, taken independently of the sentence it appears in, will differ from its polarity when it appears in a bi-gram, tri-gram, and so on.
Thanks Shima
In fact, my problem is extracting the annotated data of the MPQA corpus (expressive subjectivity and direct-subjective annotations), no matter whether they are single words or bi-grams...
Hi @shiAs,
OK - check out the branch annotations in this repo, which has the files mpqa_annot.xlsx and mpqa_annot.json.gz. The content is the same in either format: one record for each subjectivity expression flagged in the corpus annotations (independent of the single-token subjectivity clues list). Each record includes the original text snippet the annotation refers to (field ref) plus polarity, intensity, etc. from the corresponding annotation. It also includes a field with the length of the text snippet. You can filter / sort by these and home in on bi-grams, tri-grams or whatever else. On a cursory scan, only a relatively small subset of these would probably make useful stand-alone subjectivity clues, but there are some nice subtleties in there (e.g. "once prosperous", which as a phrase almost certainly implies negative sentiment).
Let me know if this gets you closer to what you need.
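For example, a quick sketch in Python of how the json.gz file could be loaded and narrowed down to bi-grams (it assumes mpqa_annot.json.gz is a plain JSON array of records with the ref, polarity and intensity fields described above; adjust the key names if the actual file differs, and note that the token count is derived from ref rather than relying on the length field):

load_bigrams.py (sketch):
-------------------------
import gzip
import json

import pandas as pd

# Assumes mpqa_annot.json.gz is a plain JSON array of records with the
# fields described above (ref, polarity, intensity, ...).
with gzip.open("mpqa_annot.json.gz", "rt", encoding="utf-8") as fp:
    records = json.load(fp)

df = pd.DataFrame(records)

# Derive the token count from the snippet itself and keep only bi-grams.
df["n_tokens"] = df["ref"].str.split().str.len()
bigrams = df[df["n_tokens"] == 2]
print(bigrams[["ref", "polarity", "intensity"]].head())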
Well, that's great. It is exactly what I was looking for. But, as you mentioned, there is no prior polarity for these expressions! I am thinking about how to use them in a useful way.
Thank you very much for the help.
Best Regards, Shima Asaadi,
@shiAs wrote:
But, as you mentioned, there is no prior polarity for these expressions! I am thinking about how to use them in a useful way.
In bi-gram_no_stop.json (branch annotations) you'll find the unique list of all bi-grams (extracted from the full set of annotated phrases) which do not overlap with any of the stop words in short_stop.json. These are just under 2,500 items -- if you dump these into a spreadsheet and manually annotate the ones that express subjectivity, you could probably do this in less than 2 hours.
Longer phrases (>= 3 tokens) seem less useful as clues, so your biggest gain would probably come from focusing on bi-grams.
If you do end up annotating these, please share them back so we can include them with the subjectivity clues.
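If it helps, here is a small sketch of how the spreadsheet could be prepared, assuming bi-gram_no_stop.json is a flat JSON list of bi-gram strings (tweak accordingly if the structure differs):

prep_annotation_sheet.py (sketch):
----------------------------------
import csv
import json

# Assumes bi-gram_no_stop.json is a flat JSON list of bi-gram strings.
with open("bi-gram_no_stop.json", encoding="utf-8") as fp:
    bigrams = json.load(fp)

# Write a two-column CSV that can be opened in any spreadsheet tool and
# annotated by hand (e.g. "+", "-", or blank for no stand-alone polarity).
with open("bigram_annotation.csv", "w", newline="", encoding="utf-8") as fp:
    writer = csv.writer(fp)
    writer.writerow(["bigram", "polarity"])
    for bigram in sorted(bigrams):
        writer.writerow([bigram, ""])

print(f"wrote {len(bigrams)} bi-grams for manual annotation")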
That's OK. I will do it as soon as possible and share it with you. Just to be sure: do you mean I should do the annotations based on my own judgment of the subjectivity of the phrases?
Best Regards, Shima Asaadi,
@shiAs wrote:
Just to be sure: do you mean I should do the annotations based on my own judgment of the subjectivity of the phrases?
Yes, even if you just give it a simple + or - you'll probably get something quite useful (I guess that only about 1/4 of the phrases carry a stand-alone polarity and need to be tagged). As far as I can tell, the "official" MPQA annotations were done by students, and if you look through them you find quite a few questionable ones...
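Once you have tagged the sheet, reading the + / - rows back in is straightforward; a sketch, using the hypothetical bigram_annotation.csv file and column names from the sketch above:

collect_clues.py (sketch):
--------------------------
import csv

# Read the manually annotated sheet back in (hypothetical file and column
# names from the sketch above) and keep only bi-grams tagged "+" or "-".
clues = {}
with open("bigram_annotation.csv", encoding="utf-8") as fp:
    for row in csv.DictReader(fp):
        polarity = row["polarity"].strip()
        if polarity in ("+", "-"):
            clues[row["bigram"]] = "positive" if polarity == "+" else "negative"

print(f"{len(clues)} bi-gram clues with stand-alone polarity")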
Hey Markus,
I just realized something! When we work with phrase-level sentiment analysis, I think it is not very wise to remove stop words when assigning prior polarity; it differs from single-word sentiment analysis. Consider "don't": "don't be happy/worry!" vs. "be happy/worry!"; or "so": "not so perfect!" vs. "not perfect!".
Does that make sense to you?
Best Regards, Shima Asaadi,
I agree in principle, but I think you have to strike a balance here. If you do no stop word pruning at all, you end up with about 10x as many candidates for phrase-level subjectivity clues. A cursory scan of those shows that most do not indicate any subjectivity expression ("at work"). Hence I iterated, starting with an empty stop word list and then successively adding words that seemed to cause a high number of "false positives". By doing so I have probably also suppressed a few valid phrase-level subjectivity clues. In the end it is a kind of pragmatic manual tuning to filter a signal out of the noise - assuming you're not in the mood to manually tag 25,000+ phrases :-)
The stop word list I ended up using is actually quite compact, much leaner than the usual "standard" stop word lists (which often have >100 entries):
short_stop.json:
---------------
["to", "the", "a", "am", "is", "in", "to", "and", "had", "do", "was",
"are", "could", "will", "so", "of", "don't", "that", "would", "as", "can",
"no", "at", "not", "has", "be", "should", "his", "her", "he", "him", "from",
"with", "for", "by", "or", "where", "with", "this", "were"]
I've deliberately left the negations (no, not, don't) in the stop list, as you will need to assess negation through a different mechanism in the text you're analyzing anyway. Similarly, I have kept "so", which -- if used as an amplifier -- normally requires the token it amplifies to be a valid subjectivity clue by itself ("so good"). There are exceptions in colloquial language ("so eighties"), but I doubt you would find many of those in the MPQA corpus in the first place.
If you need the full list (without any stop word pruning) let me know and I can re-run it without the stop words. Be prepared though to plough through a long list of "I am", "should we", ... :-)
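For what it's worth, the pruning itself is simple enough to re-run with whatever stop list you prefer; roughly along these lines (a sketch, not the exact script I used -- "phrases" would be the annotated snippets from mpqa_annot.json.gz):

prune_bigrams.py (sketch):
--------------------------
import json

# Rough sketch of the pruning described above (not the exact script used).
with open("short_stop.json", encoding="utf-8") as fp:
    stop_words = set(json.load(fp))

def bigram_candidates(phrases):
    """Yield unique bi-grams from the phrases, skipping any bi-gram in
    which either token is a stop word."""
    seen = set()
    for phrase in phrases:
        tokens = phrase.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            if first in stop_words or second in stop_words:
                continue
            bigram = f"{first} {second}"
            if bigram not in seen:
                seen.add(bigram)
                yield bigram

# Tiny example: "at work" and "be happy" are pruned, "once prosperous" survives.
sample = ["once prosperous region", "at work", "be happy"]
print(list(bigram_candidates(sample)))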