ms8r / mpqa

Processing the MPQA Corpus
JSON serialization does not work #1

Open ms8r opened 9 years ago

ms8r commented 9 years ago

The DataFrame cannot be serialized to JSON. I assume the serializer gets tripped up by the text tokens or by insufficient escaping.
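A minimal sketch of one way to attempt the export, assuming the frame has a two-level index (document identifier plus running count) and string columns. The column names, index names and sample values below are hypothetical stand-ins, not the actual contents of the feature file, and this may not address the failure reported here. It flattens the index into ordinary columns and writes one JSON object per row, leaving escaping to Pandas:

```python
import json

import pandas as pd


def frame_to_json_records(df: pd.DataFrame) -> str:
    """Move the index into columns and serialize one JSON object per row."""
    flat = df.reset_index()
    # force_ascii=False keeps non-ASCII tokens readable instead of \u-escaping them.
    return flat.to_json(orient="records", force_ascii=False)


# Hypothetical stand-in for the real feature frame, with awkward tokens
# (embedded quotes and a tab) to exercise the escaping.
demo = pd.DataFrame(
    {
        "token": ['say "no"', "turned\tdown"],
        "polarity": ["negative", "negative"],
    },
    index=pd.MultiIndex.from_tuples(
        [("doc_a", 0), ("doc_a", 1)], names=["doc_id", "n"]
    ),
)

# Round-trip through the standard json module to confirm the output parses.
records = json.loads(frame_to_json_records(demo))
```

Parsing the result back with `json.loads` is a cheap check that quotes and control characters in the tokens were escaped correctly.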

sasaadi commented 8 years ago

Dear Author,

Could you please explain how to unpickle the target file and obtain the information?

Thank you very much, Shima

ms8r commented 8 years ago

Hi,

You will need the Python module Pandas plus its dependencies (most notably NumPy) installed. By far the easiest way to do this is via the Anaconda Python distribution.

With Pandas installed you can simply do the following:

import pandas as pd

feat = pd.read_pickle('mpqa_features.pickle')

feat will be a Pandas DataFrame, a 2-dimensional labelled data structure.

Thanks, Markus

sasaadi commented 8 years ago

Dear Markus,

Thank you very much for your reply.

I managed to transfer the data to a *.mat file, as I need it in MATLAB. I had also asked about some problems, but then I realized my mistake, and now the code is working :)

I would just like to know whether each subjectivity clue contains only one word (as I see in the results), or whether some clues are phrases, which is why you used the words before and after?

(In MPQA corpus, it says that in expressive-subjectivity annotation there are words and phrases that express the state.)

Thank you very much, Shima

sasaadi commented 8 years ago

Hi,

Any idea? :)

ms8r commented 8 years ago

Hi Shima,

Glad you got it to work and converted to MATLAB. To your question: the subjectivity clues ('SC') are all single words only (see file subjclues.tff in the repo). The feature data set was constructed by taking each occurrence of one of the SCs in the annotated MPQA docs and then capturing the polarity of the phrase the SC appears in, based on the annotation in the MPQA corpus. The corresponding record in the feature data set (indexed by document identifier and a running count within the doc) would include

This was intended as training data for a classifier into which you could feed some editorial type text and the classifier would identify Named Entities in the text and provide a score for each NE, indicating whether it was mentioned positively or negatively.
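As a rough illustration of the construction described above (one record per clue occurrence, keyed by document identifier and a running count, carrying the polarity of the annotated phrase that contains the clue), here is a minimal sketch. It is not the code from the repo: the function name, the tuple layout of the annotations, the sample sentence and the clue set are all hypothetical, and recording `None` for clues outside any annotated phrase is an assumption.

```python
import re


def clue_records(doc_id, text, clues, phrase_annotations):
    """Build one record per clue occurrence in `text`.

    `phrase_annotations` is a list of (start, end, polarity) character spans;
    this layout is a simplification chosen for the sketch.
    """
    records = {}
    count = 0  # running count of clue occurrences within this document
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group().lower()
        if word not in clues:
            continue
        polarity = None  # assumption: no enclosing annotated phrase -> None
        for start, end, phrase_polarity in phrase_annotations:
            if start <= match.start() and match.end() <= end:
                polarity = phrase_polarity
                break
        records[(doc_id, count)] = {"clue": word, "polarity": polarity}
        count += 1
    return records


# Hypothetical sample data, not taken from the corpus.
TEXT = "The government refused the deal but praised the protocol."
CLUES = {"refused", "praised"}
phrase = "refused the deal"
start = TEXT.find(phrase)
ANNOTATIONS = [(start, start + len(phrase), "negative")]

records = clue_records("doc_a", TEXT, CLUES, ANNOTATIONS)
```

In this toy input, "refused" falls inside the annotated phrase and inherits its polarity, while "praised" has no enclosing annotation.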

The file Train_and_Test.ipynb in the repo is an IPython notebook that implements a simple Support Vector Machine classifier with this data. Cross-validation actually gives good precision and recall scores (>75%).
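For readers without the notebook at hand, the general idea of cross-validating an SVM on per-clue feature records can be sketched with scikit-learn as follows. This is not the notebook's code: the feature names, the toy samples and the choice of `LinearSVC` with a `DictVectorizer` are assumptions made for illustration, and the scores on this toy data say nothing about the figures quoted above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: each sample is a clue plus the preceding word.
POSITIVE = ["praised", "welcomed", "supported"]
NEGATIVE = ["refused", "rejected", "criticized"]
PREV_WORDS = ["government", "minister"]

features, labels = [], []
for label, clues in (("positive", POSITIVE), ("negative", NEGATIVE)):
    for clue in clues:
        for prev in PREV_WORDS:
            features.append({"clue": clue, "prev": prev})
            labels.append(label)

# DictVectorizer one-hot encodes the string features; LinearSVC is a linear SVM.
model = make_pipeline(DictVectorizer(), LinearSVC())

# 3-fold stratified cross-validation reporting macro precision and recall.
scores = cross_validate(
    model, features, labels, cv=3, scoring=("precision_macro", "recall_macro")
)
```

`scores["test_precision_macro"]` and `scores["test_recall_macro"]` each hold one value per fold.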

Out of curiosity: would you mind sharing what you plan to use the data for? I haven't found the time yet to develop this further but it's good to hear that it is (hopefully) useful for someone else.

Thanks, Markus

sasaadi commented 8 years ago

Dear Markus,

Thank you very much for the reply.

Now I know the procedure. The reason that I am looking for phrase level polarity, is that:

I am doing research in phrase-level sentiment analysis, and the target corpus is MPQA. The MPQA dataset contains phrase-level annotations for expressive-subjectivity and direct-subjectivity (for example, the phrase "turned down" with negative sentiment and low intensity), and I should consider them in the training phase.

I tried to modify the code to extract those phrases with their polarity and intensity as training examples (in addition to the SCs) and add them to the pickle file. That is, besides the "subjclues.tff" words, I need to save the phrases that appear in the corpus, together with their annotations, in a document file. But as I am not familiar with Python or the code, I ran into problems. I would appreciate any help with this :)

Thank you for providing this code. It is certainly useful to more than one person! Shima

ms8r commented 8 years ago

Do you have a list of the phrases you're looking to include in the training set? I might have some time over the Easter holidays, and it shouldn't be too hard to modify the code to also work with phrases (as long as these don't span sentence boundaries). I have been meaning to pick this up again anyway, and this would be a good opportunity.

sasaadi commented 8 years ago

In fact, the phrases I am looking for are just those that appear in the training examples (the MPQA corpus), as in the following document that I picked from the dataset:

"Amman -- Jordan, bound by environmental clauses in the Free Trade Agreement (FTA) with the US, is currently mulling over signing the Kyoto Protocol on global warming even though Washington turned down the protocol six months ago. Having clauses on the protection of environmental and labour rights in the body of the FTA does not stop the government from signing environmental agreements refused by the US, according to Minister of Municipal and Rural Affairs and the Environment Abdul Razzaq Tbeishat. He explained to The Jordan Times that "in Jordan, we conclusively believe that development cannot be achieved unless the environment is protected." Certain steps have to be taken with regard to this objective -- including signing the Kyoto deal, he said. ....."

For the highlighted phrases, the MPQA corpus provides direct-subjectivity or expressive-subjectivity annotations (polarity and intensity). I would just like to extract them from the dataset and use them in the same way as the SCs (that is, convert them from a pickle file to a .mat file and so on). I have no prior polarity for them!
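The extraction described above might be sketched roughly as follows, assuming the annotations have already been read into character-offset spans. The `PhraseAnnotation` class, the naive sentence splitter and the abridged sample text are hypothetical and do not reflect the corpus file format or the repo's code. Only "turned down" with negative polarity and low intensity comes from this thread; the annotation kind assigned to it is a placeholder. Phrases that cross a sentence boundary are dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhraseAnnotation:
    start: int      # offset of the first character of the phrase
    end: int        # offset one past the last character
    kind: str       # placeholder label, e.g. "direct-subjective"
    polarity: str
    intensity: str


def sentence_spans(text):
    """Naive splitter: a sentence ends at '.', '!' or '?' followed by a space."""
    spans, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?" and (i + 1 == len(text) or text[i + 1] == " "):
            spans.append((start, i + 1))
            start = i + 2  # skip the space after the terminator
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def phrase_examples(text, annotations):
    """Keep annotated phrases that lie entirely within one sentence."""
    sentences = sentence_spans(text)
    rows = []
    for ann in annotations:
        if any(s <= ann.start and ann.end <= e for s, e in sentences):
            rows.append({
                "phrase": text[ann.start:ann.end],
                "kind": ann.kind,
                "polarity": ann.polarity,
                "intensity": ann.intensity,
            })
    return rows


# Abridged sample text; offsets are computed here rather than read from the corpus.
TEXT = ("Washington turned down the protocol six months ago. "
        "Certain steps have to be taken with regard to this objective.")
pos = TEXT.find("turned down")
within = PhraseAnnotation(pos, pos + len("turned down"),
                          "direct-subjective", "negative", "low")
# Artificial span that crosses the sentence boundary, to show the filter.
crossing = PhraseAnnotation(TEXT.find("ago"),
                            TEXT.find("Certain") + len("Certain"),
                            "direct-subjective", "neutral", "low")

rows = phrase_examples(TEXT, [within, crossing])
```

The resulting list of dictionaries can be turned into a DataFrame with `pd.DataFrame(rows)` and handled like the single-word clue records.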

I really would appreciate your help, Thanks, Shima

Best Regards, Shima Asaadi,

ms8r commented 8 years ago

Opening a new issue #2 for this topic