
Opinion miner (opinion expressions, targets and holders) based on machine learning
GNU General Public License v2.0

Opinion miner deluxe ++

Introduction

The opinion miner deluxe "plus plus" is an improved version of the machine-learning-based opinion miner, which can be trained on a list of KAF/NAF files. It is important to note that the opinion miner module will not call any external module to obtain features. It reads all the features from the input KAF/NAF file, so you have to make sure in advance that your input file contains all the required information (tokens, terms, polarities, constituents, entities, dependencies...). This is an example of an opinion that the miner would extract from the sentence I said that the hotel is nice, but the staff is the best!!:

    <opinion oid="o2">
      <opinion_holder>
        <!-- I -->
        <span>
          <target id="t1"/>
        </span>
      </opinion_holder>
      <opinion_target>
        <!-- staff -->
        <span>
          <target id="t10"/>
        </span>
      </opinion_target>
      <opinion_expression polarity="DSE" strength="1">
        <!-- the best !! -->
        <span>
          <target id="t12"/>
          <target id="t13"/>
          <target id="t14"/>
        </span>
      </opinion_expression>
    </opinion>
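
As a quick sanity check (not part of this distribution), a small script like the sketch below can verify that an input file already carries those layers before tagging; the layer names used here are the usual NAF element names and may need adjusting for KAF input or for other pipelines.

import sys
from xml.etree import ElementTree as ET

# Layers the opinion miner expects to find already present in the input.
# These element names follow the NAF conventions; adjust them if needed.
REQUIRED_LAYERS = ['text', 'terms', 'entities', 'deps', 'constituency']

def missing_layers(path):
    root = ET.parse(path).getroot()
    return [layer for layer in REQUIRED_LAYERS if root.find(layer) is None]

if __name__ == '__main__':
    missing = missing_layers(sys.argv[1])
    if missing:
        print('Missing layers: ' + ', '.join(missing))
    else:
        print('All required layers are present')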

The task is generally divided into 2 steps: the detection of opinion entities (expressions, targets and holders) and the linking of those entities into complete opinions. Both steps are described below.

Quick Installation

To install this software, run the install_me.sh script included in this repository.

The install_me.sh script will download and compile the required dependencies, and it will also download the trained models. You will be asked for a password during the download process; to obtain the password, please email me (you will find my contact details at the end of this documentation). To check that the installation was correct, you can run this command:

cat example_en.naf | tag_file.py -d hotel

You should get a NAF file with opinions in the output. The script assumes the models are stored in a predefined path; you can also specify the path to the folder containing the models with the -f option:

cat example_en.naf | tag_file.py -f path/to/my_model/

Usage of the opinion tagger

The main script for tagging opinions is tag_file.py. It takes a KAF/NAF file on the standard input and accepts several parameters. You can see the parameters by running:

tag_file.py -h
usage: tag_file.py [-h] [-v] (-d DOMAIN | -f PATH_TO_FOLDER) [-log]

Detects opinions in KAF/NAF files

optional arguments:
  -h, --help         show this help message and exit
  -v, --version      show program's version number and exit
  -d DOMAIN          Domain for the model (hotel,news)
  -f PATH_TO_FOLDER  Path to a folder containing the model
  -log               Show log information
  -polarity          Run the polarity (positive/negative) classifier too

Example of use: cat example.naf | tag_file.py -d hotel -polarity
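
If you prefer to drive the tagger from Python instead of the shell, a minimal sketch using the standard subprocess module is shown below. It relies only on the command line documented above; the output filename is just illustrative.

import subprocess

# Pipe a NAF file through tag_file.py, exactly as in the shell example above.
with open('example_en.naf', 'rb') as naf_in:
    result = subprocess.run(
        ['python', 'tag_file.py', '-d', 'hotel', '-polarity'],
        stdin=naf_in,
        stdout=subprocess.PIPE,
        check=True,
    )

# result.stdout now holds the NAF document with the opinions layer added.
with open('example_en.opinions.naf', 'wb') as naf_out:
    naf_out.write(result.stdout)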

Description of the internal process

The next subsections give a brief explanation of these two steps.

Opinion Entity detection

The first step when extracting opinions from text is to determine which portions of text represent the different opinion entities: opinion expressions, opinion targets and opinion holders.

To do this, three different Conditional Random Field (CRF) classifiers have been trained, using by default this set of features: tokens, lemmas, part-of-speech tags, constituent labels, and the polarity of words and entities. These classifiers detect the portions of text representing the different opinion entities.
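
Purely as an illustration of this token-level representation (the module itself ships pre-trained CRF models and does not contain this code), here is a toy sketch using the sklearn_crfsuite package; the feature keys, the PoS tags and the BIO-style labels (B-TARGET, B-EXPRESSION, ...) are assumptions, not necessarily those used by the trained models.

import sklearn_crfsuite

def token_features(tok):
    # tok: a dict with the information read from the KAF/NAF layers
    return {
        'token': tok['token'],
        'lemma': tok['lemma'],
        'pos': tok['pos'],
        'constituent': tok['constituent'],
        'polarity': tok.get('polarity', 'none'),
    }

def sentence_features(sentence):
    return [token_features(tok) for tok in sentence]

# Toy sentence: "the staff is the best"
sent = [
    {'token': 'the',   'lemma': 'the',   'pos': 'DT',  'constituent': 'NP'},
    {'token': 'staff', 'lemma': 'staff', 'pos': 'NN',  'constituent': 'NP'},
    {'token': 'is',    'lemma': 'be',    'pos': 'VBZ', 'constituent': 'VP'},
    {'token': 'the',   'lemma': 'the',   'pos': 'DT',  'constituent': 'NP'},
    {'token': 'best',  'lemma': 'best',  'pos': 'JJS', 'constituent': 'NP', 'polarity': 'positive'},
]
labels = ['O', 'B-TARGET', 'O', 'B-EXPRESSION', 'I-EXPRESSION']

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit([sentence_features(sent)], [labels])
print(crf.predict([sentence_features(sent)]))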

Opinion Entity linking

This step takes as input the opinion entities detected in the previous step and links them to create the final opinions <expression/target/holder>. In this case we have trained two binary Support Vector Machine (SVM) classifiers: one that indicates the degree of association between a given target and a given expression, and another that gives the degree of linkage between a holder and an opinion expression. So, given the lists of expressions, targets and holders detected by the CRF classifiers, the SVM models try to select, for each expression, the best candidate from the target list and the best holder from the holder list, to create the final opinion triple.
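
The candidate selection can be pictured with a sketch like the one below, where the decision scores of a scikit-learn-style binary classifier stand in for whatever SVM engine is actually used, and vectorize is a hypothetical helper that turns an (expression, candidate) pair into the numeric feature vector described next.

def best_candidate(expression, candidates, vectorize, svm):
    # Score every (expression, candidate) pair with the binary SVM (any trained
    # classifier exposing decision_function, e.g. sklearn.svm.SVC) and keep the
    # candidate with the highest decision score.
    scores = svm.decision_function([vectorize(expression, c) for c in candidates])
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]

# For each detected expression, the best target and the best holder would then
# be chosen with their respective models, roughly:
#   target, t_score = best_candidate(expr, targets, vectorize, target_svm)
#   holder, h_score = best_candidate(expr, holders, vectorize, holder_svm)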

Considering a certain opinion expression and a target, these are the features used by default to represent this pair for the SVM engine:

1) Textual features: tokens and lemmas of the expression and the target
2) Distance features: features representing the relative distance of both elements in the text (normalized to a discrete list of possible values: far/medium/close, for instance), and whether both elements are in the same sentence or not
3) Dependency features: features indicating the dependency relations between the two elements in the text (the dependency path, and the dependency relations with the root of the sentence)
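
As a rough illustration of these three feature groups (the bucket thresholds and key names below are illustrative, not those of the trained models), an (expression, target) pair could be represented like this:

def distance_bucket(n_tokens):
    # Discretize the token distance into far/medium/close buckets;
    # the thresholds here are only illustrative.
    if n_tokens <= 3:
        return 'close'
    if n_tokens <= 10:
        return 'medium'
    return 'far'

def pair_features(expression, target):
    feats = {}
    # 1) Textual features: tokens and lemmas of both elements
    feats['exp_tokens'] = ' '.join(t['token'] for t in expression['tokens'])
    feats['exp_lemmas'] = ' '.join(t['lemma'] for t in expression['tokens'])
    feats['tgt_tokens'] = ' '.join(t['token'] for t in target['tokens'])
    feats['tgt_lemmas'] = ' '.join(t['lemma'] for t in target['tokens'])
    # 2) Distance features: discretized relative distance and same-sentence flag
    feats['distance'] = distance_bucket(abs(expression['first_token'] - target['first_token']))
    feats['same_sentence'] = expression['sentence'] == target['sentence']
    # 3) Dependency features: path between the two heads and relation to the root
    feats['dep_path'] = expression.get('dep_path', {}).get(target['id'], 'none')
    feats['dep_to_root'] = expression.get('dep_to_root', 'none')
    return feats

In practice, string-valued features like these would be one-hot encoded (for example with scikit-learn's DictVectorizer) before being passed to the SVM.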

Training

To be completed...

Contact