
AutoPhrase: Automated Phrase Mining from Massive Text Corpora
Apache License 2.0

Publications

Please cite the following two papers if you are using our tools. Thanks!

Recent Changes

2020.06.14

2018.03.04

2017.10.23

New Features

(compared to SegPhrase)

Related GitHub Repositories

Requirements

Linux or MacOS with g++ and Java installed.

Ubuntu:
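
A typical setup looks like the following (the package names are assumptions; any recent g++ and JDK should work):

$ sudo apt-get update
$ sudo apt-get install g++ default-jdk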

MacOS:
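
For example, with Homebrew (the formula names are assumptions):

$ brew install gcc openjdk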

Default Run

Phrase Mining Step

$ ./auto_phrase.sh

The default run will download an English corpus from the server of our data mining group and run AutoPhrase to get 3 ranked lists of phrases as well as 2 segmentation model files under the MODEL (i.e., models/DBLP) directory.

You can change RAW_TRAIN to your own corpus, and you may also want to change MODEL to a different name.
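
For example, assuming both variables can be overridden from the environment (as the Docker examples below do; the paths here are hypothetical):

$ RAW_TRAIN=data/my_corpus.txt MODEL=models/MyCorpus ./auto_phrase.sh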

Phrasal Segmentation

We also provide an auxiliary function to highlight the phrases in context based on our phrasal segmentation model. There are two thresholds you can tune at the top of the script. The model can also handle unknown tokens (i.e., tokens that did not occur in the phrase mining step's corpus).

In the beginning, you need to specify AutoPhrase's segmentation model, i.e., MODEL. The default value is set to be consistent with auto_phrase.sh.

$ ./phrasal_segmentation.sh

The segmentation results will be put under the MODEL directory as well (i.e., models/DBLP/segmentation.txt). The highlighted phrases will be enclosed in phrase tags (e.g., <phrase>data mining</phrase>).
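
For illustration, a line of the segmented output might look like this (the sentence is made up):

We develop a <phrase>data mining</phrase> system based on <phrase>phrasal segmentation</phrase>.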

Incorporate Domain-Specific Knowledge Bases

If domain-specific knowledge bases are available, such as MeSH terms, there are two ways to incorporate them.

Handle Other Languages

Tokenizer and POS tagger

In fact, our tokenizer supports many different languages, including Arabic (AR), German (DE), English (EN), Spanish (ES), French (FR), Italian (IT), Japanese (JA), Portuguese (PT), Russian (RU), and Chinese (CN). If the language detection is wrong, you can also manually specify the language by modifying the TOKENIZER command in the bash script auto_phrase.sh, using the two-letter code for that language. For example, the following forces the language to be English.

TOKENIZER="-cp .:tools/tokenizer/lib/*:tools/tokenizer/resources/:tools/tokenizer/build/ Tokenizer -l EN"

We also provide a default tokenizer together with a dummy POS tagger in tools/tokenizer. It uses the StandardTokenizer from Lucene and always assigns the tag UNKNOWN to each token. To enable this feature, please add -l OTHER to the TOKENIZER command in the bash script auto_phrase.sh.

TOKENIZER="-cp .:tools/tokenizer/lib/*:tools/tokenizer/resources/:tools/tokenizer/build/ Tokenizer -l OTHER"

If you want to incorporate your own tokenizer and/or POS tagger, please create a new class extending SpecialTagger in tools/tokenizer. You may refer to StandardTagger as an example.

stopwords.txt

You may try to search online or create your own list.
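
To our knowledge, the expected format is one stopword per line, e.g. (illustrative English entries):

the
of
and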

wiki_all.txt and wiki_quality.txt

Meanwhile, you have to add two lists of quality phrases as data/OTHER/wiki_quality.txt and data/OTHER/wiki_all.txt. The phrases in wiki_quality.txt should be high-confidence quality phrases, while wiki_all.txt, as its superset, can be a little noisy. For more details, please refer to tools/wiki_entities.
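
To our knowledge, both files are plain lists with one phrase per line; illustrative entries:

support vector machine
named entity recognition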

Use an already tokenized/preprocessed and POS tagged corpus

You can also use AutoPhrase with an already tokenized and tagged corpus. For this, you need to:

In phrasal_segmentation.sh:

time java $TOKENIZER -m direct_test -i $TEXT_TO_SEG -o $TOKENIZED_TEXT_TO_SEG -t $TOKEN_MAPPING -c N -thread $THREAD -delimiters "\n\t "


Note also that, by using such custom input, you can lemmatize or stem your tokens beforehand and keep the already computed POS tags unchanged.
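
For instance, the tokenizer can be invoked in direct_test mode on such a corpus as follows (the input and output paths are hypothetical; $TOKENIZER and $TOKEN_MAPPING are assumed to be set as in the scripts):

$ java $TOKENIZER -m direct_test -i data/tagged_corpus.txt -o tmp/tokenized_corpus.txt -t $TOKEN_MAPPING -c N -thread 4 -delimiters "\n\t "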

## Docker

### Default Run

sudo docker run -v $PWD/models:/autophrase/models -it \
    -e ENABLE_POS_TAGGING=1 \
    -e MIN_SUP=30 -e THREAD=10 \
    remenberl/autophrase

./auto_phrase.sh


The results will be available in the ```models``` folder. Note that all of the environment variables above have their default values--leaving the assignments out here would produce exactly the same results. (However, in this case, using default values, the results of ```phrasal_segmentation.sh``` would be saved to the internal ```default_models``` directory--this is unavoidable, since the phrasal segmentation app reads from and writes to the same model directory.)

### User Specified Input

Assuming the path to the input file is ./data/input.txt.

sudo docker run -v $PWD/data:/autophrase/data -v $PWD/models:/autophrase/models -it \
    -e RAW_TRAIN=data/input.txt \
    -e ENABLE_POS_TAGGING=1 \
    -e MIN_SUP=30 -e THREAD=10 \
    -e MODEL=models/MyModel \
    -e TEXT_TO_SEG=data/input.txt \
    remenberl/autophrase

./auto_phrase.sh



"RAW_TRAIN" is the training corpus, and "TEXT_TO_SEG" is a corpus whose phrases are to be highlighted--typically, this is the same corpus, but training and phrasal segmentation use two different scripts.  When the user wants to segment a new corpus with an existing model, only the latter script need be used (and setting "RAW_TRAIN" isn't necessary).

Note that, in a Docker deployment, the (default) ```data``` and ```models``` directories are renamed to ```default_data``` and ```default_models```, respectively, to avoid conflicts with
mounted external directories with the same names. It should be noted as well that there's little point in saving a model to the default models directory, since all new files are erased when
the container is exited (and if an external directory is mounted as "models", and no value is specified for "MODEL", the results will be saved in the "models/DBLP" subdirectory). The same
wrinkle also means that there's little point in running a container with the "FIRST_RUN" variable set to 0.

Because the original data directory will have been renamed, it's perfectly fine for the user to mount an external directory called "data" and read the corpus from there--and in most
cases, there's no need for a user to change the supplied files stored in the default data directory. If such a change is necessary, though, the environment variable that specifies the
directory in question is "DATA_DIR".
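
For instance, a run that reads the supplied files from a mounted directory instead of ```default_data``` might look like this (the mount and the variable's value are illustrative):

sudo docker run -v $PWD/data:/autophrase/data -it \
    -e DATA_DIR=data \
    remenberl/autophrase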

### In Windows

The ```sudo``` command won't work in a Windows bash shell, and in any case isn't needed in an elevated window--replace it with ```winpty```.

In addition, the ```PWD``` variable works a little oddly in MinGW (the Git bash shell), appending ";C" to the end of the path. To prevent this, replace ```$PWD/models:/autophrase/models``` with ```"/${PWD}/models":/autophrase/models```, and ```$PWD/data:/autophrase/data``` with ```"/${PWD}/data":/autophrase/data```.
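
Putting both changes together, the default-run command adapted for a MinGW shell becomes:

winpty docker run -v "/${PWD}/models":/autophrase/models -it \
    -e ENABLE_POS_TAGGING=1 \
    -e MIN_SUP=30 -e THREAD=10 \
    remenberl/autophrase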