tsproisl / SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts.
GNU General Public License v3.0

SoMaJo


Introduction

echo 'Wow, superTool!;)' | somajo-tokenizer -c -
Wow
,
super
Tool
!
;)

SoMaJo is a rule-based tokenizer and sentence splitter that implements tokenization guidelines for German and English. It has a strong focus on web and social media texts: it was originally created as the winning submission to the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication and social media. It is particularly well suited to tokenizing all kinds of written discourse, for example chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues. It also works on more formal texts.

Version 1 of the tokenizer is described in greater detail in Proisl and Uhrig (2016).

For part-of-speech tagging (in particular of German web and social media texts), we recommend SoMeWeTa:

somajo-tokenizer --split_sentences <file> | somewe-tagger --tag <model> -

Features

Installation

SoMaJo can be easily installed using pip (pip3 in some distributions):

pip install -U SoMaJo

Alternatively, you can download and decompress the latest release or clone the git repository:

git clone https://github.com/tsproisl/SoMaJo.git

In the new directory, run the following command:

pip install -U .

Usage

Using the somajo-tokenizer executable

You can use the tokenizer as a standalone program from the command line. General usage information is available via the -h option:

somajo-tokenizer -h
usage: somajo-tokenizer [-h] [-l {en_PTB,de_CMC}]
                        [-s {single_newlines,empty_lines}] [-x] [--tag TAG]
                        [--prune PRUNE] [--strip-tags] [-c]
                        [--split_sentences] [--sentence_tag SENTENCE_TAG] [-t]
                        [-e] [--parallel N] [-v]
                        FILE

A tokenizer and sentence splitter for German and English texts. Currently, two
tokenization guidelines are implemented: The EmpiriST guidelines for German
web and social media texts (de_CMC) and the "new" Penn Treebank conventions
for English texts (en_PTB).

positional arguments:
  FILE                  The input file (UTF-8-encoded) or "-" to read from
                        STDIN.

options:
  -h, --help            show this help message and exit
  -l {en_PTB,de_CMC}, --language {en_PTB,de_CMC}
                        Choose a language. Currently supported are German
                        EmpiriST-style tokenization (de_CMC) and English
                        Penn-Treebank-style tokenization (en_PTB). (Default:
                        de_CMC)
  -s {single_newlines,empty_lines}, --paragraph_separator {single_newlines,empty_lines}
                        How are paragraphs separated in the input text? Will
                        be ignored if option -x/--xml is used. (Default:
                        empty_lines)
  -x, --xml             The input is an XML file. You can specify tags that
                        always constitute a sentence break (e.g. HTML p tags)
                        via the --tag option.
  --tag TAG             Start and end tags of this type constitute sentence
                        breaks, i.e. they do not occur in the middle of a
                        sentence. Can be used multiple times to specify
                        multiple tags, e.g. --tag p --tag br. Implies option
                        -x/--xml. (Default: --tag title --tag h1 --tag h2
                        --tag h3 --tag h4 --tag h5 --tag h6 --tag p --tag br
                        --tag hr --tag div --tag ol --tag ul --tag dl --tag
                        table)
  --prune PRUNE         Tags of this type will be removed from the input
                        before tokenization. Can be used multiple times to
                        specify multiple tags, e.g. --prune script --prune style.
                        Implies option -x/--xml. By default, no tags are
                        pruned.
  --strip-tags          Suppresses output of XML tags. Implies option
                        -x/--xml.
  -c, --split_camel_case
                        Split items written in camelCase (excluding
                        established names and terms).
  --split_sentences, --split-sentences
                        Also split the input into sentences.
  --sentence_tag SENTENCE_TAG, --sentence-tag SENTENCE_TAG
                        Tag name for sentence boundaries (e.g. --sentence_tag
                        s). If this option is specified, sentences will be
                        delimited by XML tags (e.g. <s>…</s>) instead of empty
                        lines. This option implies --split_sentences.
  -t, --token_classes   Output the token classes (number, XML tag,
                        abbreviation, etc.) in addition to the tokens.
  -e, --extra_info      Output additional information for each token:
                        SpaceAfter=No if the token was not followed by a space
                        and OriginalSpelling="…" if the token contained
                        whitespace.
  --character-offsets   Output character offsets in the input for each token.
  --parallel N          Run N worker processes (up to the number of CPUs) to
                        speed up tokenization.
  -v, --version         Output version information and exit.
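
The tokenizer writes one token per line and, with --split_sentences, separates sentences by empty lines. A minimal sketch of reading such output back into Python follows. The tab-separated column layout for the -t and -e options is an assumption (the help text does not state it), parse_tokenizer_output is a hypothetical helper name, and the sample string is hand-written rather than actual tool output.

```python
def parse_tokenizer_output(output):
    """Group tokenizer output into sentences.

    Each sentence is a list of rows; each row is the list of columns found
    on one output line (columns are assumed to be tab-separated).
    """
    sentences = []
    current = []
    for line in output.splitlines():
        if line.strip() == "":
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in the assumed layout (not real tool output)
sample = "Wow\n,\nsuper\nTool\n!\n;)\n\nNoch\nein\nSatz\n.\n"
parsed = parse_tokenizer_output(sample)
for sentence in parsed:
    print([row[0] for row in sentence])
```

If --sentence_tag is used instead, sentences are delimited by XML tags rather than empty lines, so this helper would not apply.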

Here are some common use cases:

Using the module

Take a look at the API documentation.

You can incorporate SoMaJo into your own Python projects. All you need to do is import somajo, create a SoMaJo object and call one of its tokenizer functions: tokenize_text, tokenize_text_file, tokenize_xml or tokenize_xml_file. These functions return a generator that yields tokenized chunks of text. By default, these chunks are sentences. If you set split_sentences=False, the chunks are either paragraphs or chunks of XML. Every tokenized chunk is a list of Token objects.

Here is an example of tokenizing and sentence splitting two paragraphs:

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC", split_camel_case=True)

# note that paragraphs are allowed to contain newlines
paragraphs = ["der beste Betreuer?\n-- ProfSmith! : )",
              "Was machst du morgen Abend?! Lust auf Film?;-)"]

sentences = tokenizer.tokenize_text(paragraphs)
for sentence in sentences:
    for token in sentence:
        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")
    print()
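
The extra_info string printed above records, among other things, whether a token was followed by a space (the SpaceAfter=No marker described in the command-line help). A rough sketch of using that information to rebuild a readable string follows. It assumes that each token object also exposes a boolean space_after attribute; the stand-in objects below are hypothetical and only mimic the two attributes used, so the sketch runs without SoMaJo installed.

```python
from types import SimpleNamespace


def detokenize(tokens):
    """Join token texts, inserting a space only where space_after is true."""
    pieces = []
    for token in tokens:
        pieces.append(token.text)
        if token.space_after:
            pieces.append(" ")
    # Drop the trailing space left by the last token, if any.
    return "".join(pieces).rstrip()


# Hypothetical stand-ins for Token objects (only text and space_after)
sentence = [
    SimpleNamespace(text="Lust", space_after=True),
    SimpleNamespace(text="auf", space_after=True),
    SimpleNamespace(text="Film", space_after=False),
    SimpleNamespace(text="?", space_after=False),
    SimpleNamespace(text=";-)", space_after=True),
]
print(detokenize(sentence))  # prints: Lust auf Film?;-)
```

With real tokenizer output, the same function could be applied to each sentence yielded by tokenize_text, provided the attribute assumption holds for your installed version.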

And here is an example of tokenizing and sentence splitting a whole file. The option paragraph_separator="single_newlines" states that paragraphs are delimited by single newlines instead of empty lines:

sentences = tokenizer.tokenize_text_file("Beispieldatei.txt", paragraph_separator="single_newlines")
for sentence in sentences:
    for token in sentence:
        print(token.text)
    print()
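
To make the difference between the two paragraph_separator values concrete, here is a small illustration in plain Python. It is not SoMaJo's implementation; split_paragraphs is a hypothetical helper that only mimics the behaviour described above.

```python
import re


def split_paragraphs(text, paragraph_separator="empty_lines"):
    """Split text into paragraphs using one of the two separator modes."""
    if paragraph_separator == "empty_lines":
        # One or more blank (possibly whitespace-only) lines end a paragraph.
        chunks = re.split(r"\n\s*\n", text)
    elif paragraph_separator == "single_newlines":
        # Every newline ends a paragraph.
        chunks = text.split("\n")
    else:
        raise ValueError(f"unknown paragraph separator: {paragraph_separator}")
    # Discard empty chunks and surrounding whitespace.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


text = "der beste Betreuer?\n-- ProfSmith! : )\n\nLust auf Film?;-)"
print(split_paragraphs(text, "empty_lines"))      # two paragraphs
print(split_paragraphs(text, "single_newlines"))  # three paragraphs
```

In the first mode the newline inside the first paragraph is kept; in the second mode it starts a new paragraph.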

For processing XML data, use the tokenize_xml or tokenize_xml_file methods:

eos_tags = ["title", "h1", "p"]

# you can read from an open file object
sentences = tokenizer.tokenize_xml_file(file_object, eos_tags)
# or you can specify a file name
sentences = tokenizer.tokenize_xml_file("Beispieldatei.xml", eos_tags)
# or you can pass a string with XML data
sentences = tokenizer.tokenize_xml(xml_string, eos_tags)

for sentence in sentences:
    for token in sentence:
        print(token.text)
    print()

Evaluation

SoMaJo was the system with the highest average F₁ score in the EmpiriST 2015 shared task. The following table summarizes the performance of the current version on the two test sets (training and test sets are available from the official website):

Corpus    Precision    Recall    F₁
CMC       99.71        99.56     99.64
Web       99.91        99.92     99.91
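
F₁ is the harmonic mean of precision and recall. The short check below recomputes it from the rounded table values. Because the inputs are already rounded to two decimals, the recomputed score can differ from the reported one in the last digit, as it does for the CMC row.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the table above
rows = {
    "CMC": (99.71, 99.56, 99.64),
    "Web": (99.91, 99.92, 99.91),
}

for corpus, (precision, recall, reported) in rows.items():
    recomputed = f1_score(precision, recall)
    print(f"{corpus}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```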

Tokenizing English text

SoMaJo can also tokenize English text. In general, we follow the “new” Penn Treebank conventions described, for example, in the guidelines for ETTB 2.0 (Mott et al., 2009) and CLEAR (Warner et al., 2012).

For tokenizing English text on the command line, specify the language via the -l or --language option:

somajo-tokenizer -l en_PTB <file>

From Python, you can pass language="en_PTB" to the SoMaJo constructor, e.g.:

from somajo import SoMaJo

paragraphs = ["That aint bad!:D"]
tokenizer = SoMaJo(language="en_PTB")
sentences = tokenizer.tokenize_text(paragraphs)

Performance of the English tokenizer:

Corpus                  Precision    Recall    F₁
English Web Treebank    99.66        99.64     99.65

Development

Here are some brief notes to help you get started:

References

If you use SoMaJo for academic research, please consider citing the following paper: