This project implements various measures that assess the linguistic and stylistic complexity of (literary) texts. There are surface-based, sentence-based, POS-based, dependency-based and constituency-based measures. Most of the measures are language-independent, but some of them rely on language-specific information (see language definition files) or are only defined for German (this affects some of the constituency-based measures).
The easiest way to install the toolbox is via pip (pip3 in some distributions):
pip install textcomplexity
Alternatively, you can download and decompress the latest release or clone the git repository:
git clone https://github.com/tsproisl/textcomplexity.git
In the new directory, run the following command:
python3 setup.py install
You can use the script bin/txtcomplexity to compute (a sensible subset of) all implemented complexity measures from the command line. The script currently supports two input formats: the widely used CoNLL-U format (--input-format conllu) and a custom tab-separated input format (--input-format tsv).
The CoNLL-U format consists of ten tab-separated columns that encode, among other things, the dependency structure of the sentence. Missing values can be represented by an underscore (_). Here is an example:
# sent_id = hdt-s469
# text = Netscape hatte den Browser-Markt noch 1994 zu fast 90 Prozent beherrscht .
1 Netscape Netscape PROPN NE _ 11 nsubj _ _
2 hatte haben AUX VAFIN _ 11 aux _ _
3 den den DET ART _ 4 det _ _
4 Browser-Markt Markt NOUN NN _ 11 obj _ _
5 noch noch ADV ADV _ 6 advmod _ _
6 1994 1994 NUM CARD _ 11 obl _ _
7 zu zu ADP APPR _ 10 case _ _
8 fast fast ADV ADV _ 9 advmod _ _
9 90 90 NUM CARD _ 10 nummod _ _
10 Prozent Prozent NOUN NN _ 11 obl _ _
11 beherrscht beherrschen VERB VVPP _ 0 root _ _
12 . . PUNCT $. _ 11 punct _ _
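To check that a file really has the ten-column layout shown above, a small reader can help. The following is only a minimal sketch and not part of the toolbox: the function name `read_conllu` and the field names in `CONLLU_FIELDS` are chosen for illustration, and the sketch ignores CoNLL-U details such as multiword token ranges.

```python
from typing import Dict, Iterable, Iterator, List

# Illustrative names for the ten CoNLL-U columns (an assumption of this sketch).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines such as "# sent_id = ..." carry metadata only.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 tab-separated columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        # The last sentence may not be followed by an empty line.
        yield sentence
```

Passing an open file object, e.g. `read_conllu(open(path, encoding="utf-8"))`, iterates over the sentences of a file. A `ValueError` usually means that tabs were replaced by spaces somewhere along the way.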
If you want to compute the constituency-based complexity measures, the input should be in a custom tab-separated format with six tab-separated columns and an empty line after each sentence. The six columns are: word index, word, part-of-speech tag, index of dependency head, dependency relation, phrase structure tree. Missing values can be represented by an underscore (_). Here is a short example with two sentences:
1 Das ART 3 NK (TOP(S(NP*
2 fremde ADJA 3 NK *
3 Schiff NN 4 SB *)
4 war VAFIN -1 -- *
5 nicht PTKNEG 6 NG (AVP*
6 allein ADV 4 MO *)
7 . $. 6 -- *))

1 Sieben CARD 2 NK (TOP(S(NP*
2 weitere ADJA 3 MO *)
3 begleiteten VVFIN -1 -- *
4 es PPER 3 OA *
5 . $. 4 -- *))
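The six-column format can be sanity-checked in a similar way. The sketch below is an illustration under stated assumptions, not the toolbox's own reader: `Token`, `read_tsv_sentences` and `brackets_balanced` are hypothetical names. It splits the input into sentences at empty lines and checks that the brackets in the phrase structure column of each sentence are balanced.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One line of the six-column format; all values are kept as strings
    because any of them may be the placeholder "_"."""
    index: str
    word: str
    pos: str
    head: str
    deprel: str
    tree: str


def read_tsv_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences, which are separated by empty lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if len(columns) != 6:
            raise ValueError(f"expected 6 tab-separated columns, got {len(columns)}: {line!r}")
        sentence.append(Token(*columns))
    if sentence:
        yield sentence


def brackets_balanced(sentence: List[Token]) -> bool:
    """Check that the tree fragments of a sentence open and close the same number of brackets."""
    depth = 0
    for token in sentence:
        for char in token.tree:
            if char == "(":
                depth += 1
            elif char == ")":
                depth -= 1
                if depth < 0:
                    return False
    return depth == 0
```

For both example sentences above, `brackets_balanced` returns `True`. If it returns `False`, the most likely cause is a missing empty line between two sentences.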
Without any further options, the script computes a sensible subset of all applicable measures (see below):
txtcomplexity --input-format conllu <file>
The script automatically includes measures that rely on language-specific information if you specify the input language. If your texts are in German or English, you can use --lang de or --lang en. If your texts are in another language, use --lang other --lang-def <file> to provide a custom language definition file.
If you want to compute more (or fewer) measures, indicate one of the predefined sets of measures (via --preset). You can choose to ignore punctuation (--ignore-punct) or case (--ignore-case) and set the window size for the surface-based measures (--window-size). By default, the script formats its output as JSON, but you can also request tab-separated values suitable for import into a spreadsheet (--output-format tsv). More detailed usage information is available via:
txtcomplexity -h
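If you want to call the script from Python, for example to process many files, the options described above can be assembled programmatically. This is a minimal sketch: `build_command` and `run_txtcomplexity` are hypothetical helpers, not part of the toolbox, and the sketch assumes that `txtcomplexity` is on your PATH. It only uses the options mentioned in this section; the structure of the resulting JSON is not specified here, so the parsed result is returned as is.

```python
import json
import subprocess
from typing import List, Optional


def build_command(path: str, input_format: str = "conllu",
                  lang: Optional[str] = None, lang_def: Optional[str] = None,
                  ignore_punct: bool = False, ignore_case: bool = False,
                  window_size: Optional[int] = None,
                  output_format: Optional[str] = None) -> List[str]:
    """Assemble the argument list for a txtcomplexity call (hypothetical helper)."""
    cmd = ["txtcomplexity", "--input-format", input_format]
    if lang is not None:
        cmd += ["--lang", lang]
    if lang_def is not None:
        cmd += ["--lang-def", lang_def]
    if ignore_punct:
        cmd.append("--ignore-punct")
    if ignore_case:
        cmd.append("--ignore-case")
    if window_size is not None:
        cmd += ["--window-size", str(window_size)]
    if output_format is not None:
        cmd += ["--output-format", output_format]
    cmd.append(path)
    return cmd


def run_txtcomplexity(path: str, **options):
    """Run the script and parse its default JSON output.

    Do not pass output_format="tsv" here, since the output is parsed as JSON.
    """
    result = subprocess.run(build_command(path, **options),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

For instance, `build_command("novel.conllu", lang="de", ignore_punct=True)` produces the argument list for `txtcomplexity --input-format conllu --lang de --ignore-punct novel.conllu`; the file name is a placeholder.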
Getting the input format right can sometimes be a bit tricky. Therefore, we provide a simple wrapper script around stanza, a state-of-the-art NLP pipeline, which you can find in the utils/ subdirectory of this repository.
First, you need to install stanza:
pip install stanza
Now you can use the wrapper script to parse your text files:
run_stanza.py --language <language> --output-dir <directory> <file> …
In our article on lexical complexity (currently in preparation) we argue that there are several distinct aspects (or dimensions) of lexical complexity and we propose a single measure for each of the dimensions. Most of them are implemented here.
All of these measures correlate perfectly.
Michéa's M is the reciprocal of Sichel's S.
Yule's K, Simpson's D and Herdan's Vm correlate perfectly. Simpson's D is perhaps the most intuitive of the three measures and can be interpreted as the probability that two tokens drawn at random from the text are identical.
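The two statements above can be made concrete with a few lines of Python. The sketch below uses the textbook definitions and is not taken from the toolbox, whose implementation may differ (for example by averaging over windows). It assumes that tokens are drawn without replacement for Simpson's D and that Sichel's S is the proportion of types that occur exactly twice; the function names are illustrative.

```python
from collections import Counter
from typing import Sequence


def simpsons_d(tokens: Sequence[str]) -> float:
    """Probability that two tokens drawn at random (without replacement) are identical."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    frequencies = Counter(tokens).values()
    return sum(f * (f - 1) for f in frequencies) / (n * (n - 1))


def sichel_s(tokens: Sequence[str]) -> float:
    """Proportion of types that occur exactly twice (dis legomena)."""
    if not tokens:
        raise ValueError("need at least one token")
    frequencies = Counter(tokens).values()
    twice = sum(1 for f in frequencies if f == 2)
    return twice / len(frequencies)


def michea_m(tokens: Sequence[str]) -> float:
    """Reciprocal of Sichel's S; undefined if no type occurs exactly twice."""
    s = sichel_s(tokens)
    if s == 0:
        raise ValueError("no type occurs exactly twice")
    return 1 / s
```

For the toy text `"a b a c b a".split()`, there are 6 tokens with frequencies 3, 2 and 1, so Simpson's D is (3·2 + 2·1 + 1·0) / (6·5) = 8/30. One of the three types occurs exactly twice, so Sichel's S is 1/3.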
Measures of dispersion:
DP/DPnorm and KL-divergence require an additional parameter (the number of parts into which the text is split), so they are not computed by the command-line script.
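To illustrate why the number of parts matters, here is a minimal sketch of DP (deviation of proportions) and its normalized variant for a single word. It follows the usual definition: DP is half the sum of absolute differences between the observed share of the word's occurrences in each part and the share expected from the part sizes, and DPnorm divides DP by one minus the smallest expected share. The way the text is split and the restriction to a single word are assumptions of this sketch; the toolbox may split and aggregate differently.

```python
from typing import List, Sequence, Tuple


def split_into_parts(tokens: Sequence[str], n_parts: int) -> List[Sequence[str]]:
    """Split the token sequence into n_parts contiguous parts of nearly equal size."""
    n = len(tokens)
    return [tokens[i * n // n_parts:(i + 1) * n // n_parts] for i in range(n_parts)]


def dp_and_dp_norm(tokens: Sequence[str], word: str, n_parts: int) -> Tuple[float, float]:
    """Return (DP, DPnorm) for one word, given the number of parts."""
    if not 2 <= n_parts <= len(tokens):
        raise ValueError("n_parts must be between 2 and the number of tokens")
    frequency = sum(1 for token in tokens if token == word)
    if frequency == 0:
        raise ValueError(f"{word!r} does not occur in the text")
    parts = split_into_parts(tokens, n_parts)
    # Expected share: relative size of each part.
    expected = [len(part) / len(tokens) for part in parts]
    # Observed share: fraction of the word's occurrences that fall into each part.
    observed = [sum(1 for token in part if token == word) / frequency for part in parts]
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    dp_norm = dp / (1 - min(expected))
    return dp, dp_norm
```

With two parts, a word that occurs only in the first half of the text gets DP = 0.5 and DPnorm = 1.0, while a word spread evenly over both halves gets 0.0 for both. Choosing a different number of parts generally changes the values, which is why the parameter has to be supplied explicitly.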
Language-specific measures relying on a list of part-of-speech tags that indicate punctuation, see language definition files:
These measures rely on language-specific information (lists of part-of-speech tags that indicate open word classes and proper names and lists of the most common word-tag pairs in a reference corpus), see language definition files.
Language-independent measures:
Language-dependent measures (defined for the German NEGRA parsing scheme):
Some complexity measures (e.g. lexical density and rarity) require language-specific information that needs to be provided by language definition files. For German and English, the built-in language definition files will be used automatically (as long as you indicate the language via the --lang option). For other languages (--lang other), you need to provide the language definition files yourself.
Language definition files are in JSON format and contain the following information:
language: Language code
punctuation: List of language-specific part-of-speech tags used for punctuation (column XPOS in CoNLL-U format)
proper_names: List of language-specific part-of-speech tags used for proper names
open_classes: List of language-specific part-of-speech tags used for open word classes (including proper names)
most_common: List of the most frequent content words (excluding proper names) and their part-of-speech tags; for German and English, we use the 5,000 most frequent words according to the COW frequency lists
Here is an excerpt from the German language definition file (omitting most of the 5,000 most common content words):
{"language": "de",
"punctuation": ["$.", "$,", "$("],
"proper_names": ["NE"],
"open_classes": ["ADJA", "ADJD", "ITJ", "NE", "NN", "TRUNC", "VVFIN", "VVIMP", "VVINF", "VVIZU", "VVPP"],
"most_common": [["gibt", "VVFIN"],
["gut", "ADJD"],
["Zeit", "NN"],
…
["Fahrzeugen", "NN"],
["Kopie", "NN"],
["Merkmale", "NN"]
]
}
Here is an excerpt from the English language definition file (omitting most of the 5,000 most common content words). Note that the part-of-speech tags for punctuation look like punctuation symbols, but the list contains POS tags, not punctuation symbols:
{"language": "en",
"punctuation": [".", ",", ":", "\"", "``", "(", ")", "-LRB-", "-RRB-"],
"proper_names": ["NNP", "NNPS"],
"open_classes": ["AFX", "JJ", "JJR", "JJS", "NN", "NNS", "RB", "RBR", "RBS", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
"most_common": [["is", "VBZ"],
["be", "VB"],
["was", "VBD"],
…
["statistical", "JJ"],
["appearing", "VBG"],
["recipes", "NNS"]
]
}
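When you write a language definition file for another language, it can be useful to check it before passing it to the script via --lang other --lang-def <file>. The following sketch checks only what is described above: the five keys, their basic types, the shape of the most_common entries, and that proper-name tags are also listed as open-class tags. The function name `check_language_definition` is hypothetical, and the toolbox may apply different or additional checks.

```python
import json
from typing import Any, Dict, List

# Keys and basic types as described in this section.
REQUIRED_KEYS = {
    "language": str,
    "punctuation": list,
    "proper_names": list,
    "open_classes": list,
    "most_common": list,
}


def check_language_definition(definition: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in definition:
            problems.append(f"missing key: {key}")
        elif not isinstance(definition[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    if problems:
        return problems
    # Each entry of most_common is a [word, tag] pair of strings.
    for entry in definition["most_common"]:
        if not (isinstance(entry, list) and len(entry) == 2
                and all(isinstance(item, str) for item in entry)):
            problems.append(f"most_common entry is not a [word, tag] pair: {entry!r}")
            break
    # Open word classes are described as including proper names.
    missing = sorted(set(definition["proper_names"]) - set(definition["open_classes"]))
    if missing:
        problems.append(f"proper-name tags not listed in open_classes: {missing}")
    return problems


def check_language_definition_file(path: str) -> List[str]:
    """Load a JSON file and run the checks above."""
    with open(path, encoding="utf-8") as handle:
        return check_language_definition(json.load(handle))
```

An empty result does not guarantee that the tags match your tagset. It only catches structural mistakes such as a misspelled key or a most_common entry that lacks its tag.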