tsproisl / textcomplexity

Linguistic and stylistic complexity measures for (literary) texts
GNU General Public License v3.0

Question about using the package #2

Open dalone2 opened 3 years ago

dalone2 commented 3 years ago

Hello tsproisl,

Thank you for developing this comprehensive package for text complexity! This is not an issue with the package, but rather a question. I have very limited experience with Linux and the command line, and was wondering if you could give me an example of using the package from a Jupyter notebook? I have my text file in CoNLL-U format and saw the cli file, but I'm very unfamiliar with the command-line arguments and just couldn't make it work myself. Thank you!

tsproisl commented 3 years ago

Unfortunately, there is no documentation yet on how to use the module from Python. The code in cli.py implements the command line interface and could serve as inspiration. I put together a minimal example that shows how to import a text file in CoNLL-U format and how to compute a surface-based (from surface.py), a sentence-based (from sentence.py), and a dependency-based (from dependency.py) measure. I hope this helps you get started! Feel free to ask if you encounter any problems.

import itertools

from textcomplexity import surface, sentence, dependency
from textcomplexity.utils import conllu

filename = "goethe_werther.conllu"

# Read the CoNLL-U file; each sentence yields its tokens, its tagged
# tokens, and its dependency graph. Punctuation is ignored.
with open(filename, encoding="utf-8") as f:
    tokens, tagged, graphs = zip(*conllu.read_conllu_sentences(f, ignore_punct=True, punct_tags=set(["PUNCT"])))
    tokens = list(itertools.chain.from_iterable(tokens))

# Most surface-based measures are not length independent. Therefore,
# it is better to compute these measures on windows of fixed size and
# to use the mean.
mean_ttr, ci_ttr, scores_ttr = surface.bootstrap(surface.type_token_ratio, tokens, window_size=1000)

# Sentence-based measures and dependency-based measures operate on
# individual sentences, i.e. no bootstrap is needed.
mean_sentence_length, stdev_sentence_length = sentence.sentence_length_words(tagged)
mean_add, stdev_add = dependency.average_dependency_distance(graphs)

print(f"Mean type-token ratio (computed on windows of 1000 tokens): {mean_ttr:.4f}")
print(f"Mean sentence length: {mean_sentence_length:.4f}")
print(f"Mean average dependency distance: {mean_add:.4f}")
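To see concretely why a measure like the type-token ratio is not length-independent (the motivation for computing it on fixed-size windows in the example above), here is a small self-contained sketch in plain Python. It does not use textcomplexity; the token list and the helper functions are made up for illustration:

```python
from statistics import mean

def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def windowed_ttr(tokens, window_size):
    """Mean TTR over non-overlapping windows of a fixed size."""
    windows = [tokens[i:i + window_size]
               for i in range(0, len(tokens) - window_size + 1, window_size)]
    return mean(type_token_ratio(w) for w in windows)

# A toy "text": repeating a small vocabulary makes the global TTR
# shrink as the text grows, while the windowed TTR stays comparable.
short_text = ["the", "cat", "sat", "on", "the", "mat"]
long_text = short_text * 50

print(type_token_ratio(short_text))  # 5 types / 6 tokens ≈ 0.833
print(type_token_ratio(long_text))   # 5 types / 300 tokens ≈ 0.017
print(windowed_ttr(long_text, 6))    # per-window mean ≈ 0.833 again
```

Because the long text reuses the same vocabulary, its global TTR collapses, while the windowed mean remains comparable to the short text. This is the same idea behind surface.bootstrap, which additionally resamples windows to give a mean and a confidence interval.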