tsproisl / textcomplexity

Linguistic and stylistic complexity measures for (literary) texts
GNU General Public License v3.0

Issue using textcomplexity #1

Open li-xuyang28 opened 3 years ago

li-xuyang28 commented 3 years ago

Hi,

Thank you for developing such a comprehensive resource for complexity computation. I am trying the package but ran into the following error:

Traceback (most recent call last):
  File "/usr/local/bin/txtcomplexity", line 12, in <module>
    textcomplexity.cli.main()
  File "/usr/local/lib/python3.7/dist-packages/textcomplexity/cli.py", line 147, in main
    results.extend(surface_based(tokens, args.window_size, args.all_measures))
  File "/usr/local/lib/python3.7/dist-packages/textcomplexity/cli.py", line 60, in surface_based
    mean, stdev, _ = surface.bootstrap(measure, tokens, window_size, strategy="spread")
  File "/usr/local/lib/python3.7/dist-packages/textcomplexity/surface.py", line 478, in bootstrap
    for window in windows.disjoint_windows(tokens, window_size, strategy):
  File "/usr/local/lib/python3.7/dist-packages/textcomplexity/utils/windows.py", line 30, in disjoint_windows
    assert window_size <= text_length
AssertionError

I used run_stanza.py to prepare the CoNLL-U file. Could you please suggest what I might be doing wrong?

Best

tsproisl commented 3 years ago

Thank you. Did you use the txtcomplexity script? Since most of the measures are dependent on the length of the input text, the script divides each text into parts (“windows”) of a given length (the default is 1000 words), computes the complexity measures for each part, and outputs the mean (this makes the values comparable between texts). When a text is split into windows, any remaining words are discarded. For example, using a window size of 500, a text of 4562 words is split into 9 windows and the remaining 62 words are discarded.

The error message suggests that the length of your input text is smaller than the window size. Since only “full” windows are taken into account, the script cannot compute the complexity measures. The easiest way to work around this problem is to set a smaller window size using the --window-size option. For example, to set the window size to 500 words:

txtcomplexity --input-format conllu --window-size 500 <file>
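
The windowing arithmetic described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the package's actual implementation: the names `window_counts` and `split_into_windows` are hypothetical, and the script's "spread" strategy may choose differently which words to discard.

```python
def window_counts(text_length, window_size):
    """Return (number of full windows, number of discarded words)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return divmod(text_length, window_size)


def split_into_windows(tokens, window_size):
    """Split tokens into contiguous full windows; leftover tokens are dropped.

    Assumption: windows are taken from the start of the text. The real
    script may place them differently.
    """
    n_windows, _ = window_counts(len(tokens), window_size)
    if n_windows == 0:
        # This is the situation behind the AssertionError in the traceback.
        raise ValueError(
            f"text has {len(tokens)} tokens, fewer than window size {window_size}"
        )
    return [
        tokens[i * window_size:(i + 1) * window_size]
        for i in range(n_windows)
    ]


# The example from the explanation: 4562 words, window size 500.
print(window_counts(4562, 500))  # (9, 62)
```

A text shorter than the window size yields zero full windows, which is why a smaller `--window-size` is needed for short inputs.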

a-milenkin commented 3 years ago

I get an error after running: txtcomplexity --input-format conllu test_text.txt

[image: screenshot of the error message]

The sentence in the test_text.txt file was taken from the example:

1   Das ART 3   NK  (TOP(S(NP*
2   fremde  ADJA    3   NK  *
3   Schiff  NN  4   SB  *)
4   war VAFIN   -1  --  *
5   nicht   PTKNEG  6   NG  (AVP*
6   allein  ADV 4   MO  *)
7   .   $.  6   --  *))

Why?

tsproisl commented 3 years ago

The input you are using is actually not an example of the CoNLL-U format but of the custom tsv format (I've tried to make this clearer in the README). This means you should use txtcomplexity --input-format tsv test_text.txt instead. Unfortunately, there was a bug in the function that reads the custom tsv input, but I've just released version 0.9.1 with a fix, so please update your installation. Also note that simply using the two example sentences from the README will lead to another error message because the text is shorter than the default window size. For testing purposes, you could adjust the window size (for real applications, you would use a much larger window size):

txtcomplexity -i tsv --window-size 12 test_text.txt

And here are the contents of test_text.txt:

1   Das ART 3   NK  (TOP(S(NP*
2   fremde  ADJA    3   NK  *
3   Schiff  NN  4   SB  *)
4   war VAFIN   -1  --  *
5   nicht   PTKNEG  6   NG  (AVP*
6   allein  ADV 4   MO  *)
7   .   $.  6   --  *))

1   Sieben  CARD    2   NK  (TOP(S(NP*
2   weitere ADJA    3   MO  *)
3   begleiteten VVFIN   -1  --  *
4   es  PPER    3   OA  *
5   .   $.  4   --  *))
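
As a quick sanity check before choosing a window size, the tokens in such a file can be counted. Here is a minimal sketch, assuming one token per non-empty line and a blank line between sentences, as in the example above. `count_sentences_and_tokens` is a hypothetical helper, not part of textcomplexity.

```python
def count_sentences_and_tokens(lines):
    """Count sentences and tokens in the custom tsv format.

    Assumptions: every non-empty line is one token, and sentences are
    separated by one or more blank lines.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            if not in_sentence:
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return sentences, tokens


example = """\
1\tDas\tART\t3\tNK\t(TOP(S(NP*
2\tfremde\tADJA\t3\tNK\t*
3\tSchiff\tNN\t4\tSB\t*)
4\twar\tVAFIN\t-1\t--\t*
5\tnicht\tPTKNEG\t6\tNG\t(AVP*
6\tallein\tADV\t4\tMO\t*)
7\t.\t$.\t6\t--\t*))

1\tSieben\tCARD\t2\tNK\t(TOP(S(NP*
2\tweitere\tADJA\t3\tMO\t*)
3\tbegleiteten\tVVFIN\t-1\t--\t*
4\tes\tPPER\t3\tOA\t*
5\t.\t$.\t4\t--\t*))
"""

print(count_sentences_and_tokens(example.splitlines()))  # (2, 12)
```

The two example sentences contain 12 tokens in total, so a window size of 12 yields exactly one full window.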

a-milenkin commented 3 years ago

Thanks for the previous answer!

But now I see a new error after this command: python3 run_stanza.py --language en --output-dir . to_conll.txt

[image: screenshot of the error message]

to_conll.txt contains this sample text:

"Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like)."

What is wrong? Where can I find the resources.json file for converting text to CoNLL-U format?

tsproisl commented 3 years ago

This problem seems to occur if you haven't used stanza before and there is no resources.json file yet. I've updated the script to check for the file and to download it if necessary.
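
For anyone who wants to perform a similar check by hand, it could look roughly like the sketch below. This is not the code from run_stanza.py. The helper names `resources_file` and `ensure_resources` are hypothetical, and the sketch assumes that stanza keeps resources.json in `~/stanza_resources` unless the `STANZA_RESOURCES_DIR` environment variable points elsewhere.

```python
import os
from pathlib import Path


def resources_file(resources_dir=None):
    """Return the expected location of stanza's resources.json."""
    if resources_dir is None:
        # Assumption: default location used by stanza, overridable
        # through the STANZA_RESOURCES_DIR environment variable.
        resources_dir = os.environ.get(
            "STANZA_RESOURCES_DIR", str(Path.home() / "stanza_resources")
        )
    return Path(resources_dir) / "resources.json"


def ensure_resources(language="en", resources_dir=None):
    """Download stanza's resources for `language` if resources.json is missing."""
    path = resources_file(resources_dir)
    if not path.exists():
        # Imported lazily so the path check works without stanza installed.
        import stanza

        # stanza.download fetches resources.json together with the models.
        stanza.download(language, model_dir=str(path.parent))
    return path
```

Calling `ensure_resources("en")` once before building a pipeline should avoid the missing-file error on a fresh installation, at the cost of a one-time download.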