tsproisl / textcomplexity

Linguistic and stylistic complexity measures for (literary) texts
GNU General Public License v3.0

TypeError #5

Open melissasunnivahill opened 9 months ago

melissasunnivahill commented 9 months ago

I've been attempting to run analyses from the textcomplexity library but keep getting the following error:

TypeError: UdToken.__new__() missing 9 required positional arguments: 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', and 'misc'

Here's a deeper look at what's happening:

!txtcomplexity -i conllu 'output.conllu'

Traceback (most recent call last):
  File "/usr/local/bin/txtcomplexity", line 12, in <module>
    textcomplexity.cli.main()
  File "/usr/local/lib/python3.10/dist-packages/textcomplexity/cli.py", line 194, in main
    sentences, graphs = zip(conllu.read_conllu_sentences(f, ignore_case=args.ignore_case))
  File "/usr/local/lib/python3.10/dist-packages/textcomplexity/utils/conllu.py", line 16, in read_conllu_sentences
    for sentence, sent_id in _read_conllu(f, ignore_case):
  File "/usr/local/lib/python3.10/dist-packages/textcomplexity/utils/conllu.py", line 66, in _read_conllu
    sentence.append(UdToken(fields))
TypeError: UdToken.__new__() missing 9 required positional arguments: 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', and 'misc'

Is this related to an error in my conllu file or how I'm using the textcomplexity library? Any help would be much appreciated! :)

tsproisl commented 9 months ago

Could you share the first couple of lines from your input file?

melissasunnivahill commented 9 months ago

Sure! Here is what the first few lines of my conllu file look like:

1 # # X XX 2 dep
2 Mixtures mixture VERB VBZ 2 ROOT
3

SPACE   _SP _   2   dep _   _

4 The the DET DT 6 det
5 next next ADJ JJ 6 amod
6 time time NOUN NN 2 npadvmod
7 you you PRON PRP 8 nsubj
8 are be AUX VBP 6 relcl
9 at at ADP IN 8 prep
10 the the DET DT 11 det
11 beach beach NOUN NN 9 pobj
12 , , PUNCT , 2 punct
13 pick pick VERB VB 2 conj
14 up up ADP RP 13 prt
15 a a DET DT 16 det
16 handful handful NOUN NN 13 dobj
17 of of ADP IN 16 prep
18 sand sand NOUN NN 17 pobj
19 . . PUNCT . 2 punct
20

tsproisl commented 9 months ago

I don’t know if GitHub messed with the formatting, but it seems like the third token is a newline character? The txtcomplexity tool assumes that token information is on a single line, i.e. it cannot deal with tokens that contain literal newline characters and therefore span multiple lines. Out of curiosity: Do you happen to know how that file was created?
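In case it helps with tracking down the offending tokens, here is a rough sketch of a check you could run on the file. It is not part of textcomplexity: the helper name `find_malformed_lines` and the sample data are made up for illustration, and it only assumes the standard CoNLL-U layout of 10 tab-separated fields per token line.

```python
EXPECTED_FIELDS = 10  # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC


def find_malformed_lines(lines):
    """Return (line_number, field_count) for every token line that does
    not consist of exactly 10 tab-separated fields.

    Hypothetical helper for illustration; not part of textcomplexity.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        # Empty lines separate sentences; lines starting with "#" are comments.
        if line == "" or line.startswith("#"):
            continue
        field_count = len(line.split("\t"))
        if field_count != EXPECTED_FIELDS:
            problems.append((number, field_count))
    return problems


# Made-up sample: two well-formed tokens, then a token whose FORM and LEMMA
# are literal newline characters, so it is spread over three physical lines.
sample = (
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_\n"
    "2\tbeach\tbeach\tNOUN\tNN\t_\t0\troot\t_\t_\n"
    "3\t\n\t\n\tSPACE\t_SP\t_\t2\tdep\t_\t_\n"
    "\n"
)

problems = find_malformed_lines(sample.splitlines(keepends=True))
for number, field_count in problems:
    print(f"line {number}: {field_count} fields instead of {EXPECTED_FIELDS}")
```

For your own file you could pass an open file object instead of the sample, e.g. `find_malformed_lines(open("output.conllu", encoding="utf-8"))`. Any line it reports is a candidate for a token that was split by a literal newline.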

melissasunnivahill commented 9 months ago

That makes sense, thanks! My PhD advisor wrote a Python program to convert txt files to conllu format; happy to add the code if you're interested in looking at it :)