Open janiemi opened 3 years ago
The issue seems to be structural attribute values containing tabs. The statistics query is using CWB's `tabulate` command, and when grouping by more than one attribute the values are separated by tabs. If the values also contain tabs, the result can't be parsed. I'm not sure if this can be solved while still using `tabulate`, so maybe a note about tabs in the readme will have to do for now, and some better error handling in the code of course.
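As a rough illustration of the "better error handling" idea, here is a minimal sketch that checks the number of tab-separated fields on each output line before using them. It is not the actual Korp backend code; `TabulateParseError` and `parse_tabulate_line` are hypothetical names, and the sample lines are made up.

```python
class TabulateParseError(ValueError):
    """Raised when a line does not split into the expected number of fields."""


def parse_tabulate_line(line, n_fields):
    """Split one tab-separated output line into exactly n_fields values.

    Hypothetical helper: a stray tab inside a value changes the field
    count, so we fail with a descriptive error instead of indexing blindly.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != n_fields:
        raise TabulateParseError(
            f"expected {n_fields} tab-separated fields, got {len(fields)}: {line!r}"
        )
    return fields


# Two attributes (say deprel and thread_title) with clean values.
print(parse_tabulate_line("DT\tSome thread title", 2))

# The same two attributes, but the title itself contains a tab.
try:
    parse_tabulate_line("DT\tSome thread\ttitle", 2)
except TabulateParseError as err:
    print("unparseable line:", err)
```

With a check like this, the endpoint could report which line was unparseable rather than surfacing a bare `IndexError`.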
Ok, thanks for the explanation. Apparently, we have avoided the issue by disallowing tabs in the values of structural attributes as well as positional ones. I didn’t notice any option in `cqp` to change the value separator of `tabulate`, so I suppose you can’t do more than what you suggested.
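For what it's worth, keeping tabs out of attribute values before encoding a corpus could look roughly like the sketch below. This is only an assumption about how one might do it, not a description of any existing pipeline; `sanitize_attribute_value` is a hypothetical name, and replacing each tab with a space is just one possible policy.

```python
def sanitize_attribute_value(value):
    """Replace every tab in an attribute value with a single space.

    Hypothetical pre-encoding step: tab-separated output can then be
    split unambiguously, because no value contains the separator.
    """
    return value.replace("\t", " ")


title = "Some thread\ttitle"
clean = sanitize_attribute_value(title)
print(repr(clean))
print("\t" in clean)
```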
The `/count` endpoint returns an `IndexError: list index out of range` when trying to search certain Flashback or Familjeliv subcorpora with (certain) `group_by` and `group_by_struct` parameters. For example: https://ws.spraakbanken.gu.se/ws/korp/v8/count?group_by=deprel&group_by_struct=thread_title&cqp=%3Cthread%3E+%5Bpos%20%3D%20%22DT%22%5D&corpus=FLASHBACK-DATOR&default_within=sentence&debug=true results in the following:

Does the corpus data perhaps contain something unexpected by `/count`? Anyway, I think it would be better if the code were able to handle that without such an internal-looking error.

I got the error with a number of different parameters, though I haven’t tried all combinations:
- `group_by`: `pos`, `deprel`, `msd`, `word`
- `group_by_struct`: `thread_title`, `text_username`; but not `forum_title`
- `cqp`: `[]`, `[pos="VB"]`, `[pos="DT"]`, `[msd=".*+.*"]`, but not `[pos="RO"]`; with or without anchoring to `<text>` or `<thread>`, but not when anchoring to `<forum>`
- `corpus`: `FLASHBACK-DATOR`, `FLASHBACK-HEM`, `FLASHBACK-POLITIK`, `FLASHBACK-SAMHALLE`, `FAMILJELIV-FORALDER`, `FAMILJELIV-KANSLIGA`; but not `FLASHBACK-LIVSSTIL`, `FLASHBACK-EKONOMI`, `FLASHBACK-FORDON`, `FLASHBACK-DROGER`, `FLASHBACK-KULTUR`, `FAMILJELIV-ALLMANNA-KROPP`, `FAMILJELIV-GRAVID`, `TWITTER`, `TWITTER-2015` (with `group_by_struct=user_username`), `WIKIPEDIA-SV` (with `group_by_struct=text_title`)

It would seem that larger corpora are more likely to cause the error, but that’s not completely consistent, at least if you only take token count into account. And I couldn’t get the error from other than Flashback and Familjeliv subcorpora.
(I came across this issue by accident when testing different combinations of statistics attributes in the frontend.)
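In case it helps with reproducing, the parameter combinations could be enumerated with a small script along these lines. This is just a sketch: `count_urls` is a hypothetical helper, the base URL and the fixed `default_within` and `debug` values are taken from the example request, and the script only builds the URLs without sending any requests.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "https://ws.spraakbanken.gu.se/ws/korp/v8/count"


def count_urls(corpora, group_by, group_by_struct, cqp_queries):
    """Yield one /count URL per combination of the given parameter values."""
    for corpus, gb, gbs, cqp in product(corpora, group_by, group_by_struct, cqp_queries):
        params = {
            "group_by": gb,
            "group_by_struct": gbs,
            "cqp": cqp,
            "corpus": corpus,
            "default_within": "sentence",
            "debug": "true",
        }
        # urlencode percent-encodes the CQP query (spaces become "+").
        yield f"{BASE_URL}?{urlencode(params)}"


for url in count_urls(
    corpora=["FLASHBACK-DATOR", "FLASHBACK-HEM"],
    group_by=["pos", "deprel"],
    group_by_struct=["thread_title"],
    cqp_queries=['<thread> [pos = "DT"]'],
):
    print(url)
```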