spraakbanken / korp-backend

Backend for Korp, Språkbanken's corpus search tool
https://spraakbanken.gu.se/eng/korp
MIT License

/count: List index out of range with at least certain Flashback and Familjeliv corpora #6

Open janiemi opened 3 years ago

janiemi commented 3 years ago

The /count endpoint returns an IndexError: list index out of range when searching certain Flashback or Familjeliv subcorpora with certain combinations of the group_by and group_by_struct parameters. For example, the request https://ws.spraakbanken.gu.se/ws/korp/v8/count?group_by=deprel&group_by_struct=thread_title&cqp=%3Cthread%3E+%5Bpos%20%3D%20%22DT%22%5D&corpus=FLASHBACK-DATOR&default_within=sentence&debug=true results in the following:

{
  "ERROR": {
    "type": "IndexError",
    "value": "list index out of range",
    "traceback": [
      "Traceback (most recent call last):",
      "  File \"/home/fkkorp/korp-backend/v8/korp.py\", line 223, in error_catcher",
      "    g(*pargs, **kwargs)",
      "  File \"/home/fkkorp/korp-backend/v8/korp.py\", line 213, in f",
      "    for response in generator(args, *pargs, **kwargs):",
      "  File \"/home/fkkorp/korp-backend/v8/korp.py\", line 1569, in count",
      "    if group_by[i][0] in split:",
      "IndexError: list index out of range"
    ]
  },
  "time": 26.713754177093506
}

Does the corpus data perhaps contain something unexpected by /count? Anyway, I think it would be better if the code were able to handle that without such an internal-looking error.

I got the error with a number of different parameters, though I haven’t tried all combinations:

Larger corpora seem more likely to trigger the error, but not consistently, at least judging by token count alone. I also couldn’t reproduce the error with anything other than Flashback and Familjeliv subcorpora.

(I came across this issue by accident when testing different combinations of statistics attributes in the frontend.)

MartinHammarstedt commented 1 year ago

The issue seems to be structural attribute values containing tabs. The statistics query uses CWB's tabulate command, and when grouping by more than one attribute, the values are separated by tabs. If the values themselves also contain tabs, the result can't be parsed. I'm not sure whether this can be solved while still using tabulate, so a note about tabs in the README will have to do for now, along with better error handling in the code, of course.
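
To illustrate the parsing problem and what the better error handling could look like, here is a minimal sketch. It is not the actual code from korp.py: the function name parse_tabulate_line and its parameters are hypothetical, and it only assumes that each output line holds one tab-separated value per requested attribute.

```python
from typing import List


def parse_tabulate_line(line: str, n_fields: int) -> List[str]:
    """Split one tab-separated output line into exactly n_fields values.

    Hypothetical helper, not part of korp-backend. It raises a
    descriptive ValueError when the field count is off, which is what
    happens when an attribute value itself contains a tab.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != n_fields:
        raise ValueError(
            f"expected {n_fields} tab-separated fields but got {len(fields)}; "
            "an attribute value may contain a tab character"
        )
    return fields


# Two attributes requested, clean values: parses as expected.
ok = parse_tabulate_line("DT\tSome thread title\n", 2)

# Same request, but the second value contains a tab: three fields come back.
try:
    parse_tabulate_line("DT\tSome\tthread title\n", 2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

A value with an embedded tab yields more fields than attributes were requested, which would be consistent with an index running past the end of the attribute list as in the traceback above. Checking the count up front lets the endpoint report a readable message instead of a bare IndexError.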

janiemi commented 1 year ago

OK, thanks for the explanation. Apparently, we have avoided the issue by disallowing tabs in the values of structural attributes as well as positional ones. I didn’t notice any option in CQP to change the value separator of tabulate, so I suppose you can’t do more than what you suggested.
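
For anyone preparing corpus data, one way to keep tabs out of attribute values is to replace them before encoding. The following is only a sketch of that idea, not necessarily how we do it in our own pipeline; the function name sanitize_attribute_value and the choice of a space as the replacement are assumptions.

```python
def sanitize_attribute_value(value: str, replacement: str = " ") -> str:
    """Return value with every tab character replaced.

    Hypothetical helper: meant to be applied to attribute values before
    the corpus is encoded, so that tab-separated query output stays
    unambiguous.
    """
    return value.replace("\t", replacement)


# A thread title with an embedded tab becomes safe to use as a value.
clean_title = sanitize_attribute_value("Some\tthread title")
```

The replacement character is a design choice: a space keeps titles readable, whereas a visible placeholder would make it easier to see afterwards where the source data had tabs.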