spraakbanken / korp-frontend

Frontend for Korp, a tool using the IMS Open Corpus Workbench (CWB).
https://spraakbanken.gu.se/en/tools/korp
MIT License

Comparison view treats multi-word value as multiple tokens #385

Open arildm opened 2 months ago

arildm commented 2 months ago
  1. With the Svenska partiprogram och valmanifest (vivill) corpus selected, save two searches for comparison
  2. Compare the searches using the parti attribute
  3. Click any of the multi-word party names, e.g. Folkpartiet liberalerna
  4. Expected: Some results
  5. Actual: No results

Apparently, the API request has cqp2=[_.text_party_name = "Folkpartiet"] [_.text_party_name = "liberalerna"]
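
To illustrate the difference, here is a minimal sketch that builds both the per-word query seen in the request and a single-token query carrying the whole value, which is presumably what is wanted here. The helpers `escapeCqpValue`, `cqpPerWord` and `cqpWholeValue` are hypothetical names, not functions from korp-frontend, and the escaping is an assumption about what a CQP string literal needs.

```typescript
// Hypothetical helpers for illustration; not part of korp-frontend.

/** Escape regex metacharacters and double quotes (assumed sufficient for a CQP string literal). */
function escapeCqpValue(value: string): string {
    return value.replace(/[.*+?^${}()|[\]\\"]/g, "\\$&");
}

/** One token constraint per space-separated word: the behaviour reported above. */
function cqpPerWord(attr: string, value: string): string {
    return value
        .split(" ")
        .map((word) => `[${attr} = "${escapeCqpValue(word)}"]`)
        .join(" ");
}

/** A single token constraint carrying the whole value, spaces included. */
function cqpWholeValue(attr: string, value: string): string {
    return `[${attr} = "${escapeCqpValue(value)}"]`;
}

const attr = "_.text_party_name";
const value = "Folkpartiet liberalerna";

// [_.text_party_name = "Folkpartiet"] [_.text_party_name = "liberalerna"]
const perWord = cqpPerWord(attr, value);

// [_.text_party_name = "Folkpartiet liberalerna"]
const whole = cqpWholeValue(attr, value);
```

The per-word form asks for two consecutive tokens whose text has a one-word party name, which cannot match; the whole-value form asks for one token in a text whose party name is the full string.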

arildm commented 2 months ago

The backend /loglike response doesn't distinguish a multi-word value from multiple tokens. Compare these calls:

"han"+verb vs. "hon"+verb by sense: Space in string separates tokens

{ "loglike": {
  "hon..1:-1.000 vara..1:-1.000": 2375.04,
  "han..1:-1.000 vara..1:-1.000": -1774.16,
  "hon..1:-1.000 skola..4:-1.000": 1062.87,
  // ...

"frihet" vs. "jämlikhet" by party: Space in string does not separate tokens

{ "loglike": {
  "Feministiskt initiativ": 78.12,
  "V\u00e4nsterpartiet": 74.7,
  "Moderaterna": -73.75,
  // ...

Perhaps we can interpret the string value as one or more tokens depending on the input queries (set1_cqp and set2_cqp)? But changing the response format would probably be a more robust approach.
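
A rough sketch of the first idea, assuming the number of tokens in the input query tells us how to read the key. `countQueryTokens` and `splitLoglikeKey` are hypothetical names, the example queries are assumed rather than taken from the actual requests, and the token counter is deliberately naive (it ignores brackets inside quoted values and repetition operators).

```typescript
// Hypothetical sketch; not the korp-frontend implementation.

/** Naively count `[...]` token expressions in a CQP query. */
function countQueryTokens(cqp: string): number {
    const matches = cqp.match(/\[[^\]]*\]/g);
    return matches ? matches.length : 0;
}

/**
 * Interpret a /loglike key as one value per query token.
 * With a single-token query, the whole key is one (possibly multi-word) value.
 */
function splitLoglikeKey(key: string, tokenCount: number): string[] {
    if (tokenCount <= 1) return [key];
    return key.split(" ");
}

// Assumed example queries for the two calls compared above.
const twoTokenQuery = '[word = "hon"] [pos = "VB"]';
const oneTokenQuery = '[word = "frihet"]';

const bySense = splitLoglikeKey("hon..1:-1.000 vara..1:-1.000", countQueryTokens(twoTokenQuery));
const byParty = splitLoglikeKey("Feministiskt initiativ", countQueryTokens(oneTokenQuery));
```

This stays ambiguous when a multi-token query is compared by an attribute with multi-word values, since the key alone cannot say which spaces separate tokens. That is the case a changed response format would handle.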

arildm commented 2 months ago

This is where the string in the response is split on whitespace: https://github.com/spraakbanken/korp-frontend/blob/38534b82a7902cc5e56a67844485505ffe0f767e/app/scripts/services/backend.ts#L125