warelab / gramene-solr

Apache License 2.0

wrong counts on some suggestions #7

Closed ajo2995 closed 8 years ago

ajo2995 commented 8 years ago

For example, the gene OS06G0133000 is counted twice in the "WAXY" suggestion.

This is because it has a name "WAXY" and a synonym "Waxy". The facet counting on the _terms field used in suggestions/genes.js sees these two labels as distinct, and reports n genes with the "WAXY" term and m genes with the "Waxy" term. The suggestion code then combines them into a single "WAXY" suggestion with n+m genes, so any gene that carries both labels is counted twice.
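A minimal sketch of the double counting, not the actual code in suggestions/genes.js. The facet counts and the `mergeByCase` helper are hypothetical, and the sketch assumes OS06G0133000 is the only gene carrying either label:

```javascript
// Hypothetical facet counts on _terms: one entry per distinct label.
// OS06G0133000 has name "WAXY" and synonym "Waxy", so the same gene
// contributes 1 to each of the two labels.
const facetCounts = { WAXY: 1, Waxy: 1 };

// Naive merge (assumed, for illustration): labels that match
// case-insensitively are folded into the first label seen, and
// their counts are summed.
function mergeByCase(counts) {
  const merged = new Map(); // lowercase label -> { label, count }
  for (const [label, n] of Object.entries(counts)) {
    const key = label.toLowerCase();
    const entry = merged.get(key) || { label, count: 0 };
    entry.count += n;
    merged.set(key, entry);
  }
  return [...merged.values()];
}

const suggestions = mergeByCase(facetCounts);
console.log(suggestions); // one "WAXY" suggestion reporting 2 genes, although only 1 gene exists
```

Summing facet counts is only safe when no document contributes to more than one of the merged labels, which is exactly what fails here.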

One possible solution is to build the _terms field directly in genes/mongo2solr.js instead of letting Solr populate it as a copyField from the name, id, synonyms, xrefs, and so on. The script would then have to do the case normalization itself, and the suggestions would all appear in lower case rather than in their original case.

Alternatively, we can let WAXY and Waxy be separate suggestions. This is much easier to do, but the user is forced to decide which suggestion to click on.

ajo2995 commented 8 years ago

Going with a little of both approaches.

  1. Create a unique _terms field for each gene by keeping whichever version of a term is seen first in the gene doc (the key is the lowercased term, the value is the term in its original case). The name field is added before the synonym field, and both are added before the dbxrefs.
  2. Remove the logic that combines terms from different gene docs, so that no gene associated with a term is counted twice. There may still be multiple versions of a term among the suggestions, but at least the number of genes reported should be correct. Together, these two fixes should produce a single suggestion for waxy.
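The two steps can be sketched as follows. This is an illustration under assumptions, not the code in genes/mongo2solr.js: the doc shape (`name`, `synonyms`, `xrefs`), the helper names, and the `exampleSynonym` value are all hypothetical.

```javascript
// Step 1 (sketch): build a per-gene _terms list that is unique
// case-insensitively, keeping the first-seen original case.
function uniqueTerms(gene) {
  const seen = new Map(); // lowercased term -> original-case term
  // Order matters: name first, then synonyms, then xrefs.
  const candidates = [gene.name, ...(gene.synonyms || []), ...(gene.xrefs || [])];
  for (const term of candidates) {
    if (typeof term !== "string" || term === "") continue;
    const key = term.toLowerCase();
    if (!seen.has(key)) seen.set(key, term);
  }
  return [...seen.values()];
}

// Step 2 (sketch): count genes per exact term, with no merging
// of differently cased terms across gene docs.
function countGenesPerTerm(genes) {
  const counts = new Map(); // exact term -> number of genes
  for (const gene of genes) {
    for (const term of uniqueTerms(gene)) {
      counts.set(term, (counts.get(term) || 0) + 1);
    }
  }
  return counts;
}

// Hypothetical gene doc; only the id, name, and "Waxy" synonym come from the issue.
const gene = { _id: "OS06G0133000", name: "WAXY", synonyms: ["Waxy", "exampleSynonym"], xrefs: [] };
const terms = uniqueTerms(gene);
const counts = countGenesPerTerm([gene]);
console.log(terms, counts.get("WAXY"));
```

Because each gene now contributes at most one casing of a term, the per-term count equals the number of distinct genes. A different gene whose first-seen casing is "Waxy" would still yield a separate "Waxy" suggestion, which matches the caveat in step 2.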