warelab / gramene-solr

Apache License 2.0

Do not tokenize on dashes #4

Closed: ajo2995 closed this issue 8 years ago

ajo2995 commented 8 years ago

Some genes have their names only in the gene description, for example QSH-1 in Oryza sativa japonica (OS01G0848400).

This will require some analysis to see how well we can identify gene labels from free text. Genes with a non-ID name or a set of single-word synonyms can be used as a test set.

mycrobe commented 8 years ago

It seems to me the correct approach to this is:

  1. Fix the source data. This is not our problem.
  2. See point 1.
  3. Make this search work as expected by not tokenizing on the '-'.

mycrobe commented 8 years ago

We should NOT attempt to extract IDs from free-text fields. That way lies madness.

ajo2995 commented 8 years ago

That sounds tractable. I'll test-drive another text field type, the one with this comment:

"Can insert dashes in the wrong place and still match"
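
To make the difference concrete, here is a rough Python sketch of the two behaviours discussed above. It is not Solr's analysis chain, only an illustration: the function names and the sample description are made up, and "squashing" punctuation out of tokens is an assumed stand-in for whatever the field type actually does.

```python
import re

def tokenize_on_dashes(text):
    # Current behaviour (as described in this issue): '-' acts as a token boundary.
    return [t.lower() for t in re.split(r"[\s\-]+", text) if t]

def tokenize_keep_dashes(text):
    # Proposed behaviour: split on whitespace only, so "QSH-1" stays one token.
    return [t.lower() for t in text.split()]

def squash(token):
    # Assumption: drop every non-alphanumeric character before comparing,
    # so a dash in the wrong place (or no dash at all) still matches.
    return re.sub(r"[^0-9a-z]", "", token.lower())

def matches(query, text):
    # True if the squashed query equals any squashed whitespace token.
    return squash(query) in {squash(t) for t in tokenize_keep_dashes(text)}

description = "similar to QSH-1"  # hypothetical description text
print(tokenize_on_dashes(description))
print(tokenize_keep_dashes(description))
print(matches("QS-H1", description), matches("QSH-2", description))
```

With dash tokenization the name falls apart into "qsh" and "1", which is why a search for QSH-1 behaves unexpectedly; keeping the token whole and comparing squashed forms lets QSH-1, QSH1 and QS-H1 all hit the same gene.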

ajo2995 commented 8 years ago

I could also split descriptions on whitespace and add words that have at least 1 digit and 1 letter to the _terms field so they get into suggestions.
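
A minimal sketch of that filter in Python, assuming the description is a plain string. The function name `candidate_terms`, the punctuation trimming, and the sample description are illustrative assumptions, not part of the loader.

```python
def candidate_terms(description):
    """Return words containing at least one digit and one letter."""
    terms = []
    for word in description.split():  # split on whitespace only, dashes are kept
        # Assumption: trim punctuation that commonly wraps a word in prose.
        word = word.strip(".,;:()[]")
        has_digit = any(c.isdigit() for c in word)
        has_letter = any(c.isalpha() for c in word)
        if has_digit and has_letter and word not in terms:
            terms.append(word)
    return terms

# Hypothetical description text, for illustration only.
print(candidate_terms("Similar to QSH-1 protein (fragment 2)."))
```

The returned words would then be appended to whatever list feeds the _terms field. Plain numbers and ordinary words are skipped, so only ID-like tokens such as QSH-1 reach the suggestions.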