renaud / neuroNER

named entity recognizer for neuronal cells, based on UIMA Ruta rules
GNU Lesser General Public License v3.0

only keep a single ontology term corresponding to the same span of text #44

Closed (stripathy closed this issue 8 years ago)

stripathy commented 8 years ago

e.g. "Lateral hypothalamus GAD65-GFP low-threshold spiking neurons" gives:

[0, 7, 'Lateral', u'UNKN_REGION:15']
[8, 20, 'hypothalamus', u'UNKN_REGION:5443']
[8, 20, 'hypothalamus', u'ABA_REGION:1097']
[21, 27, 'GAD65-', u'NCBI_GENE:14417']
[27, 30, 'GFP', u'Missing']
[31, 44, 'low-threshold', u'HBP_EPHYS:0000110']
[45, 52, 'spiking', u'HBP_EPHYS_TRIGGER:0000003']
[53, 60, 'neurons', u'NeuronTrigger']

and

"medial prefrontal cortex stimulated non-fast spiking interneuron in basolateral amygdala" gives:

[0, 24, 'medial prefrontal cortex', u'UNKN_REGION:491']
[25, 35, 'stimulated', u'Missing']
[36, 44, 'non-fast', u'HBP_EPHYS:0000090']
[40, 44, 'fast', u'HBP_EPHYS:0000080']
[45, 52, 'spiking', u'HBP_EPHYS_TRIGGER:0000003']
[53, 64, 'interneuron', u'NeuronTrigger']
[65, 67, 'in ', 'Missing']
[68, 79, 'basolateral', u'UNKN_REGION:24']
[80, 88, 'amygdala', u'ABA_REGION:278']

renaud commented 8 years ago
  1. keep the largest annotation first
  2. then, for annotations on the same span, apply a hierarchy:
    1. Layers
    2. ABA regions
    3. unknown regions

another cheap solution is to eliminate duplicates in the lexical resources (e.g. layer V)

all of these can be dealt with in the neuroNER Python code
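
A minimal sketch of how the two rules from the list (largest annotation first, then the Layers > ABA regions > unknown regions hierarchy) could be applied in Python. The function names, the `HBP_LAYER` prefix, and the `[start, end, text, term]` list format are assumptions for illustration, not neuroNER's actual code:

```python
# Lower rank wins when two annotations cover the same span.
# "HBP_LAYER" is an assumed prefix for layer terms; adjust to the real one.
PRIORITY = {"HBP_LAYER": 0, "ABA_REGION": 1, "UNKN_REGION": 2}
DEFAULT_RANK = len(PRIORITY)


def rank(term_id):
    """Rank a term by its ontology prefix (the part before ':')."""
    prefix = term_id.split(":", 1)[0]
    return PRIORITY.get(prefix, DEFAULT_RANK)


def resolve_overlaps(annotations):
    """Keep at most one annotation for any span of text.

    Each annotation is [start, end, text, term_id].
    """
    # Longest span first, then ontology priority, then start offset.
    ordered = sorted(
        annotations,
        key=lambda a: (-(a[1] - a[0]), rank(a[3]), a[0]),
    )
    kept = []
    for ann in ordered:
        start, end = ann[0], ann[1]
        # Accept only if it does not overlap anything already kept.
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(ann)
    # Return in reading order.
    return sorted(kept, key=lambda a: a[0])


if __name__ == "__main__":
    example = [
        [0, 7, "Lateral", "UNKN_REGION:15"],
        [8, 20, "hypothalamus", "UNKN_REGION:5443"],
        [8, 20, "hypothalamus", "ABA_REGION:1097"],
    ]
    for ann in resolve_overlaps(example):
        print(ann)
```

On the first example this drops the `UNKN_REGION:5443` duplicate for "hypothalamus" and keeps `ABA_REGION:1097`; on the second, "fast" (40-44) is dropped because the longer "non-fast" (36-44) is accepted first.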