renaud / neuroNER

named entity recognizer for neuronal cells, based on UIMA Ruta rules
GNU Lesser General Public License v3.0

weird capitalization effects for neuroNER tagging #23

Closed: stripathy closed this 9 years ago

stripathy commented 9 years ago

For the query 'thick tufted pyramidal cell', 'thick tufted' is identified, but for the query 'Thick Tufted pyramidal cell', 'Thick Tufted' is NOT identified (only 'Tufted' is).

This seems to be a fairly general effect.

s = Sherlok()
annotations = list(s.annotate('neuroner', 'thick tufted pyramidal cell'))
for a in annotations:
    print a

(0, 12, 'thick tufted', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000014'})
(6, 12, 'tufted', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000031'})
(13, 22, 'pyramidal', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000001'})
(0, 27, 'thick tufted pyramidal cell', u'Neuron', {})
(6, 27, 'tufted pyramidal cell', u'Neuron', {})
(13, 27, 'pyramidal cell', u'Neuron', {})
(23, 27, 'cell', u'NeuronTrigger', {})
(0, 22, 'thick tufted pyramidal', u'PreNeuron', {})
(6, 22, 'tufted pyramidal', u'PreNeuron', {})
(13, 22, 'pyramidal', u'PreNeuron', {})

s = Sherlok()
annotations = list(s.annotate('neuroner', 'Thick Tufted pyramidal cell'))
for a in annotations:
    print a

(6, 12, 'Tufted', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000031'})
(13, 22, 'pyramidal', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000001'})
(6, 27, 'Tufted pyramidal cell', u'Neuron', {})
(13, 27, 'pyramidal cell', u'Neuron', {})
(23, 27, 'cell', u'NeuronTrigger', {})
(6, 22, 'Tufted pyramidal', u'PreNeuron', {})
(13, 22, 'pyramidal', u'PreNeuron', {})

stripathy commented 9 years ago

Another example: for 'Double Bouquet cell', only 'Bouquet' is identified, not 'Double Bouquet'.

s = Sherlok()
annotations = list(s.annotate('neuroner', 'Double Bouquet cell'))
for a in annotations:
    print a

(7, 14, 'Bouquet', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000004'})
(7, 19, 'Bouquet cell', u'Neuron', {})
(15, 19, 'cell', u'NeuronTrigger', {})
(7, 14, 'Bouquet', u'PreNeuron', {})

Coverage for some other terms (dashes mark the characters covered by an annotation):

Layer VI Double Bouquet Cell
--------        ------------ coverage= 0.714285714286
neuron expressing Vasoactive Intestinal Peptide
------                                          coverage= 0.127659574468
neuron expressing Neuropeptide Y
-------------------------------- coverage= 1.0
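
The coverage figures above are consistent with the fraction of characters covered by at least one annotation (for example 20 of 28 characters in the first term). A minimal sketch of that calculation follows; the helper names `merge_spans`, `coverage`, and `underline` are hypothetical and not part of Sherlok or neuroNER, and the spans are assumed values chosen to match the dashes shown for the first term.

```python
def merge_spans(spans):
    """Merge overlapping or touching (begin, end) spans into disjoint ones."""
    merged = []
    for begin, end in sorted(spans):
        if merged and begin <= merged[-1][1]:
            # Extend the previous span instead of counting characters twice.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def coverage(text, spans):
    """Fraction of characters in `text` covered by at least one span."""
    if not text:
        return 0.0
    covered = sum(end - begin for begin, end in merge_spans(spans))
    return float(covered) / len(text)


def underline(text, spans):
    """Build a line with dashes under covered characters, spaces elsewhere."""
    marks = [" "] * len(text)
    for begin, end in merge_spans(spans):
        for i in range(begin, end):
            marks[i] = "-"
    return "".join(marks)


text = "Layer VI Double Bouquet Cell"
# Assumed spans: 'Layer VI', 'Bouquet', 'Bouquet Cell', 'Cell' ('Double' is missed).
spans = [(0, 8), (16, 23), (16, 28), (24, 28)]

print(text)
print("{} coverage= {}".format(underline(text, spans), coverage(text, spans)))
```

Merging the spans first matters because neuroNER returns nested annotations (for example 'Bouquet' inside 'Bouquet Cell'), which would otherwise be counted more than once.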

renaud commented 9 years ago

Sorry, I can't reproduce this. I get the same output regardless of capitalization.

annotations = list(s.annotate('neuroner', 'Thick Tufted pyramidal cell'))
for a in annotations:
    print a
(0, 12, 'Thick Tufted', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000014'})
(6, 12, 'Tufted', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000031'})
(13, 22, 'pyramidal', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000001'})
(0, 27, 'Thick Tufted pyramidal cell', u'Neuron', {})
(6, 27, 'Tufted pyramidal cell', u'Neuron', {})
(13, 27, 'pyramidal cell', u'Neuron', {})
(23, 27, 'cell', u'NeuronTrigger', {})
(0, 22, 'Thick Tufted pyramidal', u'PreNeuron', {})
(6, 22, 'Tufted pyramidal', u'PreNeuron', {})
(13, 22, 'pyramidal', u'PreNeuron', {})

annotations = list(s.annotate('neuroner', 'thick tufted pyramidal cell'))
for a in annotations:
    print a
(0, 12, 'thick tufted', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000014'})
(6, 12, 'tufted', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000031'})
(13, 22, 'pyramidal', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000001'})
(0, 27, 'thick tufted pyramidal cell', u'Neuron', {})
(6, 27, 'tufted pyramidal cell', u'Neuron', {})
(13, 27, 'pyramidal cell', u'Neuron', {})
(23, 27, 'cell', u'NeuronTrigger', {})
(0, 22, 'thick tufted pyramidal', u'PreNeuron', {})
(6, 22, 'tufted pyramidal', u'PreNeuron', {})
(13, 22, 'pyramidal', u'PreNeuron', {})

annotations = list(s.annotate('neuroner', 'Double Bouquet cell'))
for a in annotations:
    print a
(0, 14, 'Double Bouquet', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000005'})
(7, 14, 'Bouquet', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000004'})
(0, 19, 'Double Bouquet cell', u'Neuron', {})
(7, 19, 'Bouquet cell', u'Neuron', {})
(15, 19, 'cell', u'NeuronTrigger', {})
(0, 14, 'Double Bouquet', u'PreNeuron', {})
(7, 14, 'Bouquet', u'PreNeuron', {})

annotations = list(s.annotate('neuroner', 'double bouquet cell'))
for a in annotations:
    print a
(0, 14, 'double bouquet', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000005'})
(7, 14, 'bouquet', u'Morphology', {u'ontologyId': u'HBP_MORPHOLOGY:0000004'})
(0, 19, 'double bouquet cell', u'Neuron', {})
(7, 19, 'bouquet cell', u'Neuron', {})
(15, 19, 'cell', u'NeuronTrigger', {})
(0, 14, 'double bouquet', u'PreNeuron', {})
(7, 14, 'bouquet', u'PreNeuron', {})
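
To compare two runs independently of the capitalization of the covered text, one option is to key each annotation by its offsets and type. This is a minimal sketch, assuming annotations are the 5-tuples printed in this thread; `annotation_key` and `missing_annotations` are hypothetical helper names, not part of Sherlok. The data is copied from the two 'Double Bouquet cell' outputs in this thread.

```python
def annotation_key(annotation):
    """Reduce an annotation tuple to (begin, end, type), ignoring the covered text."""
    begin, end, _text, ann_type, _attrs = annotation
    return (begin, end, ann_type)


def missing_annotations(expected, actual):
    """Return the sorted keys found in `expected` but not in `actual`."""
    expected_keys = {annotation_key(a) for a in expected}
    actual_keys = {annotation_key(a) for a in actual}
    return sorted(expected_keys - actual_keys)


# Output posted by renaud for 'Double Bouquet cell'.
expected = [
    (0, 14, 'Double Bouquet', 'Morphology', {'ontologyId': 'HBP_MORPHOLOGY:0000005'}),
    (7, 14, 'Bouquet', 'Morphology', {'ontologyId': 'HBP_MORPHOLOGY:0000004'}),
    (0, 19, 'Double Bouquet cell', 'Neuron', {}),
    (7, 19, 'Bouquet cell', 'Neuron', {}),
    (15, 19, 'cell', 'NeuronTrigger', {}),
    (0, 14, 'Double Bouquet', 'PreNeuron', {}),
    (7, 14, 'Bouquet', 'PreNeuron', {}),
]

# Output posted by stripathy for the same query.
actual = [
    (7, 14, 'Bouquet', 'Morphology', {'ontologyId': 'HBP_MORPHOLOGY:0000004'}),
    (7, 19, 'Bouquet cell', 'Neuron', {}),
    (15, 19, 'cell', 'NeuronTrigger', {}),
    (7, 14, 'Bouquet', 'PreNeuron', {}),
]

missing = missing_annotations(expected, actual)
for key in missing:
    print(key)
```

For this data the missing keys are the three annotations that start at offset 0, that is, everything that depends on 'Double Bouquet' being recognized.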

Could you run:

pip freeze | grep sherlok

I have sherlok==0.1.4
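
The same check can be done from inside Python. This is a minimal sketch, assuming Python 3.8 or later (the snippets in this thread are Python 2, where `importlib.metadata` is not available); `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed sherlok version, or None when the package is absent.
print(installed_version("sherlok"))
```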

renaud commented 9 years ago

Fixed with the new release: https://github.com/sherlok/sherlok/releases/tag/0.1.5