entity mention mismatch error when importing from BioC

stefan-it commented 3 years ago

Hi @lfurrer ,

thanks for open sourcing the bconv library :heart:

I'm currently trying to evaluate a self-trained ELECTRA model for French, and publicly available NER corpora for French are quite... limited.

I found the QUAERO French Medical Corpus, which is publicly available in both brat and BioC formats.

So I just downloaded and extracted the BioC annotations:

wget "https://quaerofrenchmed.limsi.fr/QUAERO_FrenchMed_BioC.zip" 
unzip QUAERO_FrenchMed_BioC.zip

to use it in bconv:

import bconv

conll = bconv.load("QUAERO_BioC/corpus/train/MEDLINE_train_bioc", fmt="bioc_xml")

But then the following error message is thrown:

AssertionError                            Traceback (most recent call last)
<ipython-input-3-221c9f1a0456> in <module>
----> 1 conll = bconv.load("QUAERO_BioC/corpus/train/MEDLINE_train_bioc", fmt="bioc_xml")

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/__init__.py in load(source, fmt, mode, id_, **options)
     75         fmt = _guess_format(source, LOADERS)
     76     loader = LOADERS[fmt](**options)
---> 77     return _load(loader, mode, source, id_)
     78 
     79 

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/__init__.py in _load(loader, mode, source, id_)
     82         content = loader.iter_documents(source)
     83     else:
---> 84         content = loader.load_one(source, id_)
     85 
     86     if hasattr(loader, 'document'):

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/_load.py in load_one(self, source, id_)
     50 
     51     def load_one(self, source, id_):
---> 52         return self.collection(source, id_)
     53 
     54     def collection(self, source, id_):

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/bioc.py in collection(self, source, id_)
     60         collection.metadata = self._meta_dict(coll_node)
     61         for doc in docs:
---> 62             collection.add_document(self._document(doc))
     63         return collection
     64 

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/bioc.py in _document(self, node)
     80             section = doc[-1]
     81             section.metadata = infon
---> 82             section.add_entities(anno)
     83             # Get infon elements on sentence level.
     84             for sent, sent_node in zip(section,

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/doc/document.py in add_entities(self, entities)
    149                 while entity.start >= sent.end:
    150                     sent = next(sentences)
--> 151                 sent.add_entities((entity,))
    152         except StopIteration:
    153             logging.warning('annotations outside character range')

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/doc/document.py in add_entities(self, entities)
    187         for entity in entities:
    188             term = self.text[entity.start-self.start:entity.end-self.start]
--> 189             assert entity.text == term, \
    190                 'entity mention mismatch: {} vs. {}'.format(entity.text, term)
    191             self.entities.append(entity)

AssertionError: entity mention mismatch: Détection vs. Détectio

I'm using Python 3.8 and the latest bconv version. Any help or hints for importing the format would be highly appreciated, because I want to export the dataset to IOB to use it with e.g. Transformers or Flair.

Many thanks,

Stefan

(Also /ccing @pjox who's always interested in French NER 😅 )

lfurrer commented 3 years ago

Hi Stefan,

bconv is being strict about annotation consistency: what it's telling you here is that the text span indicated by the offsets doesn't match the text attribute. I didn't look at the QUAERO data, but it looks a lot like the location.length is one character too short.

You should try the following:

coll = bconv.load(PATH, fmt='bioc_xml', byte_offsets=False)

The thing is that the BioC specs dictate that offsets are counted in octets (bytes) of the UTF-8 encoding of the text. This means that "Détection" has a length of 10, not 9:

>>> len('Détection'.encode('utf8'))
10

Arguably, this way of calculating offsets is inconvenient for modern, Unicode-aware software (in particular Python), so people just ignore the specs and count Unicode codepoints instead. That's why there's an option for this, available for both input and output (see the docs).
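
To make the two counting schemes concrete, here is a minimal sketch in plain Python (no bconv involved). The helper name `byte_span_to_char_span` and the sample sentence are made up for illustration; the sketch only shows how a span is interpreted when offsets are assumed to be UTF-8 bytes, and why a file that actually counts codepoints then comes out one character short.

```python
def byte_span_to_char_span(text, byte_offset, byte_length):
    """Map a UTF-8 byte span to a (start, end) span in codepoints.

    Hypothetical helper for illustration; not part of bconv.
    Raises UnicodeDecodeError if an offset falls inside a multi-byte character.
    """
    data = text.encode('utf8')
    start = len(data[:byte_offset].decode('utf8'))
    end = len(data[:byte_offset + byte_length].decode('utf8'))
    return start, end


text = 'Détection précoce'  # made-up example sentence

# Spec-conforming file: "Détection" is annotated with length 10 (bytes).
start, end = byte_span_to_char_span(text, 0, 10)
mention = text[start:end]

# File that counts codepoints (length 9), but read as if it counted bytes:
# the span ends one character early, as in the error message above.
short_start, short_end = byte_span_to_char_span(text, 0, 9)
truncated = text[short_start:short_end]

print(mention, truncated)
```

With codepoint offsets in the file, no mapping is needed at all, which is what `byte_offsets=False` expresses.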

lfurrer commented 3 years ago

> I'm using Python 3.8 and the latest bconv version. Any help/hints for importing the format would highly be appreciated, because I wanted to export the dataset to IOB to use it with e.g. Transformers or Flair.

When converting to IOB format, please note that the CoNLL output is somewhat ... special if you have overlapping entities in the source (to check this, search for ; in the IOB column after conversion). Overlapping annotations can't easily be represented in IOB; I have plans for adding flattening strategies, but haven't gotten around to implementing them yet.
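
One way to run that check is sketched below. It assumes a tab-separated file in which the IOB tag is the last column of every token row and sentences are separated by blank lines; that layout, the function name `rows_with_overlaps`, and the sample rows are assumptions for illustration, so adjust them to the actual output.

```python
def rows_with_overlaps(lines, sep='\t'):
    """Yield (line_number, row) for token rows whose last column contains ';'.

    Assumes the IOB tag sits in the last column (an assumption about the
    layout, not verified against bconv's CoNLL writer).
    """
    for number, line in enumerate(lines, start=1):
        row = line.rstrip('\n')
        if not row:
            continue  # blank line: sentence separator
        tag = row.split(sep)[-1]
        if ';' in tag:
            yield number, row


# Made-up rows; the labels are placeholders, not real QUAERO output.
sample = [
    'douleurs\tB-DISO\n',
    'chroniques\tI-DISO;B-DISO\n',
    '\n',
    'traitement\tO\n',
]
hits = list(rows_with_overlaps(sample))
print(hits)
```

For a file on disk, the same function can be fed an open file object, e.g. `rows_with_overlaps(open(path, encoding='utf8'))`.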

stefan-it commented 3 years ago

Hi @lfurrer,

thanks for your help!

I tried using the byte_offsets=False parameter, but then there's another error message:

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/__init__.py in load(source, fmt, mode, id_, **options)
     75         fmt = _guess_format(source, LOADERS)
     76     loader = LOADERS[fmt](**options)
---> 77     return _load(loader, mode, source, id_)
     78 
     79 

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/__init__.py in _load(loader, mode, source, id_)
     82         content = loader.iter_documents(source)
     83     else:
---> 84         content = loader.load_one(source, id_)
     85 
     86     if hasattr(loader, 'document'):

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/_load.py in load_one(self, source, id_)
     50 
     51     def load_one(self, source, id_):
---> 52         return self.collection(source, id_)
     53 
     54     def collection(self, source, id_):

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/bioc.py in collection(self, source, id_)
     60         collection.metadata = self._meta_dict(coll_node)
     61         for doc in docs:
---> 62             collection.add_document(self._document(doc))
     63         return collection
     64 

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/bioc.py in _document(self, node)
     80             section = doc[-1]
     81             section.metadata = infon
---> 82             section.add_entities(anno)
     83             # Get infon elements on sentence level.
     84             for sent, sent_node in zip(section,

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/doc/document.py in add_entities(self, entities)
    149                 while entity.start >= sent.end:
    150                     sent = next(sentences)
--> 151                 sent.add_entities((entity,))
    152         except StopIteration:
    153             logging.warning('annotations outside character range')

~/.venvs/flair-2/lib/python3.8/site-packages/bconv/doc/document.py in add_entities(self, entities)
    187         for entity in entities:
    188             term = self.text[entity.start-self.start:entity.end-self.start]
--> 189             assert entity.text == term, \
    190                 'entity mention mismatch: {} vs. {}'.format(entity.text, term)
    191             self.entities.append(entity)

AssertionError: entity mention mismatch: douleurs chroniques vs. douleurs

:thinking:

lfurrer commented 3 years ago

Oh. I presume this is annotation T105 in EMEA_train_bioc: that's a discontinuous annotation, i.e. "douleurs" and "chroniques" are separated by a few words. Unfortunately, bconv can't handle discontinuous annotations yet (and apparently there's a bug that prevents it from simply decomposing the parts into separate entities).
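
The mismatch can be reproduced without bconv. The sketch below uses a made-up sentence and offsets (not the actual QUAERO data): a discontinuous annotation carries several spans, its text attribute joins all parts, and slicing by the first span alone yields only the first part. Splitting the annotation into one contiguous entity per span is one possible workaround; `split_discontinuous` is a hypothetical helper, not a bconv function.

```python
def split_discontinuous(text, spans):
    """Return one contiguous (start, end, mention) triple per span."""
    return [(start, end, text[start:end]) for start, end in spans]


text = 'douleurs aiguës ou chroniques'  # made-up example sentence
spans = [(0, 8), (19, 29)]              # two parts of one annotation
annotation_text = 'douleurs chroniques'  # text attribute of the annotation

# Slicing by the first span only reproduces the reported mismatch.
first_start, first_end = spans[0]
first_part = text[first_start:first_end]
print(annotation_text == first_part)  # the strict check would fail here

# Decomposed into separate entities, every part matches its own span.
parts = split_discontinuous(text, spans)
print(parts)
```

Note that splitting changes the annotation: the two parts become independent entities, so the link between them is lost in the IOB output.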

I'm afraid bconv isn't ready yet for this dataset :disappointed:. For the moment, you're probably best off using Sampo Pyysalo's standoff-to-conll converter for creating IOB annotations from the brat version of QUAERO.

lfurrer commented 3 years ago

Late follow-up: adding support for entity flattening in #7.

lfurrer / bconv

entity mention mismatch error when importing from BioC #2