Closed stefan-it closed 3 years ago
Hi Stefan,
bconv is being strict about annotation consistency: what it's telling you here is that the text span indicated by the offsets doesn't match the text attribute. I didn't look at the QUAERO data, but it looks a lot like the location.length is one character too short.
You should try the following:
coll = bconv.load(PATH, fmt='bioc_xml', byte_offsets=False)
The thing is that the BioC specs dictate that offsets are counted in octets (bytes) of the UTF-8 encoding of the text.
This means that "Détection" has a length of 10, not 9:
>>> len('Détection'.encode('utf8'))
10
Arguably, this way of calculating offsets is inconvenient for modern, Unicode-aware software (in particular Python), so people just ignore the specs and count Unicode codepoints instead. That's why there's an option for this, available for both input and output (see the docs).
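To make the difference concrete, here is a minimal sketch of converting a byte offset into a code-point offset by hand. The helper name `byte_to_codepoint_offset` and the sample string are made up for illustration; this is not how bconv does it internally, just the underlying arithmetic.

```python
def byte_to_codepoint_offset(text, byte_offset):
    """Map an offset counted in UTF-8 bytes to one counted in code points.

    Hypothetical helper, not part of bconv. Raises UnicodeDecodeError if
    byte_offset falls in the middle of a multi-byte character.
    """
    encoded = text.encode('utf-8')
    # Decode only the prefix up to the byte offset and count its characters.
    return len(encoded[:byte_offset].decode('utf-8'))


text = 'Détection précoce'  # invented sample text
# Where "précoce" starts, counted in bytes (as the BioC specs prescribe) ...
byte_start = text.encode('utf-8').index('précoce'.encode('utf-8'))
# ... and counted in code points (what Python string slicing expects).
char_start = byte_to_codepoint_offset(text, byte_start)
print(byte_start, char_start)  # 11 10
```

The two numbers drift apart by one for every two-byte character that precedes the annotation, which is why a mismatch may only show up partway through a document.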
> I'm using Python 3.8 and the latest bconv version. Any help/hints for importing the format would be highly appreciated, because I wanted to export the dataset to IOB to use it with e.g. Transformers or Flair.
When converting to IOB format, please note that the CoNLL output is somewhat ... special if you have overlapping entities in the source (to check this, search for ";" in the IOB column after conversion).
Overlapping annotations can't easily be represented in IOB; I have plans for adding flattening strategies, but haven't gotten around to implementing them yet.
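For illustration, one possible flattening strategy is "keep the longest span and drop anything that overlaps it". The sketch below is only an assumption about what such a strategy could look like, not bconv's implementation; the function name and the `(start, end, label)` tuple layout are invented for the example.

```python
def flatten_keep_longest(spans):
    """Return a non-overlapping subset of spans, preferring longer ones.

    spans: iterable of (start, end, label) tuples with end exclusive.
    Hypothetical helper, not part of bconv.
    """
    kept = []
    # Visit the longest spans first; ties go to the earlier start offset.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end = span[0], span[1]
        # Keep the span only if it overlaps none of the spans kept so far.
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


# Invented spans: the second one is nested inside the first.
spans = [(0, 19, 'DISO'), (0, 8, 'DISO'), (25, 30, 'ANAT')]
flat = flatten_keep_longest(spans)
print(flat)
```

After a step like this, every token belongs to at most one entity, so a single B/I/O tag per token is enough.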
Hi @lfurrer,
thanks for your help!
I tried using the byte_offsets=False parameter, but then there's another error message:
~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/__init__.py in load(source, fmt, mode, id_, **options)
75 fmt = _guess_format(source, LOADERS)
76 loader = LOADERS[fmt](**options)
---> 77 return _load(loader, mode, source, id_)
78
79
~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/__init__.py in _load(loader, mode, source, id_)
82 content = loader.iter_documents(source)
83 else:
---> 84 content = loader.load_one(source, id_)
85
86 if hasattr(loader, 'document'):
~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/_load.py in load_one(self, source, id_)
50
51 def load_one(self, source, id_):
---> 52 return self.collection(source, id_)
53
54 def collection(self, source, id_):
~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/bioc.py in collection(self, source, id_)
60 collection.metadata = self._meta_dict(coll_node)
61 for doc in docs:
---> 62 collection.add_document(self._document(doc))
63 return collection
64
~/.venvs/flair-2/lib/python3.8/site-packages/bconv/fmt/bioc.py in _document(self, node)
80 section = doc[-1]
81 section.metadata = infon
---> 82 section.add_entities(anno)
83 # Get infon elements on sentence level.
84 for sent, sent_node in zip(section,
~/.venvs/flair-2/lib/python3.8/site-packages/bconv/doc/document.py in add_entities(self, entities)
149 while entity.start >= sent.end:
150 sent = next(sentences)
--> 151 sent.add_entities((entity,))
152 except StopIteration:
153 logging.warning('annotations outside character range')
~/.venvs/flair-2/lib/python3.8/site-packages/bconv/doc/document.py in add_entities(self, entities)
187 for entity in entities:
188 term = self.text[entity.start-self.start:entity.end-self.start]
--> 189 assert entity.text == term, \
190 'entity mention mismatch: {} vs. {}'.format(entity.text, term)
191 self.entities.append(entity)
AssertionError: entity mention mismatch: douleurs chroniques vs. douleurs
:thinking:
Oh. I presume this is annotation T105 in EMEA_train_bioc: that's a discontinuous annotation, i.e. "douleurs" and "chroniques" are separated by a few words. Unfortunately, bconv can't handle discontinuous annotations yet (and apparently there's a bug that prevents it from simply decomposing the parts into separate entities).
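To show why the assertion in the traceback fires, here is a small sketch with an invented sentence (not the actual QUAERO text) and made-up offsets. The helper `split_parts` is hypothetical; it only illustrates what decomposing a discontinuous annotation into separate contiguous entities would mean.

```python
text = 'douleurs intenses et chroniques'  # invented sentence, not from QUAERO
mention = 'douleurs chroniques'           # annotated text with the gap left out
parts = [(0, 8), (21, 31)]                # one (start, end) pair per fragment

# A contiguous-span check only looks at one slice, so it cannot match the
# full mention: this is the "douleurs chroniques vs. douleurs" situation.
first_start, first_end = parts[0]
contiguous_ok = text[first_start:first_end] == mention


def split_parts(text, parts):
    """Turn each fragment into its own (start, end, text) entity.

    Hypothetical helper, not part of bconv.
    """
    return [(start, end, text[start:end]) for start, end in parts]


fragments = split_parts(text, parts)
print(contiguous_ok, fragments)
```

Each fragment on its own passes a slice-equals-text check, at the cost of losing the information that the fragments belong to one entity.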
I'm afraid bconv isn't ready yet for this dataset :disappointed:.
For the moment, you're probably best off using Sampo Pyysalo's standoff-to-conll converter for creating IOB annotations from the brat version of QUAERO.
Late follow-up: adding support for entity flattening in #7.
Hi @lfurrer,
thanks for open-sourcing the bconv library :heart:
I'm currently trying to evaluate a self-trained ELECTRA model for French, and publicly available NER corpora for French are quite... limited.
I found the QUAERO French Medical Corpus, which is publicly available in both brat and BioC formats.
So I just downloaded and extracted the BioC annotations:
to use them in bconv:
But then the following error message is thrown:
I'm using Python 3.8 and the latest bconv version. Any help/hints for importing the format would be highly appreciated, because I wanted to export the dataset to IOB to use it with e.g. Transformers or Flair.
Many thanks,
Stefan
(Also /ccing @pjox who's always interested in French NER 😅 )