Open mobashgr opened 2 years ago
Hi Ghadeer,
Converting BioC to CoNLL can be tricky, because CoNLL doesn't have the same expressive power as a stand-off format like BioC. In BioC, annotations may have gaps or they may overlap, but in CoNLL this doesn't work nicely, so simplification is needed, as described in the entity-flattening docs. The "B-Chemicalentity;B-Chemicalentity" looks like you specified avoid_overlaps=None instead of the default "keep-longer" strategy.
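To illustrate what a "keep-longer" style of flattening means, here is a minimal sketch on plain (start, end, label) tuples. It is not bconv's actual implementation; the function name `keep_longer` and the tie-breaking rule are assumptions made for this example only.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end is exclusive


def keep_longer(spans: List[Span]) -> List[Span]:
    """Greedily drop overlapping spans, preferring the longer ones.

    Hypothetical helper for illustration, not part of bconv.
    """
    kept: List[Span] = []
    # Visit the longest spans first; ties go to the earlier start offset.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        # Keep the span only if it is disjoint from everything kept so far.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append(span)
    return sorted(kept)


spans = [(0, 9, "Chemical"), (0, 4, "Chemical"), (20, 27, "Disease")]
print(keep_longer(spans))
```

With overlaps removed this way, every token is covered by at most one entity, so each token gets a single label rather than several labels joined by ";".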
But it's also possible that there's a bug or an edge case I didn't consider. I'm sure I can get to the root of the problem if you provide the code you used for conversion and a minimal excerpt of the BioC file along with the unexpected output (e.g. one paragraph with one or two annotations). The same goes for the many-to-one mappings you mentioned.
Relations aren't supported in the CoNLL format, as far as bconv is concerned at least. I wouldn't know how to represent relations in CoNLL (maybe derive something from the dependency notation used in the scheme for syntax parsing?). Have you seen an example of relations encoded in CoNLL in the bio/med domain?
Hi Lenz, thanks for the heads-up. I will try it out and keep you posted. Sure, I can provide the code I used for the conversion, a sample of the BioC file, and a snippet of the buggy CoNLL output.
Yes, you are completely right, I mixed things up. I meant the simple way to extract relations from BioC XML. The documentation isn't clear to me, or at least doesn't have an example or hint on converting relations (please correct me if I am wrong).
Best, Ghadeer
I had a look at the BC7 corpus and realised it can't be parsed by bconv. I then found out that BioC allows annotations without a <location> element, i.e. entities that aren't text-bound. I wasn't aware of that, but the DTD clearly allows it, so this is definitely an issue in bconv. I'm not sure how to deal with this, because the assumption that entities are anchored in the text is built deep into bconv's data model... I'll try to come up with a solution eventually.
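To check whether a given BioC XML file contains such non-text-bound annotations before converting it, something like the following sketch could be used. It relies only on Python's standard library and the BioC element names `annotation` and `location`; the sample document and the infon value "IndexingTerm" are made up for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up miniature BioC document: annotation 1 is text-bound,
# annotation 2 has no <location> element.
SAMPLE = """<collection><document><id>1</id><passage><offset>0</offset>
<text>Lidocaine was given.</text>
<annotation id="1"><infon key="type">Chemical</infon>
<location offset="0" length="9"/><text>Lidocaine</text></annotation>
<annotation id="2"><infon key="type">IndexingTerm</infon>
<text></text></annotation>
</passage></document></collection>"""


def unanchored_annotation_ids(xml_text):
    """Return the ids of all annotations that lack a <location> child."""
    root = ET.fromstring(xml_text)
    return [
        ann.get("id")
        for ann in root.iter("annotation")
        if ann.find("location") is None
    ]


print(unanchored_annotation_ids(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`.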
For the relations, this little REPL log may be of help:
>>> import bconv
>>> coll = bconv.load('test/data/bioc_xml/BC5CDR-example.xml', fmt='bioc_xml')
>>> doc = coll[0]
>>> doc
<Document with 2 sections at 0x7f567ee241c0>
>>> rel = next(doc.iter_relations())
>>> rel
<Relation with 2 members at 0x7f565ef857c0>
>>> rel.type
'CID'
>>> rel[0]
RelationMember(refid='1', role='Chemical')
>>> rel[1]
RelationMember(refid='2', role='Disease')
>>> entities_by_refid = {e.id: e for e in doc.iter_entities()}
>>> e = entities_by_refid[rel[0].refid]
>>> e
<bconv.doc.document.Entity at 0x7f565ee6a540>
>>> e.text
'Lidocaine'
>>> e.metadata
{'type': 'Chemical', 'cui': 'D008012'}
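The steps in the log can be wrapped into a small helper that lists every relation together with the text of its members. This is only a sketch: it uses just the attributes visible in the log (`iter_relations`, `iter_entities`, `type`, `refid`, `role`, `id`, `text`) and additionally assumes that a relation can be iterated over its members. The stand-in objects below mimic those attributes so the example runs without bconv; any values not shown in the log are made up.

```python
from collections import namedtuple
from types import SimpleNamespace


def relation_rows(doc):
    """Return one (relation type, [(role, refid, text), ...]) pair per relation."""
    entities_by_id = {e.id: e for e in doc.iter_entities()}
    rows = []
    for rel in doc.iter_relations():
        members = []
        for member in rel:  # assumption: iterating a relation yields its members
            entity = entities_by_id.get(member.refid)
            # A refid might not point at an entity of this document.
            text = entity.text if entity is not None else None
            members.append((member.role, member.refid, text))
        rows.append((rel.type, members))
    return rows


# Stand-ins for bconv objects (hypothetical, for demonstration only).
Member = namedtuple("Member", "refid role")


class StubRelation(list):
    def __init__(self, type_, members):
        super().__init__(members)
        self.type = type_


entities = [
    SimpleNamespace(id="1", text="Lidocaine"),
    SimpleNamespace(id="2", text="disease-text"),  # made-up mention text
]
relations = [StubRelation("CID", [Member("1", "Chemical"), Member("2", "Disease")])]
doc = SimpleNamespace(
    iter_entities=lambda: iter(entities),
    iter_relations=lambda: iter(relations),
)

print(relation_rows(doc))
```

With a real document, `doc` would be the object obtained from `bconv.load(...)` as in the log, and the rows could then be written out in whatever relation format is needed.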
bconv was able to convert NLM-Chem (track 2) in BC7, for example, but I had to do a lot of post-processing, as I explained before. This corpus has <location> elements. I will try the solution that you have proposed and get back to you. Many thanks for the heads-up.
Hi Lenz! Great library and a life saver (Y). However, I want to point out that I have been doing extensive post-processing after converting BioC XML to CoNLL, even when I set byte_offsets to False. Briefly, the first problem is that many tokens and their corresponding labels are emitted together as if they were a single token. The second problem is that some labels look like "B-Chemicalentity;B-Chemicalentity". Here is the corpus that I am using.
Regarding the relations, can you provide an example or hints beyond those in the documentation for converting relations from BioC XML to CoNLL?
Best, Ghadeer