proycon / foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
GNU General Public License v3.0

Question: Converting between FoLiA and UIMA CAS XMI XML #47

Open pirolen opened 2 years ago

pirolen commented 2 years ago

Would it be an idea to investigate the interoperability between the FoLiA and the "UIMA CAS XMI XML" formats? If I understand it right, this would allow data exchange between the FoLiA and the UIMA ecosystems.

Would it be of interest to the community, and would foliapy and dkpro-cassis (https://github.com/dkpro/dkpro-cassis) be instrumental for this?

Many thanks for any pointers!

pirolen commented 2 years ago

P.S. This question came up while looking at possible data exchange between FoLiA and INCEpTION (https://github.com/inception-project/inception).

Other data formats that would allow this are CoNLL-U or TEI5. I wonder which is most practical in a named entity tagging/linking scenario.

proycon commented 2 years ago

> Would it be of interest to the community, and would foliapy and dkpro-cassis (https://github.com/dkpro/dkpro-cassis) be instrumental for this?

That library looks promising, yeah. In combination with foliapy, a converter could be implemented. The main problem is finding a mapping from the various FoLiA structures to UIMA CAS and vice versa, which is often far from trivial.
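To give an idea of one part of that mapping problem: FoLiA keeps annotations inline in the XML, while UIMA CAS annotations are stand-off spans (begin/end offsets) over a single document text. Below is a minimal sketch of that step using only the Python standard library, not foliapy or dkpro-cassis. The sample fragment is hypothetical and heavily simplified (no metadata or annotation declarations), and joining tokens with a single space is an assumption; a real converter would have to respect FoLiA's own spacing information.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"

# Hypothetical, heavily simplified FoLiA fragment (no metadata or declarations).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>Hello</t></w>
      <w xml:id="example.s.1.w.2"><t>world</t></w>
    </s>
  </text>
</FoLiA>"""


def folia_to_standoff(xml_string):
    """Return (document_text, annotations) with UIMA-style begin/end offsets."""
    root = ET.fromstring(xml_string)
    text = ""
    annotations = []
    for sentence in root.iter(FOLIA_NS + "s"):
        sentence_begin = None
        for word in sentence.iter(FOLIA_NS + "w"):
            t = word.find(FOLIA_NS + "t")
            if t is None or not t.text:
                continue
            if text:
                text += " "  # assumption: tokens are separated by one space
            begin = len(text)
            text += t.text
            annotations.append(("Token", begin, len(text)))
            if sentence_begin is None:
                sentence_begin = begin
        if sentence_begin is not None:
            annotations.append(("Sentence", sentence_begin, len(text)))
    return text, annotations


text, annotations = folia_to_standoff(SAMPLE)
print(text)
print(annotations)
```

The resulting text and spans are the kind of data that would then be written into a CAS (for instance with dkpro-cassis), which this sketch deliberately leaves out.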

> Other data formats that would allow this are CoNLL-U or TEI5. I wonder which is most practical in a named entity tagging/linking scenario.

CoNLL-U is significantly simpler, so converting it from/to FoLiA is doable; there's already a tool in foliatools for it.

pirolen commented 2 years ago

> That library looks promising, yeah. In combination with foliapy, a converter could be implemented. The main problem is finding a mapping from the various FoLiA structures to UIMA CAS and vice versa, which is often far from trivial.

I believe so; perhaps there is no need to prioritize this.

reckart commented 9 months ago

> That library looks promising, yeah. In combination with foliapy, a converter could be implemented. The main problem is finding a mapping from the various FoLiA structures to UIMA CAS and vice versa, which is often far from trivial.

UIMA is agnostic to the annotation schema - it just provides the means of defining a schema and working with the annotated texts.

There are other projects like DKPro Core that provide type systems.

Additionally, there are annotation tools like INCEpTION that allow the user to define their own annotation schema (called "layers" in INCEpTION) and then export/import that to/from UIMA CAS.

If I am not mistaken, FoLiA is a fully specified format that does not support "custom annotation types" - all elements are provided by the FoLiA spec and other elements are not supported. So if I am correct and there is no support for custom annotation types in FoLiA, a fully generic mapping from UIMA CAS to FoLiA or from INCEpTION custom annotation layers to FoLiA would not be possible.

👉 FoLiA <-> UIMA CAS (DKPro Core) -- It should be possible to map a bunch of the FoLiA annotation types to/from the DKPro Core types (paragraph, sentence, token, lemma, etc.) - not fully, but at least to some degree. It would be interesting to figure out to which degree.

👉 Tooling interoperability -- Since e.g. INCEpTION knows the DKPro Core types, that would also make it easy to use the mapped data in the annotation tool. Similarly, it would make it possible, to some degree, to use texts annotated with INCEpTION or processed with DKPro Core with the FoLiA tools.
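As a rough illustration of what "to some degree" could look like for the type mapping mentioned above, here is a small sketch of a lookup table from FoLiA element names to DKPro Core type names, plus a helper that reports which elements have no counterpart in the table. The table itself is an unvetted assumption for illustration only, and `split_by_mappability` is a hypothetical helper name; which FoLiA elements can really be mapped would need to be worked out properly.

```python
DKPRO = "de.tudarmstadt.ukp.dkpro.core.api."

# Illustrative, unvetted mapping from FoLiA element names to DKPro Core type names.
FOLIA_TO_DKPRO = {
    "p": DKPRO + "segmentation.type.Paragraph",
    "s": DKPRO + "segmentation.type.Sentence",
    "w": DKPRO + "segmentation.type.Token",
    "lemma": DKPRO + "segmentation.type.Lemma",
    "pos": DKPRO + "lexmorph.type.pos.POS",
    "entity": DKPRO + "ner.type.NamedEntity",
}


def split_by_mappability(folia_tags):
    """Partition FoLiA element names into (mapped, unmapped) using the table above."""
    mapped = {tag: FOLIA_TO_DKPRO[tag] for tag in folia_tags if tag in FOLIA_TO_DKPRO}
    unmapped = [tag for tag in folia_tags if tag not in FOLIA_TO_DKPRO]
    return mapped, unmapped


mapped, unmapped = split_by_mappability(["s", "w", "lemma", "morpheme", "su"])
print(sorted(mapped))  # ['lemma', 's', 'w']
print(unmapped)        # ['morpheme', 'su']
```

Reporting the unmapped elements explicitly, rather than dropping them silently, would make it easier to see how much of a given document survives a round trip.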

reckart commented 9 months ago

Btw. if anybody has implemented any conversions between FoLiA and UIMA CAS, it would be great if you could share them (e.g. link them here) for others to use as potential starting points for their own conversions or for more complete conversions.

proycon commented 9 months ago

> If I am not mistaken, FoLiA is a fully specified format that does not support "custom annotation types" - all elements are provided by the FoLiA spec and other elements are not supported. So if I am correct and there is no support for custom annotation types in FoLiA, a fully generic mapping from UIMA CAS to FoLiA or from INCEpTION custom annotation layers to FoLiA would not be possible.

Correct, FoLiA defines types for various kinds of structural and linguistic annotation. It does not, however, define the tagsets used for linguistic annotation; those are user-defined. So we define, for example, the concept "part-of-speech annotation", and the user determines what tagset to use with it (for which there are formal structures available). I'm not very familiar with DKPro Core, but this looks similar in scope.
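A small sketch of what that split means for a converter: the annotation type (part-of-speech) is fixed, while the tag and the tagset it belongs to come from the document. The fragment, the set URL and the `tagset` key below are hypothetical; `PosValue` is the feature name that DKPro Core's POS type uses for the tag. The sketch assumes the set is given directly on the `pos` element, which need not be the case in a real document.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"

# Hypothetical fragment: the element says "this is a PoS annotation",
# the class and set say which tag from which user-defined tagset.
WORD = (
    '<w xmlns="http://ilk.uvt.nl/folia" xml:id="example.w.1">'
    "<t>house</t>"
    '<pos class="NOUN" set="https://example.org/hypothetical-pos-set"/>'
    "</w>"
)


def pos_features(word_xml):
    """Extract the PoS tag and its tagset from a single FoLiA word element."""
    word = ET.fromstring(word_xml)
    pos = word.find(FOLIA_NS + "pos")
    if pos is None:
        return None
    # "tagset" is a hypothetical key; where to keep it on the UIMA side is an open question.
    return {"PosValue": pos.get("class"), "tagset": pos.get("set")}


print(pos_features(WORD))
```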

> 👉 FoLiA <-> UIMA CAS (DKPro Core) -- It should be possible to map a bunch of the FoLiA annotation types to/from the DKPro Core types (paragraph, sentence, token, lemma, etc.) - not fully, but at least to some degree. It would be interesting to figure out to which degree.

Indeed, that sounds doable.

> 👉 Tooling interoperability -- Since INCEpTION knows the DKPro Core types, that would also make it easy to use the mapped data in the annotation tool. Similarly, it would make it possible, to some degree, to use texts annotated with INCEpTION or processed with DKPro Core with the FoLiA tools.

Having such interoperability would be quite nice, yes.