Closed maltelueken closed 11 months ago
The Overleaf document covers XML, YAML, JSON, and tabular formats (e.g., TSV, CSV) so far. Is this sufficient, or should I look for more formats?
Thanks Malte, I think this sums up the basic formats we should consider.
TEI looks interesting, very mature, maintained, and with incredibly detailed documentation. I am interested in learning more about TEI.
REFI also looks interesting, but (as you say) it might only be useful if we really decide to continue with a QDAS application. Furthermore, it is already possible to export a project in Atlas.ti in a specific QDA-XML format (.qdpx) for use between applications, and I'm curious how this relates to REFI. So I'd table that for now.
Another one of interest might be Resource Description Framework (RDF/XML). An example of this is the Lemon model, which also employs LMF ("LMF is the ISO standard for Natural Language Processing (NLP) lexicons and Machine Readable Dictionaries (MRD)"). Lemon splits the 'ontological entity' (the thing) and the 'lexical sense' of that thing, and builds on that, which has some advantages. However, Lemon is a model built on existing standards (and a Java API), and in that sense we might need to open another task if we want to investigate this further.
@maltelueken Could you also look at FoLiA, developed at Radboud University? It is a standard for encoding NLP annotations of text, and a kind of alternative to TEI.
Yes I will look at it!
I found this nice paper which compares FoLiA to other linguistic annotation formats. They are mostly XML-based. FoLiA seems to be more of a framework for how to store annotated text (or other) data in XML files, whereas TEI for instance is more specific about the different annotation tags and metadata entries. Taken together, these formats are for annotated data and not necessarily for raw text data as described in the document. Given that they are all based on XML, we should probably choose XML as our raw data storage format for consistency.
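To make the "raw text plus metadata in XML" idea concrete, here is a minimal sketch using only Python's standard library. The element names (`document`, `metadata`, `field`, `text`) and the helper functions are hypothetical, chosen for illustration; they are not taken from TEI, FoLiA, or the Overleaf document.

```python
import xml.etree.ElementTree as ET


def build_document(text, metadata):
    """Wrap raw text and a flat metadata dict in a small XML tree.

    The schema here is an assumption made for this example only.
    """
    root = ET.Element("document")
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        # One <field name="..."> element per metadata entry.
        ET.SubElement(meta, "field", name=key).text = value
    ET.SubElement(root, "text").text = text
    return root


def read_document(xml_string):
    """Parse the XML produced by build_document back into (text, metadata)."""
    root = ET.fromstring(xml_string)
    metadata = {field.get("name"): field.text for field in root.find("metadata")}
    return root.findtext("text"), metadata


# Round trip: special characters such as '<' and '&' are escaped on write
# and restored on read by ElementTree.
raw_text = "Dit is een voorbeeld: a < b & c."
raw_metadata = {"language": "nl", "source": "example"}

xml_string = ET.tostring(build_document(raw_text, raw_metadata), encoding="unicode")
text, metadata = read_document(xml_string)
```

A layout like this keeps the raw text and its metadata in one file and could later be mapped onto a richer XML-based annotation format if we decide to adopt one.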
For the storage of annotated text (as the output of our NLP pipelines), FoLiA seems to be a very promising candidate (see paper, section 2). The main points are that it is flexible, somewhat human-readable, and explicit (can be validated). Moreover, there are many Dutch tools that interoperate with FoLiA, like BlackLab, Brat, and Frog, and there is even a Python package for the format. It also has an extension for spaCy.
The problem with most of these formats is that they are very complex, and it might take users a considerable amount of time to learn them.
In what format will we store the project's data (text) files, including metadata?
Consider:
Comparison of different storage formats in this document: https://www.overleaf.com/project/6213b346e75ce35ccc7d811b
Addition by Kevin (bullet point duplicate in CLARIAH story):