Determine storage format for text files

maltelueken commented 2 years ago

In what format will we store the project's data (text) files, including their metadata?

Consider:

existing standards
compatibility
future integration with other systems if possible
ease of use

Comparison of different storage formats in this document: https://www.overleaf.com/project/6213b346e75ce35ccc7d811b

Addition by Kevin (this bullet point is duplicated in the CLARIAH story):

Ineo is the new front-end for CLARIAH tools (including the Media Suite). It is supposed to be delivered in early 2022. An import function for Ineo that uses JSON is under discussion. This might affect the data standards we want to use if we decide to move toward CLARIAH.

maltelueken commented 2 years ago

The Overleaf document covers XML, YAML, JSON, and tabular formats (e.g., TSV, CSV) so far. Is this sufficient, or should I look for more formats?
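To make the comparison concrete, here is a minimal sketch of one text record with metadata written to three of these formats using only the Python standard library. The record and its field names (`id`, `language`, `source`, `text`) and the XML element names are hypothetical placeholders, not a proposed schema. YAML is left out because it needs a third-party package.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical record: the field names are illustrative, not a project schema.
record = {
    "id": "story-001",
    "language": "nl",
    "source": "example",
    "text": "Dit is een voorbeeld.",
}


def to_json(rec):
    """Serialize the record as a JSON object (metadata and text side by side)."""
    return json.dumps(rec, ensure_ascii=False, indent=2)


def to_tsv(rec):
    """Serialize the record as a header row plus one tab-separated data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rec), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()


def to_xml(rec):
    """Serialize the record as XML with a separate (assumed) metadata element."""
    root = ET.Element("document", attrib={"id": rec["id"]})
    meta = ET.SubElement(root, "metadata")
    for key in ("language", "source"):
        ET.SubElement(meta, key).text = rec[key]
    ET.SubElement(root, "text").text = rec["text"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_json(record))
    print(to_tsv(record))
    print(to_xml(record))
```

The tabular version keeps metadata and text in one flat row, while JSON and XML can nest the metadata, which may matter once the metadata grows beyond a few fields.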

maltelueken commented 2 years ago

I also found two XML-based standards that might be useful for this project:

TEI: This could apply to data storage in XML
REFI: This applies more to data that has been processed by qualitative analysis software

kevinpijpers commented 2 years ago

Thanks Malte, I think this sums up the basic formats we should consider.

TEI looks interesting, very mature, maintained, and with incredibly detailed documentation. I am interested in learning more about TEI.

REFI also looks interesting, but (as you say) it might only be useful if we really decide to continue with a QDAS application. Furthermore, it is already possible to export a project from Atlas.ti in a specific QDA-XML format (.qdpx) for exchange between applications, and I'm curious how this relates to REFI. So I'd table that for now.

Another one of interest might be the Resource Description Framework (RDF/XML). An example of this is the Lemon model, which also employs LMF ("LMF is the ISO standard for Natural Language Processing (NLP) lexicons and Machine Readable Dictionaries (MRD)"). Lemon separates the 'ontological entity' (the thing) from the 'lexical sense' of that thing and builds on that separation, which has some advantages. However, Lemon is a model built on existing standards (and a Java API), so we might need to open another task if we want to investigate it further.

kevinpijpers commented 2 years ago

@maltelueken Could you also look at FoLiA, developed at Radboud University? It is a standard for encoding NLP annotations of text and a kind of alternative to TEI.

maltelueken commented 2 years ago

Yes, I will look at it!

maltelueken commented 2 years ago

I found this nice paper that compares FoLiA to other linguistic annotation formats, most of which are XML-based. FoLiA seems to be more of a framework for how to store annotated text (or other) data in XML files, whereas TEI, for instance, is more specific about the different annotation tags and metadata entries. Taken together, these formats are meant for annotated data and not necessarily for raw text data as described in the document. Given that most of them are based on XML, we should probably choose XML as our raw data storage format for consistency.

For the storage of annotated text (as the output of our NLP pipelines), FoLiA seems to be a very promising candidate (see paper, section 2). The main points are that it is flexible, somewhat human-readable, and explicit (it can be validated). Moreover, there are many Dutch tools that interoperate with FoLiA, like BlackLab, Brat, and Frog, and there is even a Python package for the format. It also has an extension for spaCy.
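To illustrate the difference between raw text and annotated text in XML, here is a rough sketch using only the Python standard library. It is deliberately not FoLiA: the element names (`s`, `w`), the `pos` attribute, and the `toy_tagger` function are assumptions made up for this example, and a real pipeline would use the FoLiA Python package and an actual tagger instead.

```python
import xml.etree.ElementTree as ET


def toy_tagger(token):
    """Placeholder for a real NLP pipeline: only separates punctuation from words."""
    return "WORD" if token.isalnum() else "PUNCT"


def annotate(text, tagger):
    """Wrap each whitespace-separated token in a <w> element with a 'pos' attribute.

    The layout is a simplified illustration of inline annotation, not valid FoLiA.
    """
    sentence = ET.Element("s")
    for i, token in enumerate(text.split(), start=1):
        word = ET.SubElement(
            sentence, "w", attrib={"id": f"w{i}", "pos": tagger(token)}
        )
        word.text = token
    return sentence


raw_text = "Dit is een voorbeeld ."
sentence = annotate(raw_text, toy_tagger)
xml_string = ET.tostring(sentence, encoding="unicode")

if __name__ == "__main__":
    print(xml_string)
```

Even in this toy layout the annotated file is much more verbose than the raw text, which hints at why the full formats take time to learn.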

The problem with most of these formats is that they are very complex, and it might take users a considerable amount of time to learn them.
