navigating-stories / notebooks

Determine storage format for text files #15

Closed maltelueken closed 11 months ago

maltelueken commented 2 years ago

In what format will we store the data (text) files of the project, including their metadata?

Consider:

Comparison of different storage formats in this document: https://www.overleaf.com/project/6213b346e75ce35ccc7d811b

Addition by Kevin (bullet point duplicate in CLARIAH story):

maltelueken commented 2 years ago

The overleaf document covers XML, YAML, JSON, and tabular formats (e.g., TSV, CSV) so far. Is this sufficient or should I look for more formats?
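To make the comparison a bit more concrete, here is a minimal sketch of the same text record stored as JSON and as TSV, using only the Python standard library. The record and its field names (`id`, `language`, `source`, `text`) are hypothetical and only for illustration; they are not the project's actual schema.

```python
import csv
import io
import json

# Hypothetical record: the field names are illustrative, not the project's schema.
record = {
    "id": "story-001",
    "language": "nl",
    "source": "example",
    "text": "Er was eens een verhaal.\nHet had twee regels.",
}


def to_json(rec):
    """Serialize one record as JSON; non-ASCII characters are kept readable."""
    return json.dumps(rec, ensure_ascii=False, indent=2)


def from_json(data):
    """Parse a JSON string back into a dict."""
    return json.loads(data)


def to_tsv(rec):
    """Serialize one record as a TSV table with a header row and one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rec), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerow(rec)  # fields containing tabs or newlines get quoted
    return buffer.getvalue()


def from_tsv(data):
    """Parse the first data row of a TSV string back into a dict."""
    reader = csv.DictReader(io.StringIO(data, newline=""), delimiter="\t")
    return next(reader)


json_text = to_json(record)
tsv_text = to_tsv(record)
```

One thing the sketch illustrates: in the tabular format, metadata and text share a single flat row, and multi-line text only survives because the `csv` module quotes it, whereas JSON (like XML or YAML) could also hold nested metadata.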

maltelueken commented 2 years ago

I also found two standards (XML) that might be useful for this project:

kevinpijpers commented 2 years ago

Thanks Malte, I think this sums up the basic formats we should consider.

TEI looks interesting, very mature, maintained, and with incredibly detailed documentation. I am interested in learning more about TEI.

REFI also looks interesting, but (as you say) it might only be useful if we really decide to continue with a QDAS application. Furthermore, it is already possible to export a project from Atlas.ti in a specific QDA-XML format (.qdpx) for use between applications, and I'm curious how this relates to REFI. So I'd table that for now.

Another one of interest might be Resource Description Framework (RDF/XML). An example of this is the Lemon model, which also employs LMF ("LMF is the ISO standard for Natural Language Processing (NLP) lexicons and Machine Readable Dictionaries (MRD)"). Lemon splits the 'ontological entity' (the thing) and the 'lexical sense' of that thing, and builds on that, which has some advantages. However, Lemon is a model built on existing standards (and a Java API), and in that sense we might need to open another task if we want to investigate this further.

kevinpijpers commented 2 years ago

@maltelueken Could you also look at FoLiA, developed at Radboud University? It is a standard for encoding NLP annotations of text, and a kind of alternative to TEI.

maltelueken commented 2 years ago

Yes, I will look at it!

maltelueken commented 2 years ago

I found this nice paper which compares FoLiA to other linguistic annotation formats. They are mostly XML-based. FoLiA seems to be more of a framework for how to store annotated text (or other) data in XML files, whereas TEI for instance is more specific about the different annotation tags and metadata entries. Taken together, these formats are for annotated data and not necessarily for raw text data as described in the document. Given that they are all based on XML, we should probably choose XML as our raw data storage format for consistency.
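As a rough illustration of what plain XML storage for raw text could look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`document`, `metadata`, `field`, `text`) are assumptions made up for this example; they do not follow TEI, FoLiA, or any agreed project schema.

```python
import xml.etree.ElementTree as ET


def build_document(doc_id, text, metadata):
    """Build a hypothetical <document> element holding metadata and raw text."""
    root = ET.Element("document", attrib={"id": doc_id})
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        field = ET.SubElement(meta, "field", attrib={"name": key})
        field.text = value
    body = ET.SubElement(root, "text")
    body.text = text  # special characters are escaped on serialization
    return root


def parse_document(xml_string):
    """Read the id, raw text, and metadata back from the XML string."""
    root = ET.fromstring(xml_string)
    metadata = {field.get("name"): field.text for field in root.find("metadata")}
    return root.get("id"), root.find("text").text, metadata


raw_text = "Er was eens <een> verhaal & meer."
meta = {"language": "nl", "source": "example"}

xml_string = ET.tostring(build_document("story-001", raw_text, meta), encoding="unicode")
doc_id, text, parsed_meta = parse_document(xml_string)
```

The raw text and the metadata come back unchanged after a round trip, including characters such as `<` and `&` that XML has to escape. A real setup would presumably also validate files against a schema, which this sketch does not do.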

For the storage of annotated text (as the output of our NLP pipelines), FoLiA seems to be a very promising candidate (see paper, section 2). The main points are that it is flexible, somewhat human-readable, and explicit (it can be validated). Moreover, there are many Dutch tools that interoperate with FoLiA, like BlackLab, Brat, and Frog, and there is even a Python package for the format. It also has an extension for spaCy.

The problem with most of these formats is that they are very complex, and it might take users a considerable amount of time to learn them.