I recently learned that the TEI XML format is becoming popular in the linguistics community. In this format, texts are saved in small chunks with associated metadata (e.g. speaker) and, sometimes, POS tags.
This would be cool, not least because tools like GROBID let you parse out things like references and headers/footers and save the result as TEI XML. [I'm just starting to look into quanteda, so sorry if quanteda can already do this natively.]
See: https://tei-c.org/ https://tei-c.org/activities/projects/ https://dracor.org/
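To make the idea concrete, here is a rough sketch in Python (standard library only) of pulling the chunks, speaker metadata and POS tags out of a TEI file. The sample document is made up for illustration, and `read_speeches` is a hypothetical helper, not something quanteda or GROBID provides. It assumes speeches are marked up as `<sp who="...">` elements and tokens as `<w pos="...">` in the TEI namespace; real corpora may be encoded differently.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: two speeches, each token tagged with a POS label.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <sp who="#hamlet">
      <speaker>Hamlet</speaker>
      <l><w pos="PART">To</w> <w pos="VERB">be</w></l>
    </sp>
    <sp who="#horatio">
      <speaker>Horatio</speaker>
      <l><w pos="INTJ">Hail</w></l>
    </sp>
  </body></text>
</TEI>"""


def read_speeches(tei_xml):
    """Return one dict per <sp> chunk: speaker id, joined text, POS tags."""
    root = ET.fromstring(tei_xml)
    rows = []
    for sp in root.iterfind(".//tei:sp", TEI_NS):
        words = sp.findall(".//tei:w", TEI_NS)
        rows.append({
            # The who attribute is a pointer such as "#hamlet"; drop the "#".
            "speaker": sp.get("who", "").lstrip("#"),
            "text": " ".join(w.text or "" for w in words),
            "pos": [w.get("pos") for w in words],
        })
    return rows


for row in read_speeches(SAMPLE):
    print(row["speaker"], "|", row["text"], "|", row["pos"])
```

Each row is one text chunk with its metadata, which seems like a natural fit for a corpus with document-level variables.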