quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Support the TEI format #159

Open koheiw opened 4 years ago

koheiw commented 4 years ago

I recently learned that the TEI XML format is becoming popular in the linguistics community. In this format, texts are saved in small chunks with associated meta information (e.g. speaker) and, sometimes, POS tags.

See: https://tei-c.org/ https://tei-c.org/activities/projects/ https://dracor.org/
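
For illustration only, here is a minimal sketch of what such chunks can look like and how they might be flattened into one row per chunk with the speaker as metadata. It uses Python's standard library rather than R. The sample document is invented, and `tei_speeches` and the `doc_id`/`speaker`/`text` field names are hypothetical; they are not part of readtext or of any TEI tool.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents use this XML namespace.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}

# Invented sample: two speeches (<sp>), each with a speaker reference
# in the `who` attribute and one verse line (<l>).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <sp who="#hamlet"><speaker>Hamlet</speaker><l>To be, or not to be.</l></sp>
    <sp who="#ophelia"><speaker>Ophelia</speaker><l>Good my lord.</l></sp>
  </body></text>
</TEI>"""


def tei_speeches(xml_string):
    """Return one dict per <sp> element: doc_id, speaker, text (hypothetical layout)."""
    root = ET.fromstring(xml_string)
    # Only verse lines and paragraphs count as text; <speaker> labels are skipped.
    keep = {f"{{{TEI_URI}}}l", f"{{{TEI_URI}}}p"}
    rows = []
    for i, sp in enumerate(root.iterfind(".//tei:sp", TEI_NS), start=1):
        # Collect text of direct <l>/<p> children in document order,
        # collapsing internal whitespace.
        parts = [" ".join("".join(el.itertext()).split()) for el in sp if el.tag in keep]
        rows.append({
            "doc_id": f"sp{i}",
            "speaker": sp.get("who", "").lstrip("#"),
            "text": " ".join(parts),
        })
    return rows


rows = tei_speeches(SAMPLE)
for row in rows:
    print(row)
```

Each row corresponds to one text chunk plus its metadata, which is roughly the shape a data-frame-based reader would need. Real TEI files vary widely in structure, so the element names handled here are an assumption about one common drama encoding.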

kbenoit commented 4 years ago

Great idea. There is a package called tei2r (https://github.com/michaelgavin/tei2r/tree/master/R), but it looks pretty inactive.

sdspieg commented 3 years ago

This would be cool, not least because tools like GROBID let you parse out things like references, headers, and footers and save them as TEI XML. [I'm just starting to look into quanteda, so sorry if quanteda can already do this natively.]