Week in Review

Since last Monday we:

committed the majority of our corpus (raw text) into the repo
generated our first TEI schema
marked up two interviews for their structural elements
started thinking about ways to mark up linguistic features (feel free to check out /research/discourseTerms for a taste), and started marking up one text with these features
looked into lemmatization with Mystem (the site is in Russian, but the creator has written an English whitepaper, if you're into computational linguistics)
drew a wireframe for our basic site layout

Along the way, we started running into issues keeping our XML consistent, but it's early in the process, so we're going to tighten that up this week.

By this time next week, we're going to:

make our basic website layout in HTML/CSS and push it to obdurodon
add linguistic elements to our ODD specs, and tighten down the schema a bit more (see /odd/)
mark up the structural elements of all 5 Putin documents in TEI (results found at /xml/putin/)
standardize our transliteration to Library of Congress w/o diacritics
write XSLT to automate production of input for Mystem (found at /util/)
write XSLT to transform Mystem output back into TEI-conformant XML
research using MALLET with Russian texts for topic modelling
commit the remaining few documents into the corpus in raw text format (located at /raw/)
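
For the transliteration item above, here is a rough Python sketch of what "Library of Congress w/o diacritics" could look like in practice. It is only an illustration, not the project's actual tooling: the names `LOC_MAP` and `transliterate` are hypothetical, and it assumes the soft sign becomes an apostrophe and the hard sign a double quote (we may decide to drop them entirely).

```python
# Simplified ALA-LC (Library of Congress) romanization for Russian,
# with diacritics and ligature marks dropped.
# Assumption: soft sign -> ' and hard sign -> " (they could also be dropped).
LOC_MAP = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "e", "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts",
    "ч": "ch", "ш": "sh", "щ": "shch", "ъ": '"', "ы": "y", "ь": "'",
    "э": "e", "ю": "iu", "я": "ia",
}


def transliterate(text: str) -> str:
    """Romanize Russian Cyrillic; leave every other character untouched."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in LOC_MAP:
            latin = LOC_MAP[lower]
            # Uppercase Cyrillic letter: capitalize only the first Latin letter.
            if ch != lower:
                latin = latin.capitalize()
            out.append(latin)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("Путин"))
    print(transliterate("Россия"))
```

One known limitation of this sketch: words written in all caps come out mixed-case wherever a letter maps to more than one Latin character (Щ becomes "Shch", not "SHCH"), so real corpus text would need a little extra handling.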

mtm80 / russ-project

Week in Review #34