Project comment Feb. 28, 2018

mtm80 commented 6 years ago

In the past week, we completed markup of the Zhirinovsky and Putin files, finished researching MALLET, and merged our linguistic analysis tags into our schema. In the coming week, Ian plans to get a webpage up and running, I will write as much content for the homepage/about page as possible, and Richie will run Mystem on our documents and merge word-level analysis into our TEI structural markup.

richiebful commented 6 years ago

My hope is that I can finish merging word-level analysis (lemmas) into our TEI by Saturday, so we can start doing manual markup over break, using the categories written in our README (under Methodology). If y'all are interested in the technical side of how I used Mystem (after all, the official documentation is written in Russian), MALLET, or any of our other tech, I'm going to do a writeup at research/tech-methodology.md that details the steps needed to reproduce our analysis. I think I'll also do an English translation of the Mystem documentation at research/mystem-docs.md.

richiebful commented 6 years ago

Also, I really, really recommend that all groups use GitHub Projects. It helps break down an enormous undertaking into bite-size pieces, and it keeps you organized and accountable for your tasks.

brucknerp commented 6 years ago

Thank you for the recommendation/reminder about GitHub Projects! Our group has been kind of neglecting that tool, but I think this point in the process is an especially good time to start using it and keep our workflow organized.

Idi0teque commented 6 years ago

Whoa, what's MALLET? If you don't mind helping us, is it something that could be applied to our Magic Realism project? Also, are you guys still using TEI? How are ODD files working out?

djbpitt commented 6 years ago

@Idi0teque Start with https://programminghistorian.org/lessons/topic-modeling-and-mallet and http://dsl.richmond.edu/dispatch/

danakaufhold commented 6 years ago

Wow, you guys got a lot done this week! I totally agree about GitHub Projects; they've definitely helped Emily and me stay more on top of things. I also just took a look at your README; you guys have a crazy number of different linguistic tags! Your project is totally up my alley as a double linguistics/political science major, so I'm super interested in how you'll be marking up the speeches. I especially think you did a really good job identifying a bunch of propagandistic devices. It'll be cool to see how you use things like TEI to help you mark them in the text. I look forward to your webpage!

richiebful commented 6 years ago

@danakaufhold We decided to add a new TEI element for each of the listed terms. I'll just reprint one of those declarations below so you don't need to wade through the ODD.

<elementSpec mode="add" ns="http://ru-rhetoric.obdurodon.org/rr" ident="auth">
  <desc>Contains text that mentions authorities to support one's claims.</desc>
  <classes>
    <memberOf key="model.global.meta"/>
  </classes>
  <content>
    <textNode/>
  </content>
  <attList>
    <attDef ident="authority" usage="opt">
      <desc>A pointer to the authority cited</desc>
      <datatype>
        <rng:ref name="teidata.pointer"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>

This creates an element, auth, that can contain any text and has an optional attribute called authority that points to the authority cited (this authority is stored somewhere in the TEI header).

We had to use our own namespace because TEI doesn't like when we add to theirs, but that isn't anything too crazy.
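
To make the namespace point concrete, here is a minimal sketch of how an auth element might look in a marked-up paragraph and how it could be pulled back out with Python's standard library. The sample sentence, the rr prefix, and the "#expert1" pointer are invented for illustration and are not taken from our files; only the two namespace URIs are real.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"           # standard TEI namespace
RR_NS = "http://ru-rhetoric.obdurodon.org/rr"    # custom namespace from the elementSpec

# Hypothetical fragment: the text and the "#expert1" pointer are made up.
SAMPLE = f"""
<p xmlns="{TEI_NS}" xmlns:rr="{RR_NS}">
  <rr:auth authority="#expert1">As leading economists confirm</rr:auth>, the plan is working.
</p>
"""


def find_auth_spans(xml_text):
    """Return (authority pointer, text) pairs for every rr:auth element."""
    root = ET.fromstring(xml_text)
    # ElementTree addresses namespaced elements as "{namespace-uri}localname".
    return [(el.get("authority"), el.text) for el in root.iter(f"{{{RR_NS}}}auth")]


spans = find_auth_spans(SAMPLE)
print(spans)
```

Because auth lives in its own namespace, a query for it cannot accidentally match a TEI element that happens to share the name.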

richiebful commented 6 years ago

@Idi0teque If you're interested in using MALLET, I'm planning to run it over the weekend and post some documentation in our /research/ directory. I have some poorly written documentation there now, but I'd like to make sure my docs work and are less telegraphic before I go advertise them.

If you're trying to find topics (which the computer finds by assuming that words that often occur together are topically related), MALLET is definitely the tool you want. Thinking about topic modeling in the context of fiction gets me wondering: can topic modeling be used to pick up some information about the tropes or common plot elements in a story? Could you fingerprint an author based on the topics present in their work? Apparently, the answer to that one is yes, at least sometimes.

Using topic modeling also raises some practical concerns. Generally you want to use a list of stop words, that is, words so common that they could confuse a topic model (does "and" or "me" have a topic?). Character names, or in our case, politicians' names, should be stop words, or else you'll get the "Putin" topic. Also, we decided to run our topic model on lemmatized texts, since the Russian case system can produce many different forms of one lemma, potentially screwing up the model.
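
As a rough sketch of that preprocessing step (not our actual pipeline), stop-word filtering on already-lemmatized text could look like the following in Python. The stop-word sets and the lemma list are tiny hypothetical examples; in practice the lemmas would come from a lemmatizer such as Mystem and the stop list would be much longer.

```python
from collections import Counter

# Hypothetical stop lists, kept tiny for illustration.
GENERIC_STOP_WORDS = {"и", "в", "на", "я", "не"}   # "and", "in", "on", "I", "not"
NAME_STOP_WORDS = {"путин", "жириновский"}         # politicians' names


def filter_lemmas(lemmas, stop_words):
    """Drop every lemma whose lowercase form is in the stop-word set."""
    return [lemma for lemma in lemmas if lemma.lower() not in stop_words]


# Assumed lemmatizer output: inflected forms such as "экономики" or
# "экономику" would already be collapsed to the lemma "экономика".
lemmas = ["Путин", "и", "экономика", "в", "Россия", "экономика"]

kept = filter_lemmas(lemmas, GENERIC_STOP_WORDS | NAME_STOP_WORDS)
counts = Counter(kept)
print(kept)
print(counts.most_common())
```

Filtering after lemmatization means each stop word only has to be listed once, in its dictionary form, instead of once per case ending.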

gabikeane commented 6 years ago

Your GitHub Projects page is enough to make an organizational nerd cry (joyfully, to be clear). Great work, guys!

mtm80 / russ-project

Project comment Feb. 28, 2018 #47