mtm80 / russ-project


Apply Mystem to the Putin NBC interview #30

Closed by richiebful 6 years ago

richiebful commented 6 years ago

Due Monday, 26 February

Try out Mystem on the Putin interview, and write an XSLT stylesheet to merge the results back into the existing XML file. Try to make it general enough that any TEI interview that meets our spec can be transformed with it.

richiebful commented 6 years ago

@djbpitt I have a question about the XSLT stylesheet /util/gen-mystem-input.xslt. I'm trying to get all of the Putin utterances in the NBC interview, /xml/putin/intervyu-amerikanskomu-telekanalu-nbc.xml, as plain text, with each utterance separated by two newlines. I've been using lxml with Python to generate the output (shown in /util/output.txt), but I also tried opening the file in the browser with an XSLT association. Every time, the output is just a dump of all the text nodes in the document, in order. Do you have any idea what's going on?

richiebful commented 6 years ago

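Never mind, I figured it out. I forgot that XSLT's built-in template rules copy every text node to the output by default, so anything my templates didn't match still came through as plain text.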
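
For anyone who hits the same thing: in XSLT, the built-in rule for elements applies templates to their children and the built-in rule for text nodes copies the text, so a stylesheet whose own templates never match still prints the whole document's text. Below is a minimal sketch of the intended extraction done directly in Python with the standard library instead of XSLT. It is not the contents of /util/gen-mystem-input.xslt. The sample document, the `#putin` speaker value, and the assumption that utterances are TEI `<u>` elements with a `who` attribute are all hypothetical and may not match the project's spec.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; selections that omit it silently match nothing.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature interview; the real file's markup may differ.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u who="#kelly">Господин Президент, <hi>спасибо</hi>.</u>
    <u who="#putin">Добрый день.</u>
    <u who="#putin">Это <hi>важный</hi> вопрос.</u>
  </body></text>
</TEI>"""


def utterances(xml_text, speaker):
    """Return the plain text of each <u> whose who attribute equals speaker."""
    root = ET.fromstring(xml_text)
    selected = root.findall(f".//tei:u[@who='{speaker}']", TEI_NS)
    # itertext() flattens nested inline markup such as <hi> into plain text.
    return ["".join(u.itertext()).strip() for u in selected]


# One utterance per block, separated by two newlines.
plain_text = "\n\n".join(utterances(SAMPLE, "#putin"))
print(plain_text)
```

Only the selected speaker's utterances reach the output here because nothing is emitted unless it is explicitly selected, which is the opposite of XSLT's default behavior.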

richiebful commented 6 years ago

I ended up flattening the Mystem output and then pasting the result back into the TEI manually, since the alternative would require merging two XML files, which would be messy. Will commit ASAP.

djbpitt commented 6 years ago

@richiebful Sometimes messy is cool, and sometimes you just want to get the job done. So flattening and pasting manually is legitimate, but let me know if you’d like to try the merge.
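
If the merge ever becomes worth automating, here is one rough sketch of what it could look like in Python rather than XSLT. It assumes the analysis for an utterance is available as a list of tokens, each with a `text` field and, for words, an `analysis` list whose first entry carries the lemma under `lex`. That shape is modeled on Mystem's JSON output as I understand it, and the sample tokens, the use of `<w lemma="...">`, and the function name `tag_utterance` are hypothetical and not part of the project's spec. The TEI namespace is left out to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hypothetical analysis of one utterance (not real Mystem output).
TOKENS = [
    {"text": "Добрый", "analysis": [{"lex": "добрый"}]},
    {"text": " "},
    {"text": "день", "analysis": [{"lex": "день"}]},
    {"text": "."},
]


def tag_utterance(u, tokens):
    """Rebuild the content of u as <w lemma="..."> elements plus plain text.

    Caveat: any inline markup already inside u is discarded.
    """
    for child in list(u):
        u.remove(child)
    u.text = None
    last = None  # most recently added <w>, which carries trailing text
    for token in tokens:
        analysis = token.get("analysis")
        if analysis:
            # A word: wrap it and record the first proposed lemma.
            last = ET.SubElement(u, "w", lemma=analysis[0]["lex"])
            last.text = token["text"]
        elif last is None:
            # Spaces or punctuation before the first word.
            u.text = (u.text or "") + token["text"]
        else:
            # Spaces or punctuation after a word.
            last.tail = (last.tail or "") + token["text"]
    return u


u = ET.fromstring('<u who="#putin">Добрый день.</u>')
tag_utterance(u, TOKENS)
print(ET.tostring(u, encoding="unicode"))
```

The text of the utterance is unchanged after tagging, so comparing the flattened text before and after is a cheap check that the tokens were aligned with the right utterance.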