peterbussch / stalinletters

We are a group from SLAV 1050: Computational Methods in the Humanities at the University of Pittsburgh. We are creating a public-facing website that aims at expanding the reach of a valuable set of historical documents.

Project Update #3 #4

Open hcasazza opened 3 years ago

hcasazza commented 3 years ago

During our meeting today, we focused primarily on using regex to remove the commentary from Stalin's letters. Since we started using XPath in class today, we also used it briefly to look up the different elements in our XML. We then removed the commentary with regex, so the XML now contains only the content of Stalin's letters. An interesting regex feature we used along the way was "lookaround," which covers both lookahead and lookbehind. It let us match from our start tag <strong> through everything up to our end tag </letter> without including </letter> in the match. (If you are interested in this but need a better explanation, go to http://www.regular-expressions.info/lookaround.html.) This approach also became a helpful tool as we moved on to adding new attributes and elements to our XML. We finished by reviewing how to use Git Bash to push our changes to GitHub. Overall, we now feel we are in a good position to focus on what we would like to research with Stalin's letters.
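For anyone who wants to see the lookahead idea in action, here is a minimal Python sketch rather than our actual find-and-replace. It assumes each block of commentary begins at a <strong> tag and runs to the end of its letter; the sample text and the `strip_commentary` name are made up for illustration.

```python
import re

# Hypothetical sample markup; the real project files may look different.
sample = (
    "<letter>Dear comrade, the text of the letter.\n"
    "<strong>Editor's note</strong> Commentary added by the editors.</letter>"
    "<letter>A second letter with no commentary.</letter>"
)

# Match from <strong> lazily up to the position just before </letter>.
# (?=</letter>) is a lookahead: it requires the end tag to follow,
# but does not make it part of the match, so the tag is left in place.
# re.DOTALL lets "." cross line breaks inside a letter.
COMMENTARY = re.compile(r"<strong>.*?(?=</letter>)", re.DOTALL)


def strip_commentary(xml_text: str) -> str:
    """Delete every commentary span, keeping each </letter> end tag."""
    return COMMENTARY.sub("", xml_text)


cleaned = strip_commentary(sample)
print(cleaned)
```

One caveat with this sketch: if <strong> is ever used for emphasis inside the body of a letter, the pattern would delete from that point to the end of the letter, so it is worth checking the matches before replacing.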

ajm324 commented 3 years ago

It's great that you guys are at a place to integrate the new technologies we have learned into your markup. Thanks for including the information on lookaround; it was interesting to read and could definitely be helpful later on. I think it is smart to narrow down your text source to the specific sections that will be most applicable to your research. Just out of curiosity, what was the source of the commentary that was removed?

peterbussch commented 3 years ago

@ajm324 The commentary came from the same .txt file of the book we are using as the source for the corpus of Stalin's letters. Here is a link to the book, in case you're interested. Keep in mind that this is the Russian print version. Here's a link to a comparable English-language version. Thanks for your comment!

glh32 commented 3 years ago

I think it is very impressive that you all are already making a plan to integrate XPath into your project, even if you do not necessarily begin to use it right away. It is also great that you found new uses for regex on your own that will help with your project's specific needs. Personally, for our project, we have not really found any need to use regex, but that may be because I have only been considering what we have done in class and for the homework. If I do more of my own research on the technologies we use in class, I will probably find a wider range of uses for them.