peterbussch / stalinletters

We are a group from SLAV 1050: Computational Methods in the Humanities at the University of Pittsburgh. We are creating a public-facing website that aims at expanding the reach of a valuable set of historical documents.

Project Update #5 #7

Open peterbussch opened 3 years ago

peterbussch commented 3 years ago
  1. This week, we worked on updating the schema, marking up the various names featured in the correspondence, and laying the framework for our HTML/web design.
  2. The regular expressions we used for the auto-tagging were all quite similar to one another, since each had to match the root of a Russian name plus its various case endings. For those who are unfamiliar: Russian words take a variety of endings depending on their case and context.
  3. This led us to a rather simple formula for auto-tagging. Here is Молотов as an example: find молот[ЁёА-я]* and replace with <person who="Molotov">\0</person>
  4. We repeated the process with the following patterns: И.стал[ЁёА-я]*; \sстали[ЁёА-я]* (to find instances of Сталин when it is not written as И.Сталин); бухар[ЁёА-я]*; троц[ЁёА-я]*; Дзержинск[ЁёА-я]*; зинов[ЁёА-я]*; рык[ЁёА-я]*; камен[ЁёА-я]*; Томск[ЁёА-я]*; Рудзу[ЁёА-я]*; Лозовск[ЁёА-я]*; Калини[ЁёА-я]*
  5. The schema was updated to include these names.
  6. Ленин.{1,4}\s was a little different, because we had to account for Ленинград, which will be marked up separately along with other place names.
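
For anyone who wants to try the same stem-plus-endings idea outside an XML editor, here is a minimal Python sketch. It is only an illustration under assumptions: the STEMS mapping and the tag_people function are hypothetical names, only who="Molotov" comes from the list above (the other who values are guesses), and case-insensitive matching is assumed so that a lowercase stem also matches a capitalized name. Python refers to the whole match as m.group(0) rather than \0.

```python
import re

# Hypothetical mapping from a name stem to a value for the who attribute.
# Only "Molotov" appears in the post; the other values are assumptions.
STEMS = {
    "молот": "Molotov",
    "бухар": "Bukharin",
    "троц": "Trotsky",
}


def tag_people(text: str) -> str:
    """Wrap every stem-plus-ending match in a <person> element."""
    for stem, who in STEMS.items():
        # The stem followed by any run of Cyrillic letters (the case ending).
        # Ё and ё are listed separately because they fall outside А-я.
        pattern = re.compile(stem + r"[ЁёА-я]*", re.IGNORECASE)
        text = pattern.sub(
            lambda m, who=who: f'<person who="{who}">{m.group(0)}</person>',
            text,
        )
    return text


if __name__ == "__main__":
    print(tag_people("Письмо Молотову и Бухарину."))
```

A bare stem can over-match (молот would also match молоток, "hammer"), so the output of a pass like this still needs to be checked by hand.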
raisedDeadWizard commented 3 years ago

I think it is really awesome that you guys are using find and replace to speed up your markup. Our group was trying to find something similar to make character tagging a little easier in our tales, but alas, we couldn't find anything consistent that marked everything, since characters are referred to by different names. We also began working on our website. So far it is mostly planning, but we are working on it nonetheless. How is your group planning on structuring your project site? Is there a site we've looked at in class that you're planning on using as an example? I highly recommend doing that: going and looking at older projects made our organization questions a lot easier to answer.

djbpitt commented 3 years ago

@jeepy33 Autotagging text that refers to persons who can be specified in different ways is challenging for exactly the reason you mention: different words may refer to the same person. Peter's strategy is a sound one: find a regex that matches the ways we can refer to a person and use that to tag the names. The Leningrad situation (it contains the personal name "Lenin" as a substring) might be managed by tagging Leningrad first, e.g., <place what="city" name="Leningrad">Leningrad</place>, and then restricting the scope of subsequent replacements by using the XPath widget in the find-and-replace dialog. The key detail is to tag the larger string first (so "Leningrad" before "Lenin") and then use the markup you've introduced to specify the XPath context for the next replacement (e.g., "Lenin") in a way that excludes the elements you've already tagged. For example, if you've already tagged for paragraphs, your XPath expression could say that you tag "Lenin" only when it's in a text-node child of <p>, which implicitly means that it won't be tagged if it's in a <place> inside a <p>, since then it would not be a child of <p>.
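
The "larger string first, then restrict the context" order can be sketched outside the editor as well. The Python below is a rough illustration using only the standard library, not the find-and-replace dialog itself: tag_in_text_children and tag_paragraphs are hypothetical helper names, who="Lenin" is an assumed attribute value modeled on who="Molotov", and "a text-node child of <p>" is approximated as the paragraph's own text plus the tails of its child elements.

```python
import re
import xml.etree.ElementTree as ET

LENINGRAD = re.compile(r"Ленинград[ЁёА-я]*")
LENIN = re.compile(r"Ленин[ЁёА-я]*")


def tag_in_text_children(parent, pattern, tag, attrib):
    """Wrap matches only in text that is a direct child of parent.

    Text inside existing child elements (e.g. <place>) is left untouched.
    """
    old_children = list(parent)
    # Direct text children: parent.text, then the tail after each child.
    texts = [parent.text or ""] + [c.tail or "" for c in old_children]
    for child in old_children:
        parent.remove(child)
        child.tail = None
    parent.text = None
    last = None  # most recently appended child; None means "before any child"

    def add_text(s):
        if not s:
            return
        if last is None:
            parent.text = (parent.text or "") + s
        else:
            last.tail = (last.tail or "") + s

    for i, text in enumerate(texts):
        pos = 0
        for m in pattern.finditer(text):
            add_text(text[pos:m.start()])
            elem = ET.Element(tag, attrib)
            elem.text = m.group(0)
            parent.append(elem)
            last = elem
            pos = m.end()
        add_text(text[pos:])
        if i < len(old_children):
            # Put the original child back in its original position.
            parent.append(old_children[i])
            last = old_children[i]


def tag_paragraphs(root):
    paragraphs = list(root.iter("p"))
    # Step 1: the larger string first.
    for p in paragraphs:
        tag_in_text_children(p, LENINGRAD, "place",
                             {"what": "city", "name": "Leningrad"})
    # Step 2: "Lenin" only in text-node children of <p>, which skips <place>.
    for p in paragraphs:
        tag_in_text_children(p, LENIN, "person", {"who": "Lenin"})


if __name__ == "__main__":
    doc = ET.fromstring("<doc><p>Ленин приехал в Ленинград.</p></doc>")
    tag_paragraphs(doc)
    print(ET.tostring(doc, encoding="unicode"))
```

Because the second pass never looks inside <place>, the "Ленин" at the start of "Ленинград" is not tagged as a person, which is the effect the XPath restriction is meant to achieve.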