peterbussch / stalinletters

We are a group from SLAV 1050: Computational Methods in the Humanities at the University of Pittsburgh. We are creating a public-facing website that aims at expanding the reach of a valuable set of historical documents.

Basic Structural Regex #2

Open peterbussch opened 3 years ago

peterbussch commented 3 years ago

- To begin with, I cut everything before and after the actual text of the letters (примечания, "notes") and pasted it into separate documents to make the regex work easier. We will use that material later.
- As we talked about during our meeting, the "№" character is distinctive and appears before the start of every letter.
- However, it is also sometimes used in footnotes to refer to other letters, so I did a "find all" and manually changed the occurrences that did not mark the beginning of a letter to "номер" ("number") in order to preserve that information.
- At this point, I inserted a root element <corpus> and its corresponding end-tag. Oxygen told me that the "content of the element must consist of well-formed data," so something was breaking it. I found the error: "«" had been OCR'd as "<<" at several points, which breaks well-formedness.
- I did a search for both \< and \> and replaced them with "«" and "»", respectively. Some of these look like printing errors, but that should become clearer as we start analyzing the letters more closely.
- After this, I put together the regex for separating the letters based on the "№" character. I used a lookahead for this, which is something I used for our regex assignments earlier. The find expression was (?>№)(.*?)(?=№), which worked on the first try. Note that (?>№) is an atomic group, not a lookbehind, so the "№" is consumed as part of the match; only (?=№) is a lookaround. I used the replacement expression <letter>\0</letter> because I wanted to preserve the letter numbers so we can change the formatting later on (if we want to).
- I decided to emphasize the "примечания" section of each letter to make it easier to separate from the text of the letter. I used a simple find expression, примечания, and replaced it with <strong>Примечания</strong> to make it bold.
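For anyone who wants to reproduce the letter-splitting step outside Oxygen, here is a minimal Python sketch. It is an adaptation, not the exact expression used above: Python's `re` module only supports atomic groups from version 3.11, so the sketch matches a plain "№" instead, and it adds `\Z` to the lookahead so that the final letter (which has no following "№") is also wrapped. The sample string and the names `sample`, `LETTER`, and `wrap_letters` are hypothetical.

```python
import re

# Hypothetical sample standing in for the OCR'd text; not taken from the corpus.
sample = "№1 First letter.\nПримечания\nA note.\n№2 Second letter.\n"

# A plain "№" is consumed as the start of each match. The lookahead stops the
# lazy match just before the next "№", or at the end of the input so that the
# last letter is not dropped. DOTALL lets "." cross line breaks.
LETTER = re.compile(r"№.*?(?=№|\Z)", re.DOTALL)


def wrap_letters(text: str) -> str:
    """Wrap each run of text that starts at a "№" in <letter> tags."""
    # m.group(0) is the whole match, the equivalent of \0 in the replacement above.
    return LETTER.sub(lambda m: f"<letter>{m.group(0)}</letter>", text)


wrapped = "<corpus>" + wrap_letters(sample) + "</corpus>"
print(wrapped)
```

Because the whole match is reused in the replacement, the "№" and the letter number stay inside each <letter> element, as in the original step.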

hcasazza commented 3 years ago

- Used regex to remove the commentary.
- First, removed the <strong> element and the text up until </letter>. Turn on "case sensitive" and "dot matches all".
  Find: <strong>.*?(?=</letter>)
  Replace: (empty)
- Second, added an attribute to each letter.
  Find: <letter>№(.*?)(?=\s)
  Replace: <letter id = "\1">
- Lastly, added date tags.
  Find: (?<=>\s)[\[\(]?(\d{1,2}.*?(\d{4})\sг\.\]?)
  Replace: <date year = "\2">\1</date>
- One error will show up: manually go to <letter id = "48">, remove <date year = "1929">, and insert <date year = "1929"> after <letter id = "49">.
- Keep in mind that the dates we tagged were the ones that give a specific date for when the letter was written. We will need to go back and decide what we want to do with dates that are less precise.
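The three find-and-replace steps above can be sketched in Python as follows. The patterns are copied from the steps; everything else (the sample fragment, the names `sample`, `strip_notes`, `add_ids`, and `tag_dates`) is hypothetical, and the sample assumes the letter number follows "№" directly and the date line follows the opening tag after a single line break. Python's `re` is case-sensitive by default, and `re.DOTALL` plays the role of "dot matches all".

```python
import re

# Hypothetical fragment in the shape the steps above expect; not real corpus text.
sample = (
    "<letter>№5\n12 июля 1925 г.\nBody text.\n"
    "<strong>Примечания</strong>\n1. A note.</letter>"
)


def strip_notes(text: str) -> str:
    """Step 1: delete everything from <strong> up to (not including) </letter>."""
    return re.sub(r"<strong>.*?(?=</letter>)", "", text, flags=re.DOTALL)


def add_ids(text: str) -> str:
    """Step 2: move the letter number into an id attribute."""
    return re.sub(r"<letter>№(.*?)(?=\s)", r'<letter id = "\1">', text, flags=re.DOTALL)


def tag_dates(text: str) -> str:
    """Step 3: wrap the date after a tag in <date>, with the year as an attribute."""
    pattern = r"(?<=>\s)[\[\(]?(\d{1,2}.*?(\d{4})\sг\.\]?)"
    return re.sub(pattern, r'<date year = "\2">\1</date>', text, flags=re.DOTALL)


tagged = tag_dates(add_ids(strip_notes(sample)))
print(tagged)
```

With dot-matches-all enabled, the `.*?` in the date pattern can run across line breaks until it finds a four-digit year followed by "г.". That is one possible reason for a misplaced tag like the one around letters 48 and 49, though this is a guess and not something checked against the corpus.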