Error in Understanding Regular Expressions

tanseyem commented 8 years ago

I am receiving an error with the following step in the Understanding Regular Expressions lesson (http://programminghistorian.org/lessons/understanding-regular-expressions). The problem is with this instruction:

Next, again using find-and-replace,

replace all $ (just a dollar sign) with nothing.

There are 225 replacements with this pattern. At first it may not be clear what happened here, but this has in fact made each paragraph a single paragraph or logical line.

However, when I get to this step, I get a dialog that 290 replacements were made. And in fact, it seems to remove all breaks to make one long running block of text, instead of (what I think is the intent of) removing only line breaks and retaining paragraph breaks.

Can you please advise so I can proceed with the lesson? I have been spinning my wheels trying to figure out a solution, otherwise I would suggest a fix.

Here is a screenshot: 2016-01-27_error

wcaleb commented 8 years ago

@acrymble Have we contacted the author to take a look at this? (Sorry for the slow response, @tanseyem.)

acrymble commented 8 years ago

Sorry, I just sent him an email.

knoxdw commented 8 years ago

Thanks for this report, @tanseyem. I am sorry not to have seen this earlier.

The instructions in the lesson are out of date. In current versions of LibreOffice Writer, a '$' at the end of a longer regular expression anchors that pattern to the end of a paragraph but does not match the paragraph break itself. When we replace such a match with nothing, the breaks are still there, which you correctly understood was not the intended result.
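For readers who want to see the anchoring behaviour outside LibreOffice, here is a rough analogy using Python's `re` module. This is only an illustration: Python is not what the lesson uses, the sample string is made up, and LibreOffice's engine may differ in details. It shows that an end-of-line `$` matches a position rather than the break character.

```python
import re

# Hypothetical sample: a word hyphenated across a line break.
text = "tuber-\nculosis"

# With re.MULTILINE, `$` matches the zero-width position just before each
# newline. The hyphen is removed, but the newline itself is untouched.
anchored = re.sub(r"-$", "", text, flags=re.MULTILINE)

# Matching the newline explicitly is what actually joins the two halves.
explicit = re.sub(r"-\n", "", text)

print(repr(anchored))
print(repr(explicit))
```

In this sketch the first substitution leaves the line break in place, and only the second closes up the word.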

However, in current versions of LibreOffice we can replace paragraph breaks themselves as long as the pattern to find consists just of '$' and nothing else. So one strategy we can use is to replace the paragraph breaks with some new character not already in the text, then carry out any matches that need to cross lines using that substitute character instead, and finally replace again to put the line breaks back once we have them fixed up.

Specifically, noting that there are no instances of '#' in our text, let's use that as a temporary stand-in for line breaks.

1) With regular expressions turned on, replace "$" with "#".

2) Replace "- #" (hyphen-space-hash) with nothing. This will close up "tuber-" and "culosis" on separate lines into "tuberculosis" on one line.

3) Replace "##" with "\n". This will treat double line breaks as breaks between records, separating things out into rows when we later paste into LibreOffice Calc.

4) Replace "#" with " " (a single space). This will get rid of line breaks that were not paragraph breaks in the original text.
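The four steps above can also be sketched in Python, as a way to check the logic on a small sample before trying it in LibreOffice. This is an approximation under stated assumptions: the function name `rejoin_lines` and the sample text are hypothetical, and each Writer paragraph break is modelled as a `"\n"` character.

```python
def rejoin_lines(text: str, placeholder: str = "#") -> str:
    """Mimic the four find-and-replace steps with a temporary stand-in character."""
    # The strategy only works if the stand-in does not already occur in the text.
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text")

    # Step 1: replace every break with the placeholder.
    text = text.replace("\n", placeholder)
    # Step 2: close up words hyphenated across a break (hyphen-space-placeholder).
    text = text.replace("- " + placeholder, "")
    # Step 3: a doubled placeholder marks a break between records.
    text = text.replace(placeholder * 2, "\n")
    # Step 4: any remaining placeholder was an ordinary line break; use a space.
    text = text.replace(placeholder, " ")
    return text


# Hypothetical sample with one hyphenated word and one blank line between records.
sample = "Deaths from tuber- \nculosis rose\nin 1908.\n\nSecond record\ncontinues here."
result = rejoin_lines(sample)
print(result)
```

The order matters: the doubled placeholder has to be handled before the single one, otherwise the record breaks would be turned into spaces.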

At this point I think it should be possible to pick up again with the section "Finding Structure in Columns" and continue from there.

Software and programming languages that handle regular expressions often differ from each other (and sometimes from their own earlier versions) in how they handle line breaks. It's not entirely surprising that this part was fragile, but I regret that this issue caused difficulties, particularly so early in the lesson.

ianmilligan1 commented 8 years ago

Resolved in #236! Thanks @knoxdw and @tanseyem!

programminghistorian / jekyll

Error in Understanding Regular Expressions #179