Phrase frequency analysis ignores punctuation, treats contractions as multiple words

amconners commented 2 years ago

When using the Frequency Analyzer to analyze phrase frequency, punctuation isn't taken into account.

The last word of one sentence and the first word of a new sentence constitute a valid phrase according to the software, though the period between them won't show up in the frequency analysis at all. I'm not sure if ignoring sentence boundaries is the intended behavior or not but either way the punctuation should at least be displayed.

This also means contractions are treated as if they're two words. In the analysis results, the apostrophe is replaced with a space; e.g. "couldn't" shows up as "couldn t", which also has the side effect of causing the phrase length setting to give incorrect results. For instance, setting the minimum number of words to 3 will still include the phrase "weren't here" (showing up as "weren t here", of course).

Tested on the package repository version of Manuskript 0.13.1 on Arch Linux.

jdanielp commented 2 years ago

I think it will be helpful to set expectations before I provide a fix.

If there is a 4-word phrase "george ate the banana" it could show up in the following ways:

A: george ate the banana. B: After school george ate. the banana was tasty.

Should both A and B count towards the same phrase?

Or, should there be two phrases P1 and P2, counted separately, with the same words, but also not identical because the period? P1: "george ate. the banana" P2: "george ate the banana"

I think we should have a radio button on the Phrase Frequency Analyzer with choices for how to handle punctuation as well as a field for the user to specify the punctuation. Then, we can handle both "expected" behaviors. For P2, I think we need to include the "." punctuation and post-processes the result of re.findall() to remove the single entries of "." and append the "." to the word it follows. See ...\ui\tools\frequencyAnalyzer.py line 49 and 50.

I'll play around with this, but your feedback would be appreciated on the ideas I've outlined above.

jdanielp commented 2 years ago

I have a fix that produces the following behavior:

You'll notice that there are three different phrases with the words "of the world", each with different punctuation (rows 2, 7, and 10).

If this is OK, I will start on modifying the UI to add the options.

Edit: Added row references for clarity.

amconners commented 2 years ago

My expectation would be that B should not count at all, since it's the last part of one phrase and the first part of another. (When I think of a phrase in my writing, I think of it as a smaller part of a sentence or clause.) But as your comment indicates there's different expectations besides just this one, and I don't think any are better or worse than the others.Your radio button idea sounds like a good one but I'd suggest a third option for it, i.e. being able to choose to count phrases with punctuation as the same phrase, as a different phrase, or not at all.Another way to do it would be to have multiple text boxes for the settings so different punctuation could be handled differently (someone could want to see phrases that cross comma or semicolon boundaries as different phrases, phrases with an ellipsis as one phrase, and not have phrases across periods count at all, for example). And a fourth box to define what punctuation doesn't make something count as two words that defaults to having the apostrophe in it would help with the contraction bug but I think it could also be useful for people with differing opinions on how to count hyphenated words. I'm not sure if it's worth the effort in practice but I think it's at least worth putting out there for discussion.

Edit: Your solution seems like a step in the right direction regardless of what other behaviors we also want to include, so I'd say go ahead and modify the UI unless you think the text box idea merits further discussion.

jdanielp commented 2 years ago

Thanks for the feedback! I like your ideas about different punctuation having different behaviors. I have enough to get started on a prototype. I'll have to look at the word count / contraction bug also before I get too deep into this issue.

amconners commented 2 years ago

If there's a way to treat an apostrophe as a letter for the purposes of the frequency analysis, I think that would take care of the contraction bug, but without familiarity with the code I can't be sure.But then you'd have to consider what it'd do with something ending in punctuation that's enclosed in single quotes, so it might just be trading one problem for another. Maybe if there's a way to count it as a letter if it's preceded and followed by letters... I guess at this point I'm just brainstorming.Thanks for working on the implementation, though!

olivierkes / manuskript

Phrase frequency analysis ignores punctuation, treats contractions as multiple words #997