vkbo / novelWriter

novelWriter is an open source plain text editor designed for writing novels. It supports a minimal markdown-like syntax for formatting text. It is written with Python 3 (3.9+) and Qt 5 (5.15) for cross-platform support.
https://novelwriter.io
GNU General Public License v3.0

Readability metrics #712

Open jyhelle opened 3 years ago

jyhelle commented 3 years ago

I know the subject is a matter of discussion, and many people (including myself) give little importance to the Flesch, Gunning fog, SMOG, or similar methods of readability assessment. However, there are publishers, especially in children's and teen literature, who are very attached to this kind of tool and expect to receive the figures along with the manuscript. (I have seen a target readability index specified in several translation contracts.) The computation can be done manually, by counting sentences, words, and syllables, but it is very tedious, while a computer could do it in seconds. So it might be worth envisioning including something in a future version of nW.

vkbo commented 3 years ago

I already have code for this that I wrote a couple of years ago. It is for English only, as counting syllables is not trivial and has to be tailored to each language. I arrived at an algorithm that counts syllables fairly accurately without looking up each word in a dictionary, which is the only truly accurate method. My code calculates the Flesch–Kincaid Readability Score.

The code was once in one of the utility files of nW, but I took it out since it wasn't used. I've considered putting it back in. Thanks for the reminder!

jyhelle commented 3 years ago

My level of confidence in that sort of "metrics" is such that I would say as long as it counts syllables in English, it would probably be OK for French as well. These are statistical-empirical tools, and measuring different parts of a homogeneous text gives (similar but) different results, so a good approximation might be sufficient.

Any promoter or advocate to speak up?

vkbo commented 3 years ago

The English function I wrote does not work for Norwegian, for instance; I tested that when I wrote it, just to see.

A decent first approach is to count diphthongs and single vowels. The challenge in English is that the letter 'y' acts as a consonant or a vowel depending on context, and there are silent vowels in word endings that need to be accounted for.

The second issue is of course that the grading metric is designed for English specifically, and unless you want to just extract the fairly esoteric score, you need a grading scale for other languages too. They can of course easily be added if they exist.

I'm wondering how much work it would be to build a training set for this problem and then train a simple supervised model on a full dictionary.

johnblommers commented 3 years ago

Every time I run a readability tool over my writing, for example Hemingway, I'm reminded that:

  1. Hemingway has no Linux version
  2. I dislike uploading my work to untrusted servers
  3. My writing is too wordy

So a readability score displayed next to the word counter would be a welcome addition to novelWriter.

It would be nice if novelWriter had a switch to color sentences that are hard to read or use passive voice, which is what the Hemingway app does. But I'm afraid that might be a huge distraction from novelWriter's main reason for existing.

vkbo commented 3 years ago

Those are pretty advanced language analysis features, and probably also too heavy to run with Python. I am not a fan of tools like Hemingway anyway. While they help with grammar, which is useful, they also have a feel of over-teching writing to the point of generating uniformity and conformity. That's not a good thing in my book.

The readability score is a more neutral metric because it gives you a number without any judgement on whether it is good or bad; it only needs to match the target audience's reading level.

My main concern about adding it is that it adds a tool that is language-specific. I want nW to be less English-oriented. English is not my own first language either. If we can collaborate to make it a bit more language-independent, then I'm all for adding a field next to the word counter that lists the values and the total.

The Flesch–Kincaid score is based on average syllables per word and words per sentence, so keeping track of those two values throughout the project isn't too hard, and I can add this to the indexer. Syllable counting is still the trickier part to generalise.
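Since the score only needs those two averages, the scoring step itself is tiny once the counts exist. A sketch using the standard Flesch reading ease and Flesch–Kincaid grade level formulas (both calibrated for English):

```python
def flesch_kincaid(sentences: int, words: int, syllables: int) -> tuple[float, float]:
    """Return (reading ease, grade level) from raw text counts.

    Both formulas use only average words per sentence and average
    syllables per word, and are calibrated for English text.
    """
    wps = words / sentences  # average words per sentence
    spw = syllables / words  # average syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade = 0.39 * wps + 11.8 * spw - 15.59
    return ease, grade
```

A higher ease value means easier text, while the grade value maps roughly to a US school grade level.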

vkbo commented 3 years ago

Just for reference, my approach to programmatically counting syllables in English is a set of vowel-group rules with special cases.

It is accurate enough that the mistakes it makes don't affect the final score. One issue is to determine exactly when 'y' makes a vowel sound.
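Roughly, such a vowel-group heuristic looks like this (a simplified sketch, not the exact code that was in nW; the 'y' and silent-ending rules in particular are cruder here):

```python
import re

def count_syllables(word: str) -> int:
    """Rough English syllable count: each maximal run of vowels
    (including 'y') is one syllable nucleus, a trailing 'e' is
    usually silent, and every word has at least one syllable.
    """
    word = word.lower().strip(".,;:!?\"'")
    if not word:
        return 0
    # Count maximal vowel runs as syllable nuclei
    count = len(re.findall(r"[aeiouy]+", word))
    # A final 'e' is usually silent ('make', 'note'), unless it is
    # the word's only vowel ('the') or part of an '-le' ('table')
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)
```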

We'd need a similar set of rules for each supported language, and I have no idea how well any of this would work for a non-European language.

jyhelle commented 3 years ago

There was a tool some years ago that used the OpenOffice hyphenation dictionaries to overcome the multi-lingual problem...

Edit: it still exists and has even evolved, now using the LibreOffice dictionaries. Have a look at PyHyphen on PyPI. It is primarily a hyphenation tool, which I used to format verses in poetry booklets, but it can also return syllables.

vkbo commented 3 years ago

Yeah, this is what I was referring to when I mentioned that the only accurate method is to look up words (although hyphenation dictionaries don't always produce the correct syllable count). I researched this back when I wrote that code. I'm reluctant to depend on external tools hosted on PyPI, as such packages aren't always very dependable; I've had enough problems with pyenchant, which I use for spell checking.

Perhaps the hyphenation package can be used to train a machine learning implementation though. It's an interesting topic in general.

Edit: Another alternative is to write a small module to parse the LibreOffice hyphenation dictionaries directly, like Pyphen does.
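The LibreOffice files contain Liang-style TeX hyphenation patterns, so the core logic is small. A minimal sketch, hardcoding the classic example patterns from Liang's thesis instead of reading them from a hyph_xx_XX.dic file:

```python
import re

# The classic pattern subset that hyphenates "hyphenation"; a real
# module would load the full list from a LibreOffice .dic file
PATTERNS = [".hy3ph", "he2n", "hena4", "hen5at",
            "1na", "n2at", "1tio", "2io", "o2n"]

def parse_pattern(pat):
    """Split 'he2n' into letters 'hen' and inter-letter values [0, 0, 2, 0]."""
    letters = re.sub(r"\d", "", pat)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            values[pos] = int(ch)
        else:
            pos += 1
    return letters, values

def hyphenate(word, patterns=PATTERNS):
    """Liang's algorithm: apply all matching patterns, keep the
    maximum value at each letter gap, break where the value is odd."""
    work = "." + word.lower() + "."
    gaps = [0] * (len(work) + 1)
    for pat in patterns:
        letters, values = parse_pattern(pat)
        start = work.find(letters)
        while start != -1:
            for i, val in enumerate(values):
                gaps[start + i] = max(gaps[start + i], val)
            start = work.find(letters, start + 1)
    pieces, last = [], 0
    for m in range(2, len(word) - 1):  # keep >= 2 letters on each side
        if gaps[m + 1] % 2:            # odd value: break allowed here
            pieces.append(word[last:m])
            last = m
    pieces.append(word[last:])
    return pieces
```

A real module would read the patterns from the dictionary file (whose first line typically states the character encoding, and which can also set left/right minimum lengths), but the matching logic stays the same, and `len(hyphenate(word))` gives the syllable estimate we're after.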

jyhelle commented 3 years ago

Yes, I just wanted to show it as a means to bridge the multiple-languages issue, and my idea was along the lines of the last alternative you added. We don't need a full hyphenation package, and the syllables command of PyHyphen returns the individual syllables where we only need their number. So the idea was to look over the code of PyHyphen, or maybe other similar tools, and understand how we could get what we need from those dictionaries. But I have strictly no idea about the programming complexity that entails.

vkbo commented 3 years ago

A benefit of that approach is that the hyphenation dictionaries are available for download, so nW could automatically download the ones the user wants.

Looking at the Pyphen package, the parsing isn't that complicated. Since Pyphen is written in pure Python, it's also a good option to just include it in a lib folder in nW instead of depending on an external package.

vkbo commented 3 years ago

As for the suggestion by @johnblommers on integration with grammar tools like Hemingway, there's a feature issue #515 on this where the discussion is to integrate with LanguageTool, which is open source. Since you can run a local language server for free, or optionally connect to a hosted one, this may be a good fit for novelWriter. Especially since it supports multiple languages.

As for the metrics part, I will have a look at adding the syllable algorithm for English to the indexer. Perhaps if I add a proper class for syllable calculations we could try out multiple approaches. However, the very idea of counting syllables to determine complexity doesn't translate well outside of European languages, and not necessarily to Germanic languages either, where multiple words are often joined into long compounds that English would write with spaces or hyphens. These words aren't necessarily more complex or harder to read just because they contain many syllables. The practice is very common in Norwegian, and of course infamous in German.
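For instance, a syllables module could simply map language codes to counter functions, so alternative algorithms can be dropped in per language (all names here are hypothetical, a sketch only):

```python
import re
from typing import Callable, Dict

# Hypothetical registry: language code -> syllable counter function
SyllableCounter = Callable[[str], int]
_COUNTERS: Dict[str, SyllableCounter] = {}

def register_counter(lang: str):
    """Register a syllable counter under a language code."""
    def wrapper(func: SyllableCounter) -> SyllableCounter:
        _COUNTERS[lang] = func
        return func
    return wrapper

@register_counter("en")
def _english(word: str) -> int:
    # Stand-in for the real English heuristic: count vowel groups
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def count_syllables(word: str, lang: str = "en") -> int:
    """Dispatch to the counter for `lang`, falling back to English."""
    return _COUNTERS.get(lang, _COUNTERS["en"])(word)
```

Contributors could then add a counter for their own language with a single decorated function.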

vkbo commented 2 years ago

I'm planning to redesign the Build Tool for the next release cycle. I've been thinking of adding an extra feature to the build tool for analysing the structure of the manuscript. Checking the complexity of the text could be one option. It requires a fair bit of processing, so it isn't well suited for the text editor.

If I add it, I'd like to give a breakdown so that the user has some idea of the various stats that go into these metrics. That would probably make it more useful than just stating a score based on English alone.

It would also allow the user to select a way to estimate syllables. My own algorithm for English provides a good estimate, but it really is tuned to English, so letting the user select the algorithm is probably a good solution. Adding a module of such algorithms would also make it possible for other users to contribute counters for other languages. It would actually be a nice area to accept contributions.