johnfactotum / foliate

Read e-books in style
https://johnfactotum.github.io/foliate/
GNU General Public License v3.0

Add Coleman-Liau and automated readability index #372

Open digitalethics opened 4 years ago

digitalethics commented 4 years ago

It would be great to have a way to automatically calculate and display the Coleman-Liau index (CLI) and the automated readability index (ARI) for an e-book. More generally, readability scores like these measure the complexity of a text. They indicate whether a text suits a particular reader (in education) and allow a quick assessment of its difficulty (for translation). There are several readability formulas, but the two I find most useful are CLI and ARI.
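For reference, both indices are closed-form formulas over letter, word, and sentence counts. The following is a minimal sketch, not Foliate code and not the API of any existing library: the function names (`textStats`, `colemanLiau`, `automatedReadability`) are hypothetical, and the tokenization is deliberately naive (abbreviations and decimal numbers will be miscounted as sentence ends).

```javascript
// Illustrative sketch only: function names and tokenization rules are
// assumptions made for this example.

// Count letters, alphanumeric characters, words, and sentences with naive rules.
function textStats(text) {
    // A "word" is any whitespace-separated token containing a letter or digit.
    const words = text.split(/\s+/).filter(w => /[A-Za-z0-9]/.test(w)).length
    const letters = (text.match(/[A-Za-z]/g) || []).length
    const characters = (text.match(/[A-Za-z0-9]/g) || []).length
    // Each run of ., !, or ? counts as one sentence end; assume at least one sentence.
    const sentences = Math.max(1, (text.match(/[.!?]+/g) || []).length)
    return { letters, characters, words, sentences }
}

// Coleman-Liau index: 0.0588 L - 0.296 S - 15.8,
// where L = letters per 100 words and S = sentences per 100 words.
// Text without any words yields NaN.
function colemanLiau({ letters, words, sentences }) {
    const L = letters / words * 100
    const S = sentences / words * 100
    return 0.0588 * L - 0.296 * S - 15.8
}

// Automated readability index:
// 4.71 (characters / words) + 0.5 (words / sentences) - 21.43.
// Text without any words yields NaN.
function automatedReadability({ characters, words, sentences }) {
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
}
```

Both scores approximate a US school grade level, so very short or very simple samples can produce negative values. A real implementation would likely need locale-aware tokenization, which this sketch does not attempt.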

This request, perhaps alongside some others on the issue tracker, opens up a way to imagine Foliate becoming analytical: providing tools for analyzing texts and solving the problem of how to visualize that kind of data. I understand that this may be out of scope. However, looking at Foliate's evolution, it is already growing from a static e-book viewer/reader into an application that also catalogs and manages e-books in a library/gallery.

Nevertheless, I think it's important for Foliate to grow sustainably, and I hope that the many features requested by community members are not distracting you from letting the application mature around a more limited set of core functionality first. I say this especially since I have also made a request to make Foliate social to a certain degree (via group annotation, etc.), and JavaScript is probably very suitable for this. Thank you for being so responsive and open to all kinds of user requests and discussions, and for your continued dedication to Foliate and the community that is forming around it (open source publishers, academics, students, readers and writers).

johnfactotum commented 4 years ago

I think this would probably make more sense as a plugin.

digitalethics commented 4 years ago

I had a look at this and here's a brief write-up of my thoughts:

Questions:

  1. How do you go about finding what is already out there on this issue? Do you go to a portal like Libraries.io or do you just search GitHub for packages for the quantitative analysis of textual data?

  2. What best practices are there for selecting and integrating open source packages? Do you build a shortlist of suitable candidates? What are the key factors to consider when deciding which package to choose for your project? I guess these are more general questions as well.

  3. How do you quickly assess code quality and reduce risk? How do you choose between imperfect solutions if you are not going to write a new implementation from scratch?

  4. How could this data be visualized so that Foliate readers benefit most from it? The idea here is not merely to aggregate and list quantitative data from textual analysis but to present it visually in the most intuitive and accessible way. After all, this is meant to be a tool for all readers, not just data scientists or linguists.

  5. And lastly, where does all this fit best in the user interface?

Findings: Textstatistics.js and the words project (linguistic JavaScript modules) both look interesting to me, and both offer coleman-liau and automated-readability, among many others.

I think there could be two basic ways to visualize the data: graphically in the Foliate sidebar, and as a highlight overlay on the actual text, as demonstrated on Titus' Wooorm website.
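As a rough illustration of the overlay idea, each paragraph could be scored separately and mapped to a coarse level that the UI turns into a highlight colour. This is a hypothetical sketch: the names `ari`, `bucket`, and `scoreParagraphs` are made up for the example, the cut-offs of 6 and 12 are arbitrary placeholders, and the tokenization is naive.

```javascript
// Hypothetical sketch: names and thresholds are illustrative assumptions,
// not part of Foliate or of any library mentioned above.

// Automated readability index for one chunk of text, using naive tokenization.
function ari(text) {
    const words = text.split(/\s+/).filter(w => /[A-Za-z0-9]/.test(w)).length
    const characters = (text.match(/[A-Za-z0-9]/g) || []).length
    const sentences = Math.max(1, (text.match(/[.!?]+/g) || []).length)
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
}

// Map a score to a coarse level that a UI could translate into a colour.
// The cut-offs (6 and 12) are arbitrary placeholders, not established thresholds.
function bucket(score) {
    if (score < 6) return 'easy'
    if (score < 12) return 'medium'
    return 'hard'
}

// Split on blank lines and score each paragraph that contains any text.
function scoreParagraphs(text) {
    return text.split(/\n\s*\n/)
        .map(p => p.trim())
        .filter(p => /[A-Za-z0-9]/.test(p))
        .map(p => {
            const score = ari(p)
            return { text: p, score, level: bucket(score) }
        })
}
```

The same list of per-paragraph results could feed both views: the sidebar could chart the `score` values across the book, while the overlay would only need the `level` of each paragraph.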

Here's some additional background: info on calculating readability scores for content; a good overview of the different readability formulas, albeit for the quanteda R package; and an analysis of problems with the Readable.io API, which I initially referred to, that are worth taking into account.