ken107 / read-aloud

An awesome browser extension that reads aloud webpage content with one click
https://readaloud.app
MIT License

[BUG] Paragraph shorter than 100 chars got skipped #182

Closed: tanduong closed this issue 1 year ago

tanduong commented 4 years ago

https://github.com/ken107/read-aloud/blob/119ed7048caa5338ab10bdd900a18448fd56a500/js/content/html-doc.js#L17

This could be considered a bug because many paragraphs are much shorter than 100 chars. (This paragraph is 107 chars before the period.)

I wonder why we have this limit. If there has to be a limit, it should be configurable. I am using this extension on content that has a lot of short paragraphs, and skipping them like this makes the content indigestible.

ken107 commented 4 years ago

Hello, thanks for the issue report.

The algorithm actually reads aloud the parent element of any paragraph it finds. So even if your paragraph is less than 100 chars, if any of its sibling paragraphs has more than 100 chars, the entire group will be read.

The issue does arise when the page is marked up using divs instead of P's, or when each paragraph is put in its own div. In that case, small paragraphs will be skipped.

Making that tunable parameter smaller will ensure those are included, but might result in non-essential text being included as well. It's a compromise, and we don't have the data to really fine-tune it.

As you can see in the code, if it is unable to find enough text to read using threshold=100 (i.e. it cannot find a large body of text that would constitute the main article), it will try again using threshold=3, i.e. it will read everything and anything on the page.
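The behavior described above can be sketched roughly as follows. This is a simplified model, not the extension's actual code: the real implementation walks the DOM, whereas here each paragraph is a plain `{ parent, text }` object so the sketch runs anywhere. The function names, the `>=` comparison, and the `minTotalChars` cutoff for "enough text" are assumptions made for illustration.

```javascript
// Hypothetical model: each paragraph is { parent: string, text: string }.
// "parent" stands in for the DOM element that contains the paragraph.

// A parent qualifies if at least one of its paragraphs reaches the threshold.
function selectParents(paragraphs, threshold) {
  const qualifying = new Set();
  for (const p of paragraphs) {
    if (p.text.length >= threshold) qualifying.add(p.parent);
  }
  return qualifying;
}

// Read every paragraph whose parent qualified, including its short siblings.
function collectText(paragraphs, threshold) {
  const parents = selectParents(paragraphs, threshold);
  return paragraphs.filter((p) => parents.has(p.parent)).map((p) => p.text);
}

// First pass with threshold 100; if that yields too little text,
// retry with threshold 3, which picks up nearly everything.
// minTotalChars is an assumed cutoff, not a value taken from the thread.
function findTextToRead(paragraphs, minTotalChars = 1000) {
  const firstPass = collectText(paragraphs, 100);
  const total = firstPass.reduce((sum, text) => sum + text.length, 0);
  if (total >= minTotalChars) return firstPass;
  return collectText(paragraphs, 3);
}

// Example page: one long paragraph with a short sibling,
// plus a one-liner wrapped in its own div.
const page = [
  { parent: "article", text: "x".repeat(120) },
  { parent: "article", text: "Short sibling." },
  { parent: "div-7", text: "Lonely one-liner." },
];

// With threshold 100, the short sibling is kept but the lone one-liner is not.
console.log(collectText(page, 100).length);
```

In this model the one-liner in `div-7` is only picked up when the first pass finds too little text and the fallback threshold of 3 applies, which matches the skipping reported in this issue.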

mkturner commented 4 years ago

Is it possible for this to be made available in the settings pane? I would like to configure it as low as 10 or even disable it altogether. I am sure that I want all the content spoken from my source.

If you're curious what I'm reading from, I'm using the LiveBook application from Manning Publications to read their books. Also, Eloquent JavaScript.

ken107 commented 4 years ago

Are you having a problem with that book? I tested it and it works fine; nothing is skipped because the HTML is properly structured. Only in rare cases will you run into the issue described in this thread.

mkturner commented 4 years ago

The particular book I had an issue with is React Quickly by Azat Mardan, published by Manning. I am reading online with the LiveBook feature. There are several one-liners at the end of sections or before diagrams. Also, the quiz questions and summary bullet points are sometimes skipped (because they are less than 100 chars?).

ken107 commented 4 years ago

Ah, yes. I verified that indeed they're putting the [p]'s inside [div], so the default algorithm won't read the one-liners with less than 100 characters. Although, this is an e-book player, and normally we don't expect the default algorithm to work out-of-the-box for these players. Ideally we'd implement special handling for them. For example, we have special handling for Google Play, Amazon Kindle, and VitalSource e-books. We should implement one for LiveBook as well. I've added this to the special-sites ticket https://github.com/ken107/read-aloud/issues/24

But I suppose we could add an advanced setting to tweak this parameter as well. I've included it for the next release.

ken107 commented 2 years ago

9a840a503ab02599ce187080d0d3e2c1d56c7e93 Reduced the threshold to 50; we will monitor to see what the impact is.