nischayn22 / PageQuality

A MediaWiki extension to monitor and improve Page Quality
MIT License

Rudimentary fix for PageQualityScorerReadability to work for Hebrew #1

Closed: drorsnir closed this 3 years ago

drorsnir commented 3 years ago

Apparently, DOMDocument::loadHTML() needs to be told the encoding explicitly, and str_word_count() doesn't handle Unicode (multibyte) text. I fixed the first and included a replacement for str_word_count(). The replacement isn't necessarily the correct way to count words, just one that somewhat works (see links).
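
For illustration, here is a minimal sketch of both fixes. It is not the code from this PR: the helper names `loadUtf8Html()` and `unicodeWordCount()` are hypothetical, and treating a "word" as a run of Unicode letters, combining marks, and digits is an assumption that may not match what the scorer needs.

```php
<?php
// Sketch only: helper names are hypothetical, not taken from the PR.

/**
 * Parse an HTML fragment as UTF-8. Without an encoding hint,
 * DOMDocument::loadHTML() assumes ISO-8859-1 and garbles Hebrew text.
 */
function loadUtf8Html(string $html): DOMDocument {
    $doc = new DOMDocument();
    // Collect libxml warnings (e.g. unknown HTML5 tags) instead of emitting them.
    libxml_use_internal_errors(true);
    // Prepending an XML declaration is a common way to pass the encoding to libxml.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    return $doc;
}

/**
 * Count words as runs of Unicode letters, combining marks and digits.
 * Punctuation such as commas and periods is never counted.
 */
function unicodeWordCount(string $text): int {
    $count = preg_match_all('/[\p{L}\p{M}\p{N}]+/u', $text);
    return $count === false ? 0 : $count;
}

$doc = loadUtf8Html('<p>שלום עולם, זה מבחן.</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo unicodeWordCount($text), "\n"; // prints 4
```

The `/u` modifier makes PCRE treat the pattern and subject as UTF-8, which is what lets `\p{L}` match Hebrew letters. Abbreviations written with a geresh or gershayim would be split by this pattern, so it is only a starting point.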

I also included a @todo: load only the actual page content. Right now this also loads things like the "protectedpagewarning" message, which might trigger some of the scorers in the future.
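
One possible way to approach that @todo is to select only the content wrapper before extracting text. The sketch below assumes the HTML handed to the scorer contains MediaWiki's `mw-parser-output` wrapper div, which may not hold for this extension. The function name `extractContentText()`, the `notice` class, and the sample markup are hypothetical.

```php
<?php
// Sketch only: function name and sample HTML are hypothetical.

function extractContentText(string $html, string $class = 'mw-parser-output'): string {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the class as a whole token inside the class attribute.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        // No wrapper found: fall back to the text of the whole document.
        return $doc->documentElement ? $doc->documentElement->textContent : '';
    }
    return $nodes->item(0)->textContent;
}

$html = '<div class="notice">הדף מוגן</div>'
      . '<div class="mw-parser-output"><p>תוכן הדף</p></div>';
echo extractContentText($html), "\n"; // prints only the wrapped content
```

With this approach, messages rendered outside the wrapper never reach the scorers. Anything rendered inside the wrapper would still be counted.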

drorsnir commented 3 years ago

Please note that this currently doesn't count correctly: Amitay found a sentence with 14 words and a period that was somehow flagged as over the 15-word limit.