silverstripe / silverstripe-framework

Silverstripe Framework, the MVC framework that powers Silverstripe CMS
https://www.silverstripe.org
BSD 3-Clause "New" or "Revised" License

[ORM] DBText::Summary() Doesn't summarise non-latin based languages. (Chinese, Korean, Japanese) #8190

Open caffeineinc opened 6 years ago

caffeineinc commented 6 years ago

Affected Version

Affects SilverStripe Framework 4.1.1

cwp/agency-extensions               2.0.0                       Base module for the default CWP theme to add features
cwp/cwp-recipe-cms                  2.0.1                       CWP CMS requirements recipe
cwp/cwp-recipe-core                 2.0.1                       CWP core requirements recipe
cwp/cwp-recipe-search               2.0.1                       CWP search requirements recipe
cwp/starter-theme                   2.0.0                       CWP Theme
cwp/watea-theme                     dev-master cd1bdf8          CWP Watea theme
phpunit/phpunit                     5.7.27                      The PHP Unit Testing framework.
silverstripe/asset-admin            dev-bugfix/cwp-2168 7b81608 Asset management for the SilverStripe CMS
silverstripe/recipe-authoring-tools 1.0.0                       Extra tools for CMS authoring in SilverStripe
silverstripe/recipe-blog            1.0.0                       SilverStripe Blog Project Template
silverstripe/recipe-collaboration   1.0.0                       Add extra functionality to enhance CMS user collaboration
silverstripe/recipe-form-building   1.0.0                       A recipe of modules to help you build forms in SilverStripe
silverstripe/recipe-plugin          1.2.0                       Helper plugin to install SilverStripe recipes
silverstripe/recipe-reporting-tools 1.0.0                       Add extra CMS reporting tools to your SilverStripe project
silverstripe/recipe-services        1.0.0                       Add API and content service modules to your SilverStripe project
silverstripe/registry               dev-master d8976e6          Provide search and export interfaces for datasets.
silverstripe/subsites               2.0.2                       Run multiple sites from a single SilverStripe install.
squizlabs/php_codesniffer           3.3.0                       PHP_CodeSniffer tokenizes PHP, JavaScript and CSS files and detect...
tractorcow/silverstripe-fluent      4.0.2                       Simple localisation for Silverstripe

Description

DBText::Summary() doesn't summarise paragraphs written in non-Latin-based languages (Chinese, Korean, etc.). East Asian languages use a different kind of full stop (the ideographic full stop, 。), and text length is counted in characters rather than in Latin-style, space-separated "words".

See "Full stops in other scripts": https://en.wikipedia.org/wiki/Full_stop and https://en.wiktionary.org/wiki/ideographic_full_stop

Steps to Reproduce

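use SilverStripe\ORM\FieldType\DBField;
// Two paragraphs of Chinese text; each sentence ends with the ideographic full stop (。)
$textObj = DBField::create_field('HTMLFragment', '<p>新西兰是一个拥有160余种语言的国家,这体现了我们国家的多元性。总结字段也应根据中文表意句号来分割。</p><p>这个测试应该有助于掩盖它。</p>');
$result = $textObj->obj('Summary', [50])->forTemplate();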

$result contains the full text, because Summary() fails to recognise the Chinese ideographic full stop (。) as a sentence boundary.

Note: I've got a PR incoming that uses str_word_count instead of regular expressions that split on a period.
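
For illustration only, here is a minimal sketch of one way to split on both Latin and ideographic sentence endings and to measure length in characters. It is not the framework's implementation and not the str_word_count change mentioned above; the function name summariseByCharacters and its $maxChars parameter are hypothetical, and it assumes plain UTF-8 text (HTML already stripped) and the mbstring extension.

```php
<?php
// Hypothetical helper, not part of silverstripe-framework.
// Returns whole leading sentences of $text, up to $maxChars characters.
function summariseByCharacters(string $text, int $maxChars = 50): string
{
    // Split immediately after sentence-ending punctuation: Latin . ! ? and
    // the ideographic/fullwidth forms 。 ！ ？. The /u flag makes PCRE treat
    // the pattern and subject as UTF-8. This is naive: it also splits on
    // dots inside numbers or abbreviations.
    $sentences = preg_split('/(?<=[.!?。！？])/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($sentences === false) {
        // preg_split fails on invalid UTF-8; fall back to the input.
        return $text;
    }

    $summary = '';
    foreach ($sentences as $sentence) {
        // Always keep the first sentence, even if it is longer than the
        // limit; afterwards stop before the character count is exceeded.
        if ($summary !== '' && mb_strlen($summary . $sentence, 'UTF-8') > $maxChars) {
            break;
        }
        $summary .= $sentence;
    }

    return trim($summary);
}
```

Called with the plain text of the example above (for instance after strip_tags()) and a limit of 40, this should return only the first sentence, since adding the second would pass the limit.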

thejimu commented 2 years ago

Ooof. Open for like 4 years and still an issue. Major pain in my ass.

GuySartorelli commented 9 months ago

FYI There's a lot of extra discussion about this in https://github.com/silverstripe/silverstripe-framework/pull/8191