Open jmecn opened 1 month ago
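ICU4J can identify the script (writing system) of each part of a string, and it implements the Bidi algorithm for detecting bidirectional text. However, ICU4J does not handle emoji well, so emoji may need to be identified first with emoji-java or emoji-segmenter.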
ICU4J seems to be a good solution for this issue. UScriptRun can split text into runs of different scripts, and Bidi can split text into runs of different writing directions.
import com.ibm.icu.lang.UScript;
import com.ibm.icu.lang.UScriptRun;
import com.ibm.icu.text.Bidi;
import org.junit.jupiter.api.Test; // assumption: JUnit 5, since the test methods are package-private

class TestIcu4j {

    // Splits the text into runs that share one script.
    @Test void testUScriptRun() {
        String text = "Love and peace" + // Latin
                "爱与和平" + // Han
                "الحب والسلام" + // Arabic
                "사랑과 평화"; // Hangul
        UScriptRun run = new UScriptRun(text);
        while (run.next()) {
            int start = run.getScriptStart();
            int limit = run.getScriptLimit();
            int script = run.getScriptCode();
            System.out.printf("Script %s from %d to %d\n", UScript.getName(script), start, limit);
        }
        // output:
        // Script Latin from 0 to 14
        // Script Han from 14 to 18
        // Script Arabic from 18 to 30
        // Script Hangul from 30 to 36
    }

    // Splits the text into runs that share one embedding level (direction).
    @Test void testBidi() {
        String text = "Love and peace" + // Latin
                "爱与和平" + // Han
                "الحب والسلام" + // Arabic
                "사랑과 평화"; // Hangul
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        System.out.printf("isMixed:%b, runCount:%d\n", bidi.isMixed(), bidi.getRunCount());
        for (int i = 0; i < bidi.getRunCount(); i++) {
            int start = bidi.getRunStart(i);
            int limit = bidi.getRunLimit(i);
            System.out.printf("start=%d, limit=%d, level=%d\n", start, limit, bidi.getRunLevel(i));
        }
        // level: even (0) = left-to-right, odd (1) = right-to-left
        // output:
        // isMixed:true, runCount:3
        // start=0, limit=18, level=0
        // start=18, limit=30, level=1
        // start=30, limit=36, level=0
    }
}
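The two tests report script runs and direction runs separately. Before shaping, the two probably need to be combined so that each item has exactly one script and one direction. A minimal sketch of that step, assuming the run boundaries have already been collected into arrays; the class RunMerger and the method mergeBoundaries are hypothetical names and are not part of ICU4J:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class RunMerger {

    /**
     * Returns the sorted union of two boundary lists.
     * Each pair of consecutive values delimits one item [start, limit).
     */
    static List<Integer> mergeBoundaries(int[] scriptBoundaries, int[] bidiBoundaries) {
        TreeSet<Integer> merged = new TreeSet<>(); // sorted, no duplicates
        for (int b : scriptBoundaries) {
            merged.add(b);
        }
        for (int b : bidiBoundaries) {
            merged.add(b);
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        // Boundaries copied from the test output above.
        int[] scriptBoundaries = {0, 14, 18, 30, 36};
        int[] bidiBoundaries = {0, 18, 30, 36};
        List<Integer> items = mergeBoundaries(scriptBoundaries, bidiBoundaries);
        for (int i = 0; i + 1 < items.size(); i++) {
            System.out.printf("item from %d to %d%n", items.get(i), items.get(i + 1));
        }
    }
}
```

For the sample text the script boundaries already contain every bidi boundary, so the merged items are the same as the four script runs. The union only adds splits when a direction change falls inside a script run.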
Emoji Unicode Technical Standard
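Detect the different components of a String and segment it by language, punctuation, and emoji. The different languages (scripts) have to be identified first; only then can HarfBuzz be used to look up the correct glyphIndex. For reference, see pango's itemize.c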
https://gitlab.gnome.org/GNOME/pango/-/blob/main/pango/itemize.c
and the JavaScript library franc
https://github.com/wooorm/franc
Chrome source code
https://github.com/chromium/chromium/blob/main/third_party/blink/renderer/platform/fonts/script_run_iterator_test.cc
emoji-segmenter
https://github.com/google/emoji-segmenter
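To make the segmentation idea concrete, here is a rough sketch that uses only the JDK (Character.UnicodeScript) rather than ICU4J or emoji-segmenter. The class SimpleItemizer and its methods are hypothetical names. The emoji check is a deliberately incomplete code point range heuristic and not an implementation of the Unicode emoji standard. Punctuation, spaces, and other COMMON or INHERITED characters are simply attached to the run in progress, which is only one possible policy.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

final class SimpleItemizer {

    /** One run of text sharing a label such as "LATIN", "HAN" or "EMOJI". */
    static final class Item {
        final String label;
        final int start;
        final int limit;

        Item(String label, int start, int limit) {
            this.label = label;
            this.start = start;
            this.limit = limit;
        }

        @Override
        public String toString() {
            return label + "[" + start + "," + limit + ")";
        }
    }

    /** Assumption: rough ranges only; flags, keycaps and many symbols are missed. */
    static boolean looksLikeEmoji(int cp) {
        return (cp >= 0x1F300 && cp <= 0x1FAFF)  // pictographs, emoticons, transport, ...
            || (cp >= 0x2600 && cp <= 0x27BF);   // miscellaneous symbols and dingbats
    }

    /** Returns a label for the code point, or null if it should join the current run. */
    static String labelOf(int cp) {
        if (looksLikeEmoji(cp)) {
            return "EMOJI";
        }
        UnicodeScript script = UnicodeScript.of(cp);
        if (script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED) {
            return null; // spaces, punctuation, ZWJ, variation selectors
        }
        return script.name();
    }

    static List<Item> itemize(String text) {
        List<Item> items = new ArrayList<>();
        String current = null; // label of the run in progress
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String label = labelOf(cp);
            if (label != null && current == null) {
                current = label; // leading neutral characters join the first labeled run
            } else if (label != null && !label.equals(current)) {
                items.add(new Item(current, start, i));
                current = label;
                start = i;
            }
            i += Character.charCount(cp); // emoji usually take two chars
        }
        if (!text.isEmpty()) {
            items.add(new Item(current == null ? "COMMON" : current, start, text.length()));
        }
        return items;
    }

    public static void main(String[] args) {
        // "Love ", U+7231 (Han), U+1F600 (emoji), space, two Hangul syllables
        String text = "Love \u7231\uD83D\uDE00 \uD3C9\uD654";
        for (Item item : itemize(text)) {
            System.out.println(item);
        }
    }
}
```

Each resulting item could then be passed to HarfBuzz with a single script, and emoji items could be routed to a color emoji font. For real input, emoji sequences would likely need the proper data from the Unicode emoji standard or a port of emoji-segmenter instead of the range check used here.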