stdlib-js / stdlib

✨ Standard library for JavaScript and Node.js. ✨
https://stdlib.io
Apache License 2.0
4.54k stars 508 forks source link

Algorithm to reverse a string should account for extended grapheme clusters #299

Closed kgryte closed 1 year ago

kgryte commented 4 years ago

Checklist

Please ensure the following tasks are completed before filing a bug report.

Description

Description of the issue.

Encountered an error when attempting to use @stdlib/string/reverse to reverse a string which includes extended grapheme clusters. While the implementation does account for Unicode surrogate pairs and combining marks, extended grapheme clusters do not appear supported.

Addressing this issue may require using a Unicode segmentation algorithm, as discussed in the Unicode standard. One can find reference implementations in Rust and elsewhere.

Related Issues

Does this issue have any related issues?

Related issues #295.

Questions

Any questions for reviewers?

No.

Other

Any other information relevant to this issue? This may include screenshots, references, stack traces, sample output, and/or implementation notes.

N/A

Demo

If relevant, provide a link to a live demo.

N/A

Reproduction

What steps are required to reproduce the unexpected output?

In order to reproduce this bug, do the following:

Expected Results

What are the expected results?

The reversed string to account for extended grapheme clusters.

Actual Results

What are the actual results?

N/A

Environments

What environments are affected (e.g., Node v0.4.x, Chrome, IE 11)? If Node.js, include the npm version, operating system, and any other potentially relevant platform information.

The following environments are affected:

gardhr commented 4 years ago

Try something like this:


function reverse(str) {
  var result = "";
  var smx = str.length;
  var sdx = 0;
  /*
   Can't start at the end because each segment depends on the 
   previous one. So build up a list of lengths first instead. 
*/
  var counts = [];
  for (;;) {
    var skip = str.codePointAt(sdx) <= 0xffff ? 1 : 2;
    counts.push(skip);
    sdx += skip;
    if (sdx >= smx) break;
  }

  var cmx = counts.length;
  while (cmx--) {
    var skip = counts[cmx];
    /*
   Now we know how much to hop backward for the next grapheme.
   Write the first segment, then the second (if present).
*/
    var hop = skip;
    while (hop) result += str[smx - hop--];

    /*
     Move on to the next glyph. (Bounds-checking won't be needed
     here since the counts array holds an accurate list of widths.)
*/
    smx -= skip;
  }
  return result;
}
kgryte commented 4 years ago

@gardhr Thanks for the suggestion! Feel free to submit a PR along with tests. :)

Not sure how well your proposal accounts for all graphemes, as it has been a while since I last looked into it.

gardhr commented 4 years ago

Sorry, I just don't have the time at the moment. It has worked well for me over the years, I would be rather surprised if it choked on anything at all. If it does however, just give me a set of failing samples and I'll be glad to fix it.

Wish I could do more to help. Maybe once I get passed some of this backlog?

var print = console.log;
var text =
  "外国語の学習と教授Language Learning and TeachingИзучение и обучение иностранных языков語文教學・语文教学Enseñanza y estudio de idiomasИзучаване и Преподаване на Чужди Езициქართული ენის შესწავლა და სწავლება'læŋɡwidʒ 'lɘr:niŋ ænd 'ti:tʃiŋLus kawm thaib qhiaNgôn Ngữ, Sự học,말배우기와 가르치기Nauka języków obcychΓλωσσική Εκμὰθηση και Διδασκαλίαเรียนและสอนภาษา語文教學 😃, ქართული 😭, Εκμὰθηση 😈 иностранных𠀋𠀾𠁎𠁨☃★♲\u{2F804}";
var reversed = reverse(text);
print(text);
print();
print(reversed);
print();
print("Test:", reverse(reversed) == text ? "Pass" : "FAIL!");

Output:

外国語の学習と教授Language Learning and TeachingИзучение и обучение иностранных языков語文教學・语文教学Enseñanza y estudio de idiomasИзучаване и Преподаване на Чужди Езициქართული ენის შესწავლა და სწავლება'læŋɡwidʒ 'lɘr:niŋ ænd 'ti:tʃiŋLus kawm thaib qhiaNgôn Ngữ, Sự học,말배우기와 가르치기Nauka języków obcychΓλωσσική Εκμὰθηση και Διδασκαλίαเรียนและสอนภาษา語文教學 😃, ქართული 😭, Εκμὰθηση 😈 иностранных𠀋𠀾𠁎𠁨☃★♲你

你♲★☃𠁨𠁎𠀾𠀋хыннартсони 😈 ησηθὰμκΕ ,😭 ილუთრაქ ,😃 學教文語าษาภนอสะลแนยีรเαίλακσαδιΔ ιακ ησηθὰμκΕ ήκισσωλΓhcycbo wókyzęj akuaN기치르가 와기우배말,cọh ựS ,ữgN nôgNaihq biaht mwak suLŋiʃt:it' dnæ ŋin:rɘl' ʒdiwɡŋæl'აბელვაწს ად ალვაწსეშ სინე ილუთრაქицизЕ иджуЧ ан енавадоперП и енавачузИsamoidi ed oidutse y aznañesnE学教文语・學教文語вокызя хыннартсони еинечубо и еинечузИgnihcaeT dna gninraeL egaugnaL授教と習学の語国外

Test: Pass
kgryte commented 4 years ago

@gardhr Thanks for the demo!

gardhr commented 4 years ago

Yep, you're welcome.

kgryte commented 1 year ago

This was resolved in https://github.com/stdlib-js/stdlib/pull/1082.