Closed cmorbitzer closed 5 years ago
The difference is length was due to utf8ToUnicode method returning invalid Unicode codification for 😎 char, it returned 2008 instead of correct 128526 (2008 actually is U+7D8 => ߘ)
Due to invalid identification of char it returned longer Unicode char array, almost double, making a 139 UTF16 char list is longer than maximum UCS2 length so it is chopped
Then
var_dump((new SMSCounter())->utf8ToUnicode('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
array(139) {
[0] =>
int(128526)
[1] =>
int(944)
[2] =>
int(2008)
[3] =>
int(944)
[4] =>
int(2008)
[...]
}
var_dump((new SMSCounter())->count('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
class stdClass#18 (5) {
public $encoding =>
string(5) "UTF16"
public $length =>
int(139)
public $per_message =>
int(67)
public $remaining =>
int(0)
public $messages =>
int(3)
}
Current code incorrectly encodes 😎 char but thanks to #16 when correctly detecting the character as 128526 (U+1F60E) it returns actual length (70) which fits into a UCS2 message
Now
var_dump((new SMSCounter())->utf8ToUnicode('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
array(70) {
[0] =>
int(128526)
[1] =>
int(128526)
[2] =>
int(128526)
[3] =>
int(128526)
[4] =>
int(128526)
[...]
}
var_dump((new SMSCounter())->count('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
class stdClass#16 (5) {
public $encoding =>
string(5) "UTF16"
public $length =>
int(70)
public $per_message =>
int(70)
public $remaining =>
int(0)
public $messages =>
int(1)
}
If we extend the string to 80 chars it gets chopped into 2 parts
var_dump((new SMSCounter())->utf8ToUnicode('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
array(80) {
[0] =>
int(128526)
[1] =>
int(128526)
[2] =>
int(128526)
[3] =>
int(128526)
[4] =>
int(128526)
[...]
}
var_dump((new SMSCounter())->count('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
class stdClass#16 (5) {
public $encoding =>
string(5) "UTF16"
public $length =>
int(80)
public $per_message =>
int(67)
public $remaining =>
int(54)
public $messages =>
int(2)
}
Let me know if you still think it is behaving incorrectly, I could of course be wrong
My undestanding with SMS encodings is not so good either but after investigating a bit further Unicode chars over U+10000 () fall in UCS4 and is encoded as two 16-bit characters, so in our previous examples this effectively doubles the length of the message
I've updated PR #16 to reflect this requirement. This would fix the length error introduced in #14 while keeping correct Unicode encoding
@cmorbitzer could you please verify it is returning correctly now according to your tests?
The update takes care of this issue. Thanks!
It seems now that emoji are counted as 1 character when they occupy the space of 2 characters when encoded with UTF-16.
Given this statement:
This used to output:
but now outputs:
It would seem from my limited understanding of SMS character encoding that the former is correct.
Related to #10 and introduced by #14