instasent / sms-counter-php

SMS Counter and sanitize tools
MIT License
49 stars 28 forks source link

Bug counting messages with emoji #15

Closed cmorbitzer closed 5 years ago

cmorbitzer commented 5 years ago

It seems now that emoji are counted as 1 character when they occupy the space of 2 characters when encoded with UTF-16.

Given this statement:

$message = "😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎";
var_dump((new SMSCounter())->count($message));

This used to output:

{
  +"encoding": "UTF16"
  +"length": 139
  +"per_message": 67
  +"remaining": 62
  +"messages": 3
}

but now outputs:

{
  +"encoding": "UTF16"
  +"length": 70
  +"per_message": 70
  +"remaining": 0
  +"messages": 1
}

It would seem from my limited understanding of SMS character encoding that the former is correct.

Related to #10 and introduced by #14

juliangut commented 5 years ago

The difference is length was due to utf8ToUnicode method returning invalid Unicode codification for 😎 char, it returned 2008 instead of correct 128526 (2008 actually is U+7D8 => ߘ)

Due to invalid identification of char it returned longer Unicode char array, almost double, making a 139 UTF16 char list is longer than maximum UCS2 length so it is chopped

Then

var_dump((new SMSCounter())->utf8ToUnicode('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
array(139) {
  [0] => 
  int(128526)
  [1] => 
  int(944)
  [2] => 
  int(2008)
  [3] => 
  int(944)
  [4] => 
  int(2008)
  [...]
}

var_dump((new SMSCounter())->count('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
class stdClass#18 (5) {
  public $encoding =>
  string(5) "UTF16"
  public $length =>
  int(139)
  public $per_message =>
  int(67)
  public $remaining =>
  int(0)
  public $messages =>
  int(3)
}

Current code incorrectly encodes 😎 char but thanks to #16 when correctly detecting the character as 128526 (U+1F60E) it returns actual length (70) which fits into a UCS2 message

Now

var_dump((new SMSCounter())->utf8ToUnicode('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
array(70) {
  [0] => 
  int(128526)
  [1] => 
  int(128526)
  [2] => 
  int(128526)
  [3] => 
  int(128526)
  [4] => 
  int(128526)
  [...]
}

var_dump((new SMSCounter())->count('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
class stdClass#16 (5) {
  public $encoding =>
  string(5) "UTF16"
  public $length =>
  int(70)
  public $per_message =>
  int(70)
  public $remaining =>
  int(0)
  public $messages =>
  int(1)
}

If we extend the string to 80 chars it gets chopped into 2 parts

var_dump((new SMSCounter())->utf8ToUnicode('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
array(80) {
  [0] =>
  int(128526)
  [1] =>
  int(128526)
  [2] =>
  int(128526)
  [3] =>
  int(128526)
  [4] =>
  int(128526)
  [...]
}

var_dump((new SMSCounter())->count('😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎😎'));
class stdClass#16 (5) {
  public $encoding =>
  string(5) "UTF16"
  public $length =>
  int(80)
  public $per_message =>
  int(67)
  public $remaining =>
  int(54)
  public $messages =>
  int(2)
}

Let me know if you still think it is behaving incorrectly, I could of course be wrong

cmorbitzer commented 5 years ago

I'm not completely familiar with SMS encoding myself, but Twilio here recommends this tool to count segments, and that tool says that 3 segments is correct.

Additionally, sending this message with Twilio results in 3 segments being sent. Here is the result:

Screen Shot 2019-06-03 at 10 31 10 AM
juliangut commented 5 years ago

My undestanding with SMS encodings is not so good either but after investigating a bit further Unicode chars over U+10000 () fall in UCS4 and is encoded as two 16-bit characters, so in our previous examples this effectively doubles the length of the message

I've updated PR #16 to reflect this requirement. This would fix the length error introduced in #14 while keeping correct Unicode encoding

@cmorbitzer could you please verify it is returning correctly now according to your tests?

cmorbitzer commented 5 years ago

The update takes care of this issue. Thanks!