Single emoji comments maybe systematically rejected as "That didn't work"?

ofsound commented 5 years ago

sudara commented 5 years ago

Rails believes the comments are duplicates.

sudara commented 5 years ago

Yikes, this is pretty open

https://github.com/sudara/alonetone/blob/b55f57db2013f5c538fd74277548c81f7fc00e4c/app/models/comment.rb#L41

sudara commented 5 years ago

Mysql considers these 💀 and 🤑 emojis to be equivalent:

irb(main):003:0> Comment.where(remote_ip: '192.168.1.107', body: '🤑')
  Comment Load (115.8ms)  SELECT `comments`.* FROM `comments` WHERE `comments`.`deleted_at` IS NULL AND `comments`.`remote_ip` = '192.168.1.107' AND `comments`.`body` = '🤑' LIMIT 11
=> #<ActiveRecord::Relation [#<Comment id: 238128, commentable_type: "Asset", commentable_id: 112990, body: "💀", created_at: "2019-09-11 20:26:51", updated_at: "2019-09-11 20:26:51", commenter_id: 1, user_id: 1157, remote_ip: "192.168.1.107", user_agent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_1 like Mac O...", referrer: "http://neve.local:3000/", is_spam: false, private: false, body_html: nil, deleted_at: nil>]>

Ruby doesnt:

irb(main):004:0> '🤑' == '💀'
=> false

sudara commented 5 years ago

Not all of mysql on production is set to utf8mb4 :/

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8               |
| character_set_connection | utf8               |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8               |
| character_set_server     | utf8               |
| character_set_system     | utf8               |
| collation_connection     | utf8_general_ci    |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | utf8_unicode_ci    |
+--------------------------+--------------------+
10 rows in set (0.00 sec)

sudara commented 5 years ago

https://mathiasbynens.be/notes/mysql-utf8mb4

sudara commented 5 years ago

I set this on production server config

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

sudara commented 5 years ago

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8mb4            |
| character_set_connection | utf8mb4            |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8mb4            |
| character_set_server     | utf8mb4            |
| character_set_system     | utf8               |
| collation_connection     | utf8mb4_unicode_ci |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | utf8mb4_unicode_ci |
+--------------------------+--------------------+

sudara commented 5 years ago

Still seeing the issue despite collation being correct:

mysql> show table status like 'comments';
+----------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+
| Name     | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time         | Check_time | Collation          | Checksum | Create_options | Comment |
+----------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+
| comments | InnoDB |      10 | Dynamic    |   19 |            862 |       16384 |               0 |        65536 |         0 |     1069363021 | 2019-09-11 19:06:30 | 2019-09-12 20:55:03 | NULL       | utf8mb4_unicode_ci |     NULL |                |         |
+----------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+
1 row in set (0.06 sec)

sudara commented 5 years ago

Sushi-beer issue? https://github.com/rails/rails/issues/33596#issuecomment-412687948

sudara commented 5 years ago

Mysql itself does view these strings as eqiv.

mysql> SELECT STRCMP('💀','🤑');
+-----------------+
| STRCMP('?','?') |
+-----------------+
|               0 |
+-----------------+
1 row in set (0.00 sec)

mysql> SELECT STRCMP('💀','123');
+-------------------+
| STRCMP('?','123') |
+-------------------+
|                 1 |
+-------------------+
1 row in set (0.00 sec)

sudara commented 5 years ago

Answer seems to be removing the collation from the server config and using Mysql 8.0

mysql> SELECT STRCMP('💀','🤑');
+-----------------+
| STRCMP('?','?') |
+-----------------+
|              -1 |
+-----------------+
1 row in set (0.01 sec)

sudara commented 5 years ago

After upgrading to percona 8.x, it seems happy, also through ruby


irb(main):013:0> string = "SELECT STRCMP('💀','🤑');"
=> "SELECT STRCMP('\xC3\xB0\xC2\x9F\xC2\x92\xC2\x80','\xC3\xB0\xC2\x9F\xC2\xA4\xC2\x91');"
irb(main):014:0> result = ActiveRecord::Base.connection.execute(string)
=> #<Mysql2::Result:0x0000561e47018518 @query_options={:as=>:array, :async=>false, :cast_booleans=>false, :symbolize_keys=>false, :database_timezone=>:utc, :application_timezone=>nil, :cache_rows=>true, :connect_flags=>2148540933, :cast=>true, :default_file=>nil, :default_group=>nil, :adapter=>"mysql2", :encoding=>"utf8mb4", :pool=>50, :variables=>{"sql_mode"=>"TRADITIONAL"}, :flags=>2}, @server_flags={:no_good_index_used=>false, :no_index_used=>false, :query_was_slow=>false}>
irb(main):015:0> result.first
=> [1]
``

sudara commented 5 years ago

Rails still not happy:

irb(main):014:0>     Comment.where(remote_ip: ip, body: '💀').where('created_at > ?', 1.hour.ago).first
  Comment Load (112.2ms)  SELECT `comments`.* FROM `comments` WHERE `comments`.`deleted_at` IS NULL AND `comments`.`remote_ip` = '167.15.12.1' AND `comments`.`body` = '💀' AND (created_at > '2019-09-12 19:02:48.030817') ORDER BY `comments`.`id` ASC LIMIT 1
=> #<Comment id: 238129, commentable_type: "Asset", commentable_id: 112992, body: "🤑", created_at: "2019-09-12 20:01:49", updated_at: "2019-09-12 20:01:49", commenter_id: nil, user_id: 2760, remote_ip: "167.15.12.1", user_agent: nil, referrer: nil, is_spam: false, private: false, body_html: nil, deleted_at: nil>

sudara commented 5 years ago

Going to give up here and just let single characters through for now.

sudara commented 5 years ago

@Manfred You might have ideas here...

Manfred commented 5 years ago

It looks like MySQL and Percona use version 4 of the Unicode collation data (from 2005) by default and that does not have support for newer emoji. I would expect them to fall back to binary collation for missing codepoints, but apparently they are not that smart or it's just a bug. You have to figure out the most recent collation supported by the system and use that.

SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%utf8mb4%' order by COLLATION_NAME;

You will see numbers like 0900 (version 9.0) and 520 (version 5.2). The trick is to find the latest and preferably one that also includes unicode for language independent collation or otherwise one with en in it because most of the text is going to be English anyway. The most recent published Unicode version is 12.1.0.

On the Mac and possibly also on Linux there is utf8mb4_0900_as_cs, which is probably what you want. It stands for [version 9.0, Accent Sensitive, Case Sensitive](https://www.monolune.com/what-is-the-utf8mb4_0900_ai_ci-collation/.

You can test the collations by adding them to a query:

SELECT * FROM comments WHERE body = '💀' COLLATE utf8mb4_0900_as_cs;
SELECT * FROM comments ORDER BY body COLLATE utf8mb4_0900_ai_ci;

sudara commented 5 years ago

I think we are ok an the db level, as mysql 8.x has unicode v9.0 as the default.

https://mysql.wisborg.dk/2018/07/28/which-character-set-should-you-use-in-mysql/

The default collation for utf8mb4 in MySQL 8.0 is utf8mb4_0900_ai_ci

Manfred commented 5 years ago

I would prefer utf8mb4_0900_as_cs, but that could be a regional preference.

sudara / alonetone

Single emoji comments maybe systematically rejected as "That didn't work"? #732