Open ofsound opened 5 years ago
Rails believes the comments are duplicates.
Yikes, this is pretty open
Mysql considers these 💀 and 🤑 emojis to be equivalent:
irb(main):003:0> Comment.where(remote_ip: '192.168.1.107', body: '🤑')
Comment Load (115.8ms) SELECT `comments`.* FROM `comments` WHERE `comments`.`deleted_at` IS NULL AND `comments`.`remote_ip` = '192.168.1.107' AND `comments`.`body` = '🤑' LIMIT 11
=> #<ActiveRecord::Relation [#<Comment id: 238128, commentable_type: "Asset", commentable_id: 112990, body: "💀", created_at: "2019-09-11 20:26:51", updated_at: "2019-09-11 20:26:51", commenter_id: 1, user_id: 1157, remote_ip: "192.168.1.107", user_agent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_1 like Mac O...", referrer: "http://neve.local:3000/", is_spam: false, private: false, body_html: nil, deleted_at: nil>]>
Ruby doesnt:
irb(main):004:0> '🤑' == '💀'
=> false
Not all of mysql on production is set to utf8mb4 :/
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.00 sec)
I set this on production server config
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
Still seeing the issue despite collation being correct:
mysql> show table status like 'comments';
+----------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+----------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+
| comments | InnoDB | 10 | Dynamic | 19 | 862 | 16384 | 0 | 65536 | 0 | 1069363021 | 2019-09-11 19:06:30 | 2019-09-12 20:55:03 | NULL | utf8mb4_unicode_ci | NULL | | |
+----------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+
1 row in set (0.06 sec)
Sushi-beer issue? https://github.com/rails/rails/issues/33596#issuecomment-412687948
Mysql itself does view these strings as eqiv.
mysql> SELECT STRCMP('💀','🤑');
+-----------------+
| STRCMP('?','?') |
+-----------------+
| 0 |
+-----------------+
1 row in set (0.00 sec)
mysql> SELECT STRCMP('💀','123');
+-------------------+
| STRCMP('?','123') |
+-------------------+
| 1 |
+-------------------+
1 row in set (0.00 sec)
Answer seems to be removing the collation from the server config and using Mysql 8.0
mysql> SELECT STRCMP('💀','🤑');
+-----------------+
| STRCMP('?','?') |
+-----------------+
| -1 |
+-----------------+
1 row in set (0.01 sec)
After upgrading to percona 8.x, it seems happy, also through ruby
irb(main):013:0> string = "SELECT STRCMP('💀','🤑');"
=> "SELECT STRCMP('\xC3\xB0\xC2\x9F\xC2\x92\xC2\x80','\xC3\xB0\xC2\x9F\xC2\xA4\xC2\x91');"
irb(main):014:0> result = ActiveRecord::Base.connection.execute(string)
=> #<Mysql2::Result:0x0000561e47018518 @query_options={:as=>:array, :async=>false, :cast_booleans=>false, :symbolize_keys=>false, :database_timezone=>:utc, :application_timezone=>nil, :cache_rows=>true, :connect_flags=>2148540933, :cast=>true, :default_file=>nil, :default_group=>nil, :adapter=>"mysql2", :encoding=>"utf8mb4", :pool=>50, :variables=>{"sql_mode"=>"TRADITIONAL"}, :flags=>2}, @server_flags={:no_good_index_used=>false, :no_index_used=>false, :query_was_slow=>false}>
irb(main):015:0> result.first
=> [1]
``
Rails still not happy:
irb(main):014:0> Comment.where(remote_ip: ip, body: '💀').where('created_at > ?', 1.hour.ago).first
Comment Load (112.2ms) SELECT `comments`.* FROM `comments` WHERE `comments`.`deleted_at` IS NULL AND `comments`.`remote_ip` = '167.15.12.1' AND `comments`.`body` = '💀' AND (created_at > '2019-09-12 19:02:48.030817') ORDER BY `comments`.`id` ASC LIMIT 1
=> #<Comment id: 238129, commentable_type: "Asset", commentable_id: 112992, body: "🤑", created_at: "2019-09-12 20:01:49", updated_at: "2019-09-12 20:01:49", commenter_id: nil, user_id: 2760, remote_ip: "167.15.12.1", user_agent: nil, referrer: nil, is_spam: false, private: false, body_html: nil, deleted_at: nil>
Going to give up here and just let single characters through for now.
@Manfred You might have ideas here...
It looks like MySQL and Percona use version 4 of the Unicode collation data (from 2005) by default and that does not have support for newer emoji. I would expect them to fall back to binary collation for missing codepoints, but apparently they are not that smart or it's just a bug. You have to figure out the most recent collation supported by the system and use that.
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%utf8mb4%' order by COLLATION_NAME;
You will see numbers like 0900 (version 9.0) and 520 (version 5.2). The trick is to find the latest and preferably one that also includes unicode
for language independent collation or otherwise one with en
in it because most of the text is going to be English anyway. The most recent published Unicode version is 12.1.0.
On the Mac and possibly also on Linux there is utf8mb4_0900_as_cs
, which is probably what you want. It stands for [version 9.0, Accent Sensitive, Case Sensitive](https://www.monolune.com/what-is-the-utf8mb4_0900_ai_ci-collation/.
You can test the collations by adding them to a query:
SELECT * FROM comments WHERE body = '💀' COLLATE utf8mb4_0900_as_cs;
SELECT * FROM comments ORDER BY body COLLATE utf8mb4_0900_ai_ci;
I think we are ok an the db level, as mysql 8.x has unicode v9.0 as the default.
https://mysql.wisborg.dk/2018/07/28/which-character-set-should-you-use-in-mysql/
The default collation for utf8mb4 in MySQL 8.0 is utf8mb4_0900_ai_ci
I would prefer utf8mb4_0900_as_cs
, but that could be a regional preference.