ivaldi / brimir

Email helpdesk built using Ruby on Rails and Zurb Foundation
http://getbrimir.com
GNU Affero General Public License v3.0
1.38k stars 300 forks source link

Error importing non UTF-8 characters in emails #188

Closed SurgeMonkeyAU closed 9 years ago

SurgeMonkeyAU commented 9 years ago

Some emails sent to BRIMIR appear to be encoded with UTF-8, however they seem to contain non UTF-8 characters. It is possible to track some of the characters using http://www.i18nqa.com/debug/utf8-debug.html

An example of the email content trying to be inserted is You’ll need to when you can clearly see that the content was supposed to have been "You'll need to" Here are some frequent occurrences

‪‪‫   
Administrator\n\n\n\n                        ‫*Important Notice*\n
Kind regards,‪‪‫\n>\n> ‪‪‫  \n>\n> Name Here

This is the error log from logs/production.log (This is from a spam marketing email as an example. Most emails with errors will contain many more encoding issues.)

I, [2015-03-20T16:54:37.254368 #15585]  INFO -- : Started POST "/tickets.json" for 127.0.0.1 at 2015-03-20 16:54:37 +1100
I, [2015-03-20T16:54:37.258107 #15585]  INFO -- : Processing by TicketsController#create as JSON
I, [2015-03-20T16:54:37.258773 #15585]  INFO -- :   Parameters: {<snipped json>}
D, [2015-03-20T16:54:37.573504 #15585] DEBUG -- :   (0.6ms)  SELECT COUNT(*) FROM `email_addresses` WHERE (`email_addresses`.`verification_token` IS NOT NULL)ESC[0m
I, [2015-03-20T16:54:37.573948 #15585]  INFO -- : Received mail (251.2ms)
D, [2015-03-20T16:54:37.574004 #15585] DEBUG -- : <email content>
D, [2015-03-20T16:54:37.581005 #15585] DEBUG -- :   User Load (1.1ms)  SELECT  `users`.* FROM `users` WHERE `users`.`email` = '<email>'  ORDER BY `users`.`id` ASC LIMIT 1
D, [2015-03-20T16:54:37.720004 #15585] DEBUG -- :    (0.4ms)BEGIN
D, [2015-03-20T16:54:37.722914 #15585] DEBUG -- :   User Exists (0.4ms)  SELECT  1 AS one FROM `users` WHERE `users`.`email` = BINARY '<email>' LIMIT 1
D, [2015-03-20T16:54:37.726141 #15585] DEBUG -- :   SQL (0.7ms) INSERT INTO `users` (`email`, `encrypted_password`, `created_at`, `updated_at`) VALUES ('<email>', '<password>', '2015-03-20 05:54:37.723497', '2015-03-20 05:54:37.723497')
D, [2015-03-20T16:54:37.729406 #15585] DEBUG -- :    (1.8ms)  COMMIT
D, [2015-03-20T16:54:37.731635 #15585] DEBUG -- :    (0.1ms)  BEGIN
D, [2015-03-20T16:54:37.734863 #15585] DEBUG -- :   SQL (0.7ms)  INSERT INTO `tickets` [....]
I, [2015-03-20T16:54:37.738072 #15585]  INFO -- : Completed 500 Internal Server Error in 479ms
F, [2015-03-20T16:54:37.742106 #15585] FATAL -- : ActiveRecord::StatementInvalid (
Mysql2::Error: Incorrect string value: '\xE2\x80\x8B\x0D\x0A\x0D...' for column 'content' at row 1: 
INSERT INTO `tickets` (`user_id`, `subject`, `content`, `message_id`, `content_type`, `created_at`, `updated_at`) 
VALUES (67, '<domain>', 
'Hi,\r\n\r\nWe specialise in social-media-management and work with ma
ny companies like  \r\nyours by taking care of their Facebook, Twitter, Linkedin and other rele  \r\nvant social media sites.\r\n\r\n\r\nExposure to these sites can be vital to
 the health of your business and the  \r\nright marketing campaign can launch your business to a new level benefiting  \r\nfrom the thousands of visitors to these sites.\r\n\r\
n\r\nWe\'d like to talk to you to listen to your needs to see if we can become a  \r\nvital asset to your company by managing your social media campaign. We have  \r\nvery affo
rdable packages in place which are designed to take care of all  \r\nyour web needs and create more enquiries,sales&profit.\r\n\r\n\r\nIf you would like to have a no-obligation
 discussion with me, please reply  \r\nwith your phone number?\r\n\r\n\r\nBest regards,\r\n\r\n<E2><80><8B>\r\n\r\n<name>\r\n\r\nSocial-Media-Specialists\r\n\r\n\r\n\r\n<C5>
<A1>mm\r\nSMM Pty Ltd\r\nHeadquarters: <details>\r\nOther Offices: <details>\r\n\r\n\r\n\r\n\r\n', 
'047d7b2e501a8909280512e15299@google.com', 
'text', 
'2015-03-20 05:54:37.732659', 
'2015-03-20 05:54:37.732659')):
  app/mailers/ticket_mailer.rb:94:in `receive'
  app/controllers/tickets_controller.rb:117:in `create'

In the above, you can see the text at fault <E2><80><8B> which matches to \xE2\x80\x8B in the query error. This is an example of how some of the content can appear in the logs/production.log

<p class=3DMsoNormal style=3D'text-autospace:none'><span s=
tyle=3D'font-size:13.0pt;font-family:"Arial",sans-serif;mso-fareast-langu=
age:EN-AU'>=E2=80=AA=E2=80=AA=E2=80=AB</span><span style=3D'mso-fareast-l=
anguage:EN-AU'><o:p></o:p></span></p>

In order to get around some of the issue, we have had to change the MySQL collation from latin1_swedish_ci to utf8_general_ci to be able to manually import the queries using the failed SQL queries from the log. However changing the collation doesn't allow the mails to be imported automatically, only via a manual process.

System Information: Ubuntu 14.04.2 LTS mysql Ver 14.14 Distrib 5.5.41, for debian-linux-gnu (x86_64) using readline 6.3

Mail Alias:

brimir: "|/bin/bash /home/brimir/brimir/script/post-mail https://localhost:3000/tickets.json"
support: support, brimir

Startup Script:

#!/bin/bash
BRIMIR=/home/brimir/brimir
PASSENGER=/usr/bin/passenger
export RAILS_ENV=production
$PASSENGER start \
--ssl \
--ssl-certificate-key=/etc/ssl/private/<ssl>.key \
--ssl-certificate=/etc/ssl/private/<ssl>.crt -d \
--user brimir \
--log-file $BRIMIR/brimir.log \
--pid-file $BRIMIR/brimir.pid
frenkel commented 9 years ago

Could you post the (anonimized) headers of the incoming email? There is probably no charset or an invalid charset in there.

SurgeMonkeyAU commented 9 years ago

Here is an example of one of the emails that contained the problem. This particular mail contained base64 encoded data, but the same issue has been present in Content-Transfer-Encoding: quoted-printable emails also.

Each section of the email appeared to be encoded as UTF-8

Return-Path: <client.name@clientdomain.com>
X-Original-To: support@brimir.ourdomain.com
Delivered-To: support@brimir.ourdomain.com
Received: from filter.ourdomain.com (filter.ourdomain.com [xxx.xx.xxx.36])
    by brimir.ourdomain.com (Postfix) with ESMTPS id 4A8C97A2D9
    for <support@brimir.ourdomain.com>; Wed, 29 Apr 2015 14:19:19 +1000 (AEST)
X-OurServers-BaruwaFW-From: client.name@clientdomain.com
X-OurServers-BaruwaFW: Not scanned: please contact OurServers for details
X-OurServers-BaruwaFW-ID: 1YnJSw-0002pS-Fe
X-OurServers-BaruwaFW-Information: Please contact OurServers for more information
Received: from [xxx.xx.xxx.15] (helo=backends.localdomain)
    by filter.ourdomain.com with esmtp (Baruwa 2.0)
    (envelope-from <client.name@clientdomain.com>)
    id 1YnJSw-0002pS-Fe ret-id none;
    for support@brimir.ourdomain.com; Wed, 29 Apr 2015 14:19:18 +1000
Received: from filter.ourdomain.com (filter.ourdomain.com [xxx.xx.xxx.36])
    by backends.localdomain (Postfix) with ESMTPS id AF4A11181096
    for <support@ourdomain.com>; Wed, 29 Apr 2015 14:19:16 +1000 (AEST)
X-OurServers-BaruwaFW-From: client.name@clientdomain.com
X-OurServers-BaruwaFW: Not scanned: please contact OurServers for details
X-OurServers-BaruwaFW-ID: 1YnJSk-0002p8-1u
X-OurServers-BaruwaFW-Information: Please contact OurServers for more information
Received: from mail-hk1bn0109.outbound.protection.outlook.com ([134.170.140.109] helo=APAC01-HK1-obe.outbound.protection.outlook.com)
    by filter.ourdomain.com with esmtps (TLS1.1:RSA_AES_256_CBC_SHA1:256)
    (Baruwa 2.0)
    (envelope-from <client.name@clientdomain.com>)
    id 1YnJSk-0002p8-1u ret-id none;
    for support@ourdomain.com; Wed, 29 Apr 2015 14:19:06 +1000
Received: from SG2PR03MB0538.clientmail.prod.outlook.com (25.160.235.15) by
 SG2PR03MB0538.clientmail.prod.outlook.com (25.160.235.15) with Microsoft SMTP
 Server (TLS) id 15.1.148.16; Wed, 29 Apr 2015 04:18:56 +0000
Received: from SG2PR03MB0538.clientmail.prod.outlook.com ([25.160.235.15]) by
 SG2PR03MB0538.clientmail.prod.outlook.com ([25.160.235.15]) with mapi id
 15.01.0148.008; Wed, 29 Apr 2015 04:18:56 +0000
From: Debi Pulfer <client.name@clientdomain.com>
To: "support@ourdomain.com" <support@ourdomain.com>
Subject: <the subject line>
Thread-Topic: <the subject line>
Thread-Index: AdCCM5pcgNarzgsHQda7z9iL1hMxwQ==
Date: Wed, 29 Apr 2015 04:18:56 +0000
Message-ID: <SG2PR03MB0538BD73EE9D53216C22746CA8D70@SG2PR03MB0538.clientmail.prod.outlook.com>
Accept-Language: en-AU, en-US
Content-Language: en-US
X-MS-Has-Attach: yes
X-MS-TNEF-Correlator:
authentication-results: ourdomain.com; dkim=none (message not signed)
 header.d=none;
x-originating-ip: [119.17.57.2]
x-microsoft-antispam: UriScan:;BCL:0;PCL:0;RULEID:;SRVR:SG2PR03MB0538;
x-microsoft-antispam-prvs: <SG2PR03MB0538240747E926BA023D8FA0A8D70@SG2PR03MB0538.clientmail.prod.outlook.com>
x-exchange-antispam-report-test: UriScan:;
x-exchange-antispam-report-cfa-test: BCL:0;PCL:0;RULEID:(601004)(5005006)(3002001);SRVR:SG2PR03MB0538;BCL:0;PCL:0;RULEID:;SRVR:SG2PR03MB0538;
x-forefront-prvs: 05610E64EE
x-forefront-antispam-report: SFV:NSPM;SFS:(10019020)(66654002)(40100003)(76576001)(62966003)(450100001)(77156002)(86362001)(15975445007)(18206015028)(74316001)(5001960100001)(107886001)(99936001)(54356999)(102836002)(50986999)(229853001)(122556002)(2351001)(16236675004)(19625215002)(33656002)(17760045003)(66066001)(74482002)(2900100001)(92566002)(19300405004)(19580395003)(2656002)(2501003)(19617315012)(19580405001)(87936001)(569274001)(46102003)(19627595001)(110136001);DIR:OUT;SFP:1102;SCL:1;SRVR:SG2PR03MB0538;H:SG2PR03MB0538.clientmail.prod.outlook.com;FPR:;SPF:None;MLV:sfv;LANG:en;
Content-Type: multipart/related;
    boundary="_004_SG2PR03MB0538BD73EE9D53216C22746CA8D70SG2PR03MB0538apcp_";
    type="multipart/alternative"
MIME-Version: 1.0
X-OriginatorOrg: clientdomain.com
X-MS-Exchange-CrossTenant-originalarrivaltime: 29 Apr 2015 04:18:56.0542
 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: 9712242a-35d3-4991-a4a2-5e426ccb393f
X-MS-Exchange-Transport-CrossTenantHeadersStamped: SG2PR03MB0538
X-Baruwa-Relayed-From: xxx.xx.xxx.15

--_004_SG2PR03MB0538BD73EE9D53216C22746CA8D70SG2PR03MB0538apcp_
Content-Type: multipart/alternative;
    boundary="_000_SG2PR03MB0538BD73EE9D53216C22746CA8D70SG2PR03MB0538apcp_"

--_000_SG2PR03MB0538BD73EE9D53216C22746CA8D70SG2PR03MB0538apcp_
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

<snip>

--_000_SG2PR03MB0538BD73EE9D53216C22746CA8D70SG2PR03MB0538apcp_
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: base64

<snip>

--_000_SG2PR03MB0538BD73EE9D53216C22746CA8D70SG2PR03MB0538apcp_--

--_004_SG2PR03MB0538BD73EE9D53216C22746CA8D70SG2PR03MB0538apcp_
Content-Type: image/jpeg; name="image001.jpg"
Content-Description: image001.jpg
Content-Disposition: inline; filename="image001.jpg"; size=2629;
    creation-date="Wed, 29 Apr 2015 04:18:55 GMT";
    modification-date="Wed, 29 Apr 2015 04:18:55 GMT"
Content-ID: <image001.jpg@01D0826B.CFAE42C0>
Content-Transfer-Encoding: base64

<snip>

--_004_SG2PR03MB0538BD73EE9D53216C22746CA8D70SG2PR03MB0538apcp_--

Non-base64 encoded email, this was plaintext.

Return-Path: <info@thirdparty.com>
X-Original-To: support@brimir.ourdomain.com
Delivered-To: support@brimir.ourdomain.com
Received: from filter.ourdomain.com (filter.ourdomain.com [xxx.xx.xxx.36])
    by brimir.ourdomain.com (Postfix) with ESMTPS id 651EA7A2DF
    for <support@brimir.ourdomain.com>; Thu, 30 Apr 2015 14:18:11 +1000 (AEST)
X-OurServers-BaruwaFW-From: info@thirdparty.com
X-OurServers-BaruwaFW-SpamScore: s
X-OurServers-BaruwaFW: Not scanned: please contact OurServers for details
X-OurServers-BaruwaFW-ID: 1YnfvN-00009a-7k
X-OurServers-BaruwaFW-Information: Please contact OurServers for more information
Received: from [xxx.xx.xxx.15] (helo=backends.localdomain)
    by filter.ourdomain.com with esmtp (Baruwa 2.0)
    (envelope-from <info@thirdparty.com>)
    id 1YnfvN-00009a-7k ret-id none;
    for support@brimir.ourdomain.com; Thu, 30 Apr 2015 14:18:09 +1000
Received: from filter.ourdomain.com (filter.ourdomain.com [xxx.xx.xxx.36])
    by backends.localdomain (Postfix) with ESMTPS id 7BD521180DF4
    for <support@ourdomain.com>; Thu, 30 Apr 2015 14:18:08 +1000 (AEST)
X-OurServers-BaruwaFW-From: info@thirdparty.com
X-OurServers-BaruwaFW-SpamScore: s
X-OurServers-BaruwaFW: Not scanned: please contact OurServers for details
X-OurServers-BaruwaFW-ID: 1YnfvC-000090-Mq
X-OurServers-BaruwaFW-Information: Please contact OurServers for more information
Received: from mail-wg0-f54.google.com ([74.125.82.54])
    by filter.ourdomain.com with esmtps (TLS1.1:RSA_AES_128_CBC_SHA1:128)
    (Baruwa 2.0)
    (envelope-from <info@thirdparty.com>)
    id 1YnfvC-000090-Mq ret-id none;
    for support@ourdomain.com; Thu, 30 Apr 2015 14:17:59 +1000
Received: by wgin8 with SMTP id n8so48691907wgi.0
        for <support@ourdomain.com>; Wed, 29 Apr 2015 21:17:49 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20130820;
        h=x-gm-message-state:mime-version:in-reply-to:references:date
         :message-id:subject:from:to:content-type;
        bh=W60tcvTJWL58/hSqlN7UPIKEsRWB6HGVtV5HdVYPy+U=;
        b=WQT6zyli1Sn4gKFk+0eExKB5M/T9B7dH5GNzjHhMMPG0zf+Wb/gpAJh6TEtsqO+i4F
         sn4c29dSa/wylSk/utJUghOv1Ez5VrDCf5Vg1WsD8wWUqDH3FwDKXd6xDgGVnWi19BiT
         LjGB6b2u7IInu8b8/xrTDTClu0QhhgULgh7Cq104aX9R1xSqMtlqLt7hTAJX55g6m5xr
         zNKFSqk2FTdmLBMwCVC/IOypRn2LORQ9g3tdO43+cw4Cqmy6UJ1h0djt9/DstUldGCFR
         KX4sY7Bypz+M+S5DwHZQXphvdAJzM5BEYdD6KMRmOfnpGXBTyxi9gAtEaZryTmeIloFu
         LkFQ==
X-Gm-Message-State: ALoCoQlXmPxIHGNN7WV3B3LkzGLakcs7w+EHqoDvTQapGrIe6esXufwiGOYe4cdwNWUzl/gV11Ox
MIME-Version: 1.0
X-Received: by 10.180.93.193 with SMTP id cw1mr1734878wib.12.1430367468769;
 Wed, 29 Apr 2015 21:17:48 -0700 (PDT)
Received: by 10.194.16.35 with HTTP; Wed, 29 Apr 2015 21:17:48 -0700 (PDT)
X-Originating-IP: [124.243.138.179]
Received: by 10.194.16.35 with HTTP; Wed, 29 Apr 2015 21:17:48 -0700 (PDT)
In-Reply-To: <26a2-55418880-1-6d44ab00@15442659>
References: <CAELfwW2mtLGTEJZg0YzcpR6moN5DLhY5dx2L14cj7VAWZ+EwHg@mail.gmail.com>
    <26a2-55418880-1-6d44ab00@15442659>
Date: Thu, 30 Apr 2015 14:17:48 +1000
Message-ID: <CAELfwW1Cyr9k1cpxU34AMATjKhuEaL5TPY0MH2mzATRWd9=bMg@mail.gmail.com>
Subject: <subject here>
From: Third Party <info@thirdparty.com>
To: support@ourdomain.com
Content-Type: multipart/alternative; boundary=f46d043bdededfe6e60514e96032
X-Baruwa-Relayed-From: xxx.xx.xxx.15

--f46d043bdededfe6e60514e96032
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

---------- Forwarded message ----------

<More email text above here>

Kind regards,=E2=80=AA=E2=80=AA=E2=80=AB

=E2=80=AA=E2=80=AA=E2=80=AB

<Client Name Here>

--f46d043bdededfe6e60514e96032
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<snip>

--f46d043bdededfe6e60514e96032--
frenkel commented 9 years ago

Did you change the collation for the database, alle the tables and all the fields in the tables? (Three different things unfortunately.) After changing these, only new emails will be correct, the old ones need a query to convert them correctly I believe.

Would it be possible for you to test this with PostgreSQL as well? We only use PostgreSQL and have no such problems, so I think it might be a MySQL configuration bug we're tracking. If we can find out the exact MySQL option that causes this, we can add it to the README.

SurgeMonkeyAU commented 9 years ago

Ah actually we had missed the collation on the table fields, so only the MySQL table itself had been updated.

I can try testing out Postgres, but i will need to build a new VM to test with. (Which is OK, because I wanted to test a safe upgrade method).

I will try build the new VM with Postgres support tomorrow as well as keep an eye out for more import errors.

frenkel commented 9 years ago

So changing the table collation fixed it? Then there is no need to test PostgreSQL, we'll add this fix to the README :)

SurgeMonkeyAU commented 9 years ago

Will try to verify shortly. We are requesting some of the clients which had the problem to send new test emails through to the system as we can't seem to generate them ourselves.

SurgeMonkeyAU commented 9 years ago

We haven't had any more occurrences of the issue for a few days now. So i believe updating the MySQL table fields to utf8_general_ci appears to have fixed the problem with the imports.

If the we come across the issue again, i will re-open.