thoughtbot / griddler

Simplify receiving email in Rails (deprecated)
http://griddler.io/
MIT License
Invalid Byte Sequence Error UTF-8 #72

Closed chuckblake closed 11 years ago

chuckblake commented 11 years ago

On some incoming emails, I'm receiving the following error when using SendGrid Inbound Parse -> Griddler.

ArgumentError: invalid byte sequence in UTF-8

griddler/emails#create

vendor/bundle/ruby/1.9.1/gems/mail-2.4.4/lib/mail/core_extensions/string.rb:4

any ideas or suggestions on how to fix this?

Here's some additional information from the backtrace:

vendor/bundle/ruby/1.9.1/gems/mail-2.4.4/lib/mail/core_extensions/string.rb:4:in `gsub'
vendor/bundle/ruby/1.9.1/gems/mail-2.4.4/lib/mail/core_extensions/string.rb:4:in `to_crlf'
vendor/bundle/ruby/1.9.1/gems/mail-2.4.4/lib/mail/header.rb:39:in `initialize'
vendor/bundle/ruby/1.9.1/gems/griddler-0.5.0/lib/griddler/email_parser.rb:45:in `new'
vendor/bundle/ruby/1.9.1/gems/griddler-0.5.0/lib/griddler/email_parser.rb:45:in `extract_headers'
vendor/bundle/ruby/1.9.1/gems/griddler-0.5.0/lib/griddler/email.rb:59:in `extract_headers'
vendor/bundle/ruby/1.9.1/gems/griddler-0.5.0/lib/griddler/email.rb:21:in `initialize'
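For readers trying to reproduce this, here is a minimal sketch of how an `ArgumentError: invalid byte sequence in UTF-8` can come out of a `gsub` call like the one at the top of the backtrace. The header string and the `normalize_line_endings` helper are hypothetical stand-ins, not the mail gem's actual code; the sketch only assumes that the incoming header contains a byte that is not valid UTF-8.

```ruby
# Hypothetical raw header line. "\xE9" is "é" in ISO-8859-1, but as a lone
# byte it is not a valid UTF-8 sequence.
raw_header = "Subject: Caf\xE9 menu\n"

# Hypothetical helper in the spirit of a CRLF normalizer. Matching a regexp
# against a string with a broken encoding is what raises.
def normalize_line_endings(str)
  str.gsub(/\r?\n/, "\r\n")
end

error_message =
  begin
    normalize_line_endings(raw_header)
    nil
  rescue ArgumentError => e
    e.message
  end

puts "valid UTF-8? #{raw_header.valid_encoding?}"
puts "error: #{error_message.inspect}"
```

Checking `valid_encoding?` before handing a string to regexp-based code is a cheap way to find out which field of the POST is the offending one.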

theycallmeswift commented 11 years ago

The fix is to prune invalid UTF-8 bytes from the input: http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
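As a rough illustration of what "pruning" could look like, the sketch below drops any bytes that are not valid UTF-8. The helper name `prune_invalid_utf8` is hypothetical and this is not Griddler's implementation. It assumes `String#scrub` is available (Ruby 2.1 and later); on older Rubies such as the 1.9 series in the backtrace, it falls back to a round trip through UTF-16LE, which is one commonly used workaround.

```ruby
# Hypothetical helper: returns a copy of str that is valid UTF-8,
# with any invalid byte sequences removed.
def prune_invalid_utf8(str)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  if utf8.respond_to?(:scrub)
    # Ruby 2.1+: replace each invalid sequence with the empty string.
    utf8.scrub("")
  else
    # Older Rubies: transcoding to another encoding and back forces
    # validation, so invalid bytes can be dropped on the way.
    utf8.encode(Encoding::UTF_16LE, invalid: :replace, replace: "")
        .encode(Encoding::UTF_8)
  end
end

dirty = "We\x92re giving away prizes"   # "\x92" is not valid UTF-8
clean = prune_invalid_utf8(dirty)

puts clean.valid_encoding?
puts clean
```

Note that this discards the offending bytes rather than recovering the character they were meant to be, so it avoids the exception at the cost of losing data.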

Not sure if that should be Griddler's responsibility or the end user's though.

calebhearth commented 11 years ago

@jayroh did a little work on UTF-8 stuff a while back. He might have input here.

jayroh commented 11 years ago

@chuckblake these are tough to diagnose without getting into the guts of the exact message that caused the error. I had seen this a few times on a project and had to dig through the stack and the params POSTed to the controller. Do you have that info? If so, would you be able to write a failing test for this?

I wrote a post on our blog, actually, about how to tackle something like this -> http://robots.thoughtbot.com/post/42664369166/fight-back-utf-8-invalid-byte-sequences

It's not specific to Griddler; it covers the UTF-8 issue in general.

dorra commented 11 years ago

http://robots.thoughtbot.com/post/42664369166/fight-back-utf-8-invalid-byte-sequences

@jayroh This helps, but only a bit. If you receive emails via SendGrid that contain special characters (umlauts), those characters are cut out. I do not know whether this is a SendGrid problem or a Griddler problem.

Here is a gist of the Griddler email object: https://gist.github.com/dorra/6354910
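One possible explanation for umlauts being cut out, offered as an assumption rather than a diagnosis of SendGrid or Griddler: if the body arrives in a single-byte charset such as ISO-8859-1 and is treated as UTF-8, stripping invalid bytes deletes exactly those characters. The sketch below contrasts stripping with transcoding from the real source charset. The `to_utf8` helper and its `source_charset` argument are hypothetical.

```ruby
# "Grüße" encoded as ISO-8859-1: ü is 0xFC and ß is 0xDF.
latin1_bytes = "Gr\xFC\xDFe".b

# Treating those bytes as UTF-8 and stripping what is invalid loses the umlauts.
stripped = latin1_bytes.dup.force_encoding(Encoding::UTF_8).scrub("")

# Hypothetical helper: label the bytes with their real charset, then transcode.
def to_utf8(bytes, source_charset)
  bytes.dup
       .force_encoding(source_charset)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

converted = to_utf8(latin1_bytes, "ISO-8859-1")

puts stripped
puts converted
```

Transcoding only helps when the sender's charset is actually known, so it depends on that information surviving the trip through the inbound webhook.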

theycallmeswift commented 11 years ago

Sounds like we need to open up a SendGrid bug report. /cc @scottmotte

jayroh commented 11 years ago

I'm going to give it a shot to see if I can replicate this but it'll be tough.

@chuckblake @dorra are you guys of the opinion this is as basic as an umlaut that's causing this?

chuckblake commented 11 years ago

Here is a subject line from one of the emails that are erroring out on me:

"The Great TOH Giveaway is Back! We’re giving away $530,324 in prizes." I think it's the ’ in "We’re" that is causing the problem.
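To check whether a curly apostrophe could plausibly trigger the error, the sketch below looks at how that character is represented in two encodings. It assumes, without evidence from the actual email, that the subject may have been sent as Windows-1252 and then treated as UTF-8.

```ruby
# The subject fragment from the report, with the curly apostrophe as U+2019.
subject = "We\u2019re giving away $530,324 in prizes."

# In UTF-8 the apostrophe is the three bytes 0xE2 0x80 0x99.
utf8_bytes = subject.encode(Encoding::UTF_8).bytes

# In Windows-1252 it is the single byte 0x92.
cp1252 = subject.encode("Windows-1252")
cp1252_bytes = cp1252.bytes

# If those Windows-1252 bytes are labelled as UTF-8, the string is invalid,
# and regexp-based code would be expected to raise on it.
mislabelled = cp1252.dup.force_encoding(Encoding::UTF_8)

puts utf8_bytes.each_cons(3).include?([0xE2, 0x80, 0x99])
puts cp1252_bytes.include?(0x92)
puts mislabelled.valid_encoding?
```

If the apostrophe arrives as proper UTF-8 it is harmless, so the character alone does not settle the question; the raw bytes of the POST do.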

Chuck

On Thu, Aug 29, 2013 at 11:32 AM, Joel Oliveira notifications@github.com wrote:

> I'm going to give it a shot to see if I can replicate this but it'll be tough.
>
> @chuckblake @dorra are you guys of the opinion this is as basic as an umlaut that's causing this?

jayroh commented 11 years ago

@dorra @theycallmeswift @calebthompson I've pushed a commit above :point_up: in a branch to try and address this.

@chuckblake come to think of it I'm not sure we're sanitizing subjects yet so this is a great heads up. I'll take your use case (thank you for providing that, by the way - greatly appreciated) and see if I can get some test coverage around that case.
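For illustration only, here is one way every string field, the subject included, could be cleaned before parsing. The `deep_clean_utf8` helper and the shape of `params` are hypothetical and do not reflect the commit referenced in this thread. The sketch assumes Ruby 2.1 or later for `String#scrub`.

```ruby
# Hypothetical helper: walks nested hashes and arrays and returns a copy in
# which every string is valid UTF-8 (invalid bytes are dropped).
def deep_clean_utf8(value)
  case value
  when String
    value.dup.force_encoding(Encoding::UTF_8).scrub("")
  when Hash
    value.each_with_object({}) { |(key, v), out| out[key] = deep_clean_utf8(v) }
  when Array
    value.map { |v| deep_clean_utf8(v) }
  else
    value
  end
end

# Hypothetical params, loosely modelled on an inbound email POST.
params = {
  subject: "We\x92re giving away prizes",
  to: ["hi@example.com"],
  attachments: 0
}

cleaned = deep_clean_utf8(params)
puts cleaned[:subject]
```

Cleaning the whole structure in one pass means a newly added field does not need its own sanitizing call.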

theycallmeswift commented 11 years ago

Commit lgtm. Good work.

motdotla commented 11 years ago

(concerning "If you receive emails via sendgrid and these contain special characters (Umlaute), these are cut out.", I've filed a ticket and we are looking into it here at SendGrid)

jayroh commented 11 years ago

Closing after fix in 34e01c0f54aabf042991344861872cca102dafd8