toland / patron

Ruby HTTP client based on libcurl
http://toland.github.com/patron/
MIT License
541 stars 73 forks source link

[WIP] Rework charset coercion #122

Closed julik closed 8 years ago

julik commented 8 years ago

@toland can you take a look and tell me if this matches your ideas about the charset handling rework? The only noticeable difference is that implicit coercion for Response#body is removed altogether, and body is assumed to be binary.

julik commented 8 years ago

@imajes @zachad @teonimesic I see you guys all have forks of the gem with alterations to charset handling. Will this change help you? It is a bit less guessing and less magical than some would like, but it trades some convenience for clarity.

toland commented 8 years ago

@julik I think I was planning to do something more complicated that accomplished basically the same thing. I really like your approach.

julik commented 8 years ago

Ok. @toland the only thing I am sorta doubtful is whether we should give a "preferential" treatment to UTF8 and return a UTF8 string from #body if possible. The approach in this PR might break existing code if body returns a binary str, and the calling code does not expect it. What do you think?

toland commented 8 years ago

@julik I am in favor of returning UTF8 when possible. That seems like it would be what the user wants most of the time.

julik commented 8 years ago

TBH I am in doubt because if this. What I wanted to achieve is to have one method that would offer "raw" body (or as close to raw as possible) without any smarts. This is then supplemented by a method that does perform coercions and conversions. Granted, I can just add a "preemptive" opportunistic force_encoding() call on the result expecting that the user just wants to have UTF-8 by default, but then the body() method also tries to be smart, while my intention was to separate the "smart" handling path from the "dumb" one. So I am a bit divided. You think preemptively assuming that the user wants UTF-8 if nothing else is specified is a good idea?

toland commented 8 years ago

I think that having the body encoded as UTF-8 by default will cause the "least surprise" with users. Most of the time it will be right, and a few times it will not. That having been said, I don't feel strongly about it. You make some good technical arguments for why raw would be a better choice. If you think that raw is a better option, then lets do that.

I will say that, in general, it probably isn't worth getting wrapped around the spokes on a decision like this. Go with your gut and make the decision that seems best. If it turns out that it isn't the best thing, we can adjust later.

AlexRiedler commented 8 years ago

Not sure if this though helps as well.

I was having issues in production with encoding and unicode characters; in particular:

Encoding::UndefinedConversionError: "\xB8" from ASCII-8BIT to UTF-8

I believe this might be because of Emoji's that are contained in the charset results coming out of ElasticSearch; I haven't went deep into it yet though.

I feel like this is cause we force to ASCII encoding; then for to default (which is UTF-8 for my system); thus it thinks there is no correct path.