Closed julik closed 8 years ago
@imajes @zachad @teonimesic I see you guys all have forks of the gem with alterations to charset handling. Will this change help you? It is a bit less guessing and less magical than some would like, but it trades some convenience for clarity.
@julik I think I was planning to do something more complicated that accomplished basically the same thing. I really like your approach.
Ok. @toland the only thing I am sorta doubtful is whether we should give a "preferential" treatment to UTF8 and return a UTF8 string from #body
if possible. The approach in this PR might break existing code if body
returns a binary str, and the calling code does not expect it. What do you think?
@julik I am in favor of returning UTF8 when possible. That seems like it would be what the user wants most of the time.
TBH I am in doubt because if this. What I wanted to achieve is to have one method that would offer "raw" body (or as close to raw as possible) without any smarts. This is then supplemented by a method that does perform coercions and conversions. Granted, I can just add a "preemptive" opportunistic force_encoding()
call on the result expecting that the user just wants to have UTF-8 by default, but then the body()
method also tries to be smart, while my intention was to separate the "smart" handling path from the "dumb" one. So I am a bit divided. You think preemptively assuming that the user wants UTF-8 if nothing else is specified is a good idea?
I think that having the body encoded as UTF-8 by default will cause the "least surprise" with users. Most of the time it will be right, and a few times it will not. That having been said, I don't feel strongly about it. You make some good technical arguments for why raw would be a better choice. If you think that raw is a better option, then lets do that.
I will say that, in general, it probably isn't worth getting wrapped around the spokes on a decision like this. Go with your gut and make the decision that seems best. If it turns out that it isn't the best thing, we can adjust later.
Not sure if this though helps as well.
I was having issues in production with encoding and unicode characters; in particular:
Encoding::UndefinedConversionError: "\xB8" from ASCII-8BIT to UTF-8
I believe this might be because of Emoji's that are contained in the charset results coming out of ElasticSearch; I haven't went deep into it yet though.
I feel like this is cause we force to ASCII encoding; then for to default (which is UTF-8 for my system); thus it thinks there is no correct path.
@toland can you take a look and tell me if this matches your ideas about the charset handling rework? The only noticeable difference is that implicit coercion for
Response#body
is removed altogether, and body is assumed to be binary.