Closed judofyr closed 11 years ago
Summoning @wycats. He wanted this cleaned up before Rails could adopt Tilt. I plead ignorance.
Calling in some other template engine hackers who might be interested:
Tilt: @rtomayko Sinatra: @rkh Rails: @josh Slim: @stonean, @minad, @fredwu Haml: @nex3 Issue #66: @julian7
Tilt is working fine at the moment, so this isn't a pressing issue; take your time and let's come up with the best solution together :-)
The Padrino guys might be interested too I guess: @nesquena and @DAddYE
If you try to find file encoding (i.e. a client send you somehow, uploaded somewhere or brought to you by usb or whatnot) you may could get some help from http://prometheus.rubyforge.org/cmess/
charset = CMess::GuessEncoding::Automatic.guess(input)
In Sinatra we adopted the "If I don't know the encoding, use UTF-8" way, which is why default_encoding
was introduced in the first place. Encoding is a really hairy thing and the ruby behavior currently differs not only between Windows and Unix but also between Linux and OSX. Take into account that as soon as one string used in a template or the template itself is BINARY, Ruby will raise an exception on interpolation. It is hard already to have proper encoding for user input over HTTP (the HTTP client does not tell you what encoding form data is in). Also note that there was a common agreement on ruby-core to have 1.9.3 default to Unicode rather than making any default dependent on the OS. I would love to keep changes in a separate branch for now and have that branch intensely tested on different platforms with different settings. Targeting 1.4 is a good idea imo.
100% agree with @rkh, In my opinion we can wait some months and stress test it... or instead wait ruby 1.9.3.
Related (extended) discussion about this same issue. Keep in mind while this started as a Padrino bug, it runs way deeper back to this issue with Tilt:
https://github.com/padrino/padrino-framework/issues/519#comment_1129109
I think we do need to address this. As it stands templates cannot contain UTF-8 characters inside them if Encoding.default_internal
is set to Encoding::UTF_8
unless an encoding is set to every single template explicitly. In Sinatra, this is hackily avoided since Encoding.default_internal
is set to nil. But that decision will come back to bite us relatively quickly. Tilt needs to respect default_external and not default to parsing every template as ASCII-8BIT.
Quoting @nex3 in that other thread because this seems to make the most sense:
The UTF-8-over-HTTP problem is going to exist just as much with BINARY templates as with default_external-encoded ones. Fundamentally, you just can't hide from users the fact that they must specify the encodings of their external files.
From a template engine's point of view, Rails' solution is pretty much ideal: use default_external (which defaults to UTF-8) when loading a template, but provide a hook for template engines to recognize encoding comments. If no encoding comment is found and the file isn't valid in default_external, throw an error.
@nex3, @judofyr, @nesquena, @wycats: What about https://gist.github.com/964024#file_tilt.diff (see discussion in Padrion issue).
I will run this against the simplest failing case I outlined here: https://github.com/padrino/padrino-framework/issues/519#issuecomment-1120895 and see if it works later and report back. Thanks for whipping up this patch.
Agreed with @nesquena. Just got bit by the inability to render CoffeeScript templates which contain multibyte characters (Tilt reads them in as ASCII). Here's the monkeypatch:
Tilt::CoffeeScriptTemplate.class_eval do
alias original_prepare prepare
def prepare
original_prepare
data.force_encoding @default_encoding
end
end
Yeah this is still a lingering problem although admittedly we fixed it hackily by simply no longer setting the correct encodings in 1.9.2. This has unfortunate ramifications but it is better then the alternative from the padrino developers perspective until this gets resolved at the Tilt level.
/cc @apotonick, @brianmario
Some more eyeballs on this issue potentially.
Also wanted to make sure everyone saw this link from @nesquena earlier since it has some really good discussion around how much we should rely on Encoding.default_external
:
https://github.com/padrino/padrino-framework/issues/519
@rkh How does that patch factor in? Do you feel like that addresses the main issues with template files on disk?
I mostly agree with @nex3 on https://github.com/padrino/padrino-framework/issues/519#issuecomment-1129156 (also quoted by @nesquena above).
Unless you use an encoding detection library like charlock_holmes, the only way to reliably know what the actual encoding of the template off the filesystem is to have the coder specify it using a magic header. Encoding.default_external
should be used as the fallback, and if it blows up I think it's perfectly reasonable to raise that to the user asking them to specify the encoding of the offending template using a magic header. They should only need to do that for templates that aren't compatible with Encoding.default_external
.
Then again, a valid point could be made to say that we should never raise an exception but instead just do our best to display the content - even if it ends up corrupt once sent down to the browser. Maybe that could be a configurable switch? raise_encoding_errors
or something? I could go either way, but adding more config options feels like it should be a last-resort.
Then again, a valid point could be made to say that we should never raise an exception but instead just do our best to display the content - even if it ends up corrupt once sent down to the browser
That decision can still be left up to the app, though, I think. If you wanted non-draconian encoding error handling you'd set things up to use BINARY everywhere, essentially mimicking 1.8 behavior.
This raises a really good point though. If we move to Encoding.default_external
as a default for templates read from disk, which is starting to feel right to me, then shit is going to blow up on existing apps since we'll be draconian by default. I'm going to want to do a major rev on the version number if that's the case.
EDIT: s/Encoding.default_encoding/Encoding.default_external/
@brianmario 100% agree.
I don't know how much cost in terms of performances reading everything in BINARY we should check this; I really hate the magic comment & c. Btw setting Encoding.default_internal
or Encoding.default_external
is quite simple so the major problem for the developer is to have all files with the same encoding. I think this isn't and bad behavior almost for me. So, my vote is to use Encoding.default_external
.
I'm leaning toward @rkh's patch with maybe an additional layer where you can override default_external
specifically for tilt.
Tilt.default_external_encoding
, nil
by default.Encoding.default_external
when Tilt.default_external_encoding
is nil
.:default_encoding
option when given.This would let you Tilt.default_external_encoding = 'utf8'
if you wanted to ensure templates are treated as utf-8 but don't want to change Encoding.default_external
for whatever reason. You could also set Tilt.default_external_encoding = 'BINARY'
to get the current behavior.
Tilt 1.4 could ship with Tilt.default_external_encoding = 'BINARY'
and Tilt 2.0 would ship with Tilt.default_external_encoding = nil
.
@rtomayko Yeah I think we are all on the same page. The Tilt.default_external_encoding
which will eventually default to Encoding.default_external
is a fairly safe approach. This allows for a smooth transition from the existing behavior. I am happy to test any branch with these adjustments in Padrino to try and reproduce the error as outlined in the old issue. If it works there with utf-8 encodings set, then that should be a good indicator.
After figuring out the encoding, should we leave it by that or call String#encode
(with no arguments) to convert everything to the same encoding internally? It's what Rails does and @wycats recommends, and should ease dealing with templates in different encoding (imagine templates shipping with a gem and such) but it might mess with the all-binary scenario.
FWIW the Encoding.default_external=
method raises a warning with ruby -w
. Not sure what ruby-core has planned for it, but that warning kinda makes me feel like they might disable the ability to set it from inside a running app (would have to be set from $LC_TYPE
or $LANG
I guess?)`. I don't know either way, but just wanted to make us all aware.
Blah. Yeah that's a whole other dimension to this I haven't considered. We may need Tilt.default_internal_encoding
as well, with semantics similar to Tilt.default_external_encoding
.
@rtomayko I suggest to keep it more strict only to Encoding.default_external
, at the moment I don't know when can be useful have two separate encodings, but is true that we don't break compatibility.
Always use magic comment when found in template (guess the engine needs to support this?)
As said I don't love this, but maybe others would like to use it.
Interesting, wasn't aware of that warning when setting default_external
. We would still be able to access the default encoding type through that even if it isn't expected to be set explicitly. Relying on the default external encoding and encoding everything to that internally strikes me as the most consistent approach that is least likely to be an annoyance for the end user developer as @rkh noted above.
/cc @raggi (would be interested to hear your thoughts)
Always use magic comment when found in template (guess the engine needs to support this?)
Instead of this I think we should just have a method on Tilt::Template (detect_encoding
maybe) which returns nil
by default and can be overridden by the template mapper. It's up to the template mapper to decide if it wants to use magic comments or if it just defaults to BINARY
.
After figuring out the encoding, should we leave it by that or call
String#encode
(with no arguments) to convert everything to the same encoding internally? It's what Rails does and @wycats recommends, and should ease dealing with templates in different encoding (imagine templates shipping with a gem and such) but it might mess with the all-binary scenario.
Well, the output from the template engine isn't guaranteed to be a string. It could be an object or an Array (think Rack-streaming) instead. Beside, it's pretty simple for anyone who uses Tilt to add an #encode
at the end.
By the way: "@konstantinhaase any reason not to just extract what rails is doing? I spent a lot of time on it and don't want to deal with the bugs again." - @wycats - http://twitter.com/wycats/status/113458749854842880
@rkh Which is ... ?
Bugs?
@rtomayko: dunno, haven't looked into it yet.
-- Gesendet von meinem Palm Prē Ryan Tomayko <reply@reply.github.com> schrieb am 13.09.2011 13:02:
@rkh Which is ... ?
Reply to this email directly or view it on GitHub:
https://github.com/rtomayko/tilt/issues/75#issuecomment-2085839
+1 to what rails is doing
@judofyr Thanks for running that down. Most of AV's behavior seems to make good sense with Tilt, although I think we need a way to enable / disable some of it.
To summarize AV's behavior:
Determining template source encoding:
Encoding.default_external
.Template source transcoding:
String#force_encoding
.Encoding.default_internal
with String#encode!()
.Template source to Ruby source conversion:
Encoding.default_internal
into template specific handler.default_internal
.encode!()
to guarantee the ruby source is default_internal
.module_eval
.All that's left is calling the method, which should return a string encoded as default_internal
.
AV seems to assume Encoding.default_internal
will be set. Tilt needs to work when default_internal
is nil, which complicates things somewhat.
One thing that absolutely should happen is: for anything that tilt reads from disk using an IO object it creates, do something like:
data.force_encoding(magic || default_encoding || Encoding.default_external)
data.encode!
Note that the transcode step is a noop when Encoding.default_internal
is nil.
Some other things we should consider supporting:
force_encoding(encoding)
on all template source strings when detected.The main issue I'm struggling with in all this is conversion to default_internal
of template source strings returned from a reader block. I really hoped we could punt that responsibility to the caller but it doesn't work out. Here's my reasoning:
default_internal
regardless of external encoding is a worthwhile feature and makes sense for many environments.Essentially I think we should use Encoding.default_internal
as a flag that says "always convert template source from the detected template source encoding to default_internal
" regardless of where the source comes from. Perhaps additionally an option is needed to turn that behavior off at the Template instance level; otherwise, evaluating templates under other encodings becomes impossible when Encoding.default_internal
is set.
To some extent, encodings should be handled individually by each handler. Not saying tilt shouldn't do anything, but I think each handler class should opt into the magic comment syntax and automatic transcoding. Not everything needs this. The coffee handler only works with utf-8, you can't write CS in anything else. Also sass/less can handle encodings via @charset
css directive.
Think the important part is defining the contact for how all handlers should be expected to behave.
Ryan's outline looks good.
@josh Good to know. Yeah, I think we can get some basic stuff into Template
without too much work but then we're really going to need to audit each template engine individually.
On Thu, Sep 15, 2011 at 2:50 AM, Ryan Tomayko reply@reply.github.com wrote:
@josh Good to know. Yeah, I think we can get some basic stuff into
Template
without too much work but then we're really going to need to audit each template engine individually.Are you talking about Tilt::Template or AV::Template here?
Reply to this email directly or view it on GitHub: https://github.com/rtomayko/tilt/issues/75#issuecomment-2099903
A more concrete proposal with code: #107
Sorry guys that I suddenly appear from my bear lair (actually I've encountered the problem yesterday and saw that almost everybody stopped discussing it by May), but ... can anybody say how to conquer this issue when I have it with Padrino stack? I use haml/markdown/ruby_vars in UTF-8. How to settle them all?
@argent-smith Try:
# config/boot.rb
Padrino.after_load do
Encoding.default_internal = nil
end
does that help?
It helped until I used the single UTF-8 encoded variable. So now I have UTF-8 haml rendered well and "ncompatible character encodings: UTF-8 and ASCII-8BIT" when I try to do "= utf8_string_returning_method" in the haml template.
The next step would be to try and create a simple stripped down test case of this failure, preferably in Sinatra as a reference. Then if you can, check out this tilt patch from @rtomayko: https://github.com/rtomayko/tilt/pull/107/files, and see if with that applied the issue goes away.
I'll finish the test suite by tomorrow.
See my test suite argent-smith/sinatra-test.
It looks like on clean Sinatra with stock Tilt the issue isn't reproduced: everything goes just fine. I'll make yet another test suite with Padrino and tell you what happens.
Well, after adding markdown to my test I've reproduced the issue. Actually markdown was in my app when I've first encountered the issue. The error doesn't depend on a particular markdown implementation: I've tried kramdown and redcarpet with the same result. For the further details see my test suite mentioned above.
Great, I was hoping you could isolate this test with sinatra alone, since Padrino doesn't really mess with encodings. So the above repo fails with stock Tilt with Sinatra? Can you try again with the patched Tilt that Ryan has a pull request for and see if that makes the issue go away?
I'll try it ASAP
@nesquena @rtomayko I've got a conflict when tried to merge encodings
to master
: see /lib/tilt/template.rb in gist 1374322. Would you tell me how to resolve it?
So... are we going to honor Encoding.default_external
?
@djanowski doesn't help in my case (xcuse my stupidity ;-)
Any progress with that issue?
Templates are very much based on files, which therefore means we'll have to worry about encodings. There are two ways templates can be passed to an engine: through a block (which returns a string:
klass.new { 'foo' }
) or through a filename.Currently, encodings are handled as such:
For all cases, you can pass
:default_encoding => "FOO"
as an option which will set the@default_encoding
-ivar. In addition, if the precompiled code (when using precompiled mode) includes a magic comment (# coding: FOO
) OR a @default_encoding is set, the generated code will include a magic comment before the preamble.My thoughts
Some observation:
Encoding.default_external
or the:default_encoding
option)Encoding.default_external/internal
. It should only get a template (encoded in some encoding) together with a guess from Tilt's side (which handles the file reading).Let's go over each of the use cases and how I believe we should change Tilt:
Fetched through block
Any template that's fetched through a block already has an encoding. I think we should keep the current behaviour such that the template keeps its encoding and
@default_encoding
is set tooptions[:default_encoding]
(which defaults to nil).Fetched through file system
We don't know the encoding, so we'll read the file in binary mode and set
@default_encoding
tooptions[:default_encoding] || Encoding.default_external
(this last one is different from today).The precompiled code from the template engine
Because tilt.rb is written in UTF-8 and the precompiled code will be manipulated with literal strings from tilt.rb, the precompiled code generated by the template engine must be compatible with UTF-8 (not a code change, just a spec).
The encoding of the final generated code
The encoding of the final generated code should be decided by the template engine. There are two ways: the precompiled code is encoded in a specific encoding or the precompiled code includes a magic comment. Tilt should not add a magic comment based on
@default_encoding
because that's just a hint from Tilt's side and it's the engine responsibility to handle it correctly. (code change).Why
@default_encoding
is only a hintEven when you pass
:default_encoding
as an option (and it's not inherited fromEncoding.default_external
when reading from the file system) we'll have to depend on that the template engine handles it. The engine might have a different way of checking the encoding of the template, so we can't just@data.encode(@default_encoding)
in case it leads to an encoding error. The default encoding is after all only that: the default encoding, which the engine should fall back to if it can't guess it.How template engine should handle the encoding
Many simple template engine don't really need to work on the encoding-level, so from theirs point of view, this should be enough:
If you really need to be fully encoding-aware before you parse and compile, you should do something like this:
Your thoughs
What do you think?