rtomayko / tilt

Generic interface to multiple Ruby template engines
http://github.com/rtomayko/tilt
Other
1.95k stars 225 forks source link

How should we handle encodings? #75

Closed judofyr closed 11 years ago

judofyr commented 13 years ago

Templates are very much based on files, which therefore means we'll have to worry about encodings. There are two ways templates can be passed to an engine: through a block (which returns a string: klass.new { 'foo' }) or through a filename.

Currently, encodings are handled as such:

For all cases, you can pass :default_encoding => "FOO" as an option which will set the @default_encoding-ivar. In addition, if the precompiled code (when using precompiled mode) includes a magic comment (# coding: FOO) OR a @default_encoding is set, the generated code will include a magic comment before the preamble.

My thoughts

Some observation:

Let's go over each of the use cases and how I believe we should change Tilt:

Fetched through block

Any template that's fetched through a block already has an encoding. I think we should keep the current behaviour such that the template keeps its encoding and @default_encoding is set to options[:default_encoding] (which defaults to nil).

Fetched through file system

We don't know the encoding, so we'll read the file in binary mode and set @default_encoding to options[:default_encoding] || Encoding.default_external (this last one is different from today).

The precompiled code from the template engine

Because tilt.rb is written in UTF-8 and the precompiled code will be manipulated with literal strings from tilt.rb, the precompiled code generated by the template engine must be compatible with UTF-8 (not a code change, just a spec).

The encoding of the final generated code

The encoding of the final generated code should be decided by the template engine. There are two ways: the precompiled code is encoded in a specific encoding or the precompiled code includes a magic comment. Tilt should not add a magic comment based on @default_encoding because that's just a hint from Tilt's side and it's the engine responsibility to handle it correctly. (code change).

Why @default_encoding is only a hint

Even when you pass :default_encoding as an option (and it's not inherited from Encoding.default_external when reading from the file system) we'll have to depend on that the template engine handles it. The engine might have a different way of checking the encoding of the template, so we can't just @data.encode(@default_encoding) in case it leads to an encoding error. The default encoding is after all only that: the default encoding, which the engine should fall back to if it can't guess it.

How template engine should handle the encoding

Many simple template engine don't really need to work on the encoding-level, so from theirs point of view, this should be enough:

def compile(template, default_encoding)
  data = template.dup.force_encoding("BINARY")
  enc = encoding_specified_in_template(data) # extract magic comment or whatnot
  compiled = parse_and_compile(data)
  compiled.force_encoding(enc || default_encoding || template.encoding)
end

If you really need to be fully encoding-aware before you parse and compile, you should do something like this:

def compile(template, default_encoding)
  data = template.dup.force_encoding("BINARY")
  enc = encoding_specified_in_template(data) || default_encoding || template.encoding
  data.force_encoding(enc)
  compiled = parse_and_compile(data)
  compiled.force_encoding(enc) # You don't need this if you rather generate a magic comment
end

Your thoughs

What do you think?

josh commented 13 years ago

Summoning @wycats. He wanted this cleaned up before Rails could adopt Tilt. I plead ignorance.

judofyr commented 13 years ago

Calling in some other template engine hackers who might be interested:

Tilt: @rtomayko Sinatra: @rkh Rails: @josh Slim: @stonean, @minad, @fredwu Haml: @nex3 Issue #66: @julian7

Tilt is working fine at the moment, so this isn't a pressing issue; take your time and let's come up with the best solution together :-)

judofyr commented 13 years ago

The Padrino guys might be interested too I guess: @nesquena and @DAddYE

ghost commented 13 years ago

If you try to find file encoding (i.e. a client send you somehow, uploaded somewhere or brought to you by usb or whatnot) you may could get some help from http://prometheus.rubyforge.org/cmess/

charset = CMess::GuessEncoding::Automatic.guess(input)

rkh commented 13 years ago

In Sinatra we adopted the "If I don't know the encoding, use UTF-8" way, which is why default_encoding was introduced in the first place. Encoding is a really hairy thing and the ruby behavior currently differs not only between Windows and Unix but also between Linux and OSX. Take into account that as soon as one string used in a template or the template itself is BINARY, Ruby will raise an exception on interpolation. It is hard already to have proper encoding for user input over HTTP (the HTTP client does not tell you what encoding form data is in). Also note that there was a common agreement on ruby-core to have 1.9.3 default to Unicode rather than making any default dependent on the OS. I would love to keep changes in a separate branch for now and have that branch intensely tested on different platforms with different settings. Targeting 1.4 is a good idea imo.

DAddYE commented 13 years ago

100% agree with @rkh, In my opinion we can wait some months and stress test it... or instead wait ruby 1.9.3.

nesquena commented 13 years ago

Related (extended) discussion about this same issue. Keep in mind while this started as a Padrino bug, it runs way deeper back to this issue with Tilt:

https://github.com/padrino/padrino-framework/issues/519#comment_1129109

I think we do need to address this. As it stands templates cannot contain UTF-8 characters inside them if Encoding.default_internal is set to Encoding::UTF_8 unless an encoding is set to every single template explicitly. In Sinatra, this is hackily avoided since Encoding.default_internal is set to nil. But that decision will come back to bite us relatively quickly. Tilt needs to respect default_external and not default to parsing every template as ASCII-8BIT.

nesquena commented 13 years ago

Quoting @nex3 in that other thread because this seems to make the most sense:

The UTF-8-over-HTTP problem is going to exist just as much with BINARY templates as with default_external-encoded ones. Fundamentally, you just can't hide from users the fact that they must specify the encodings of their external files.

From a template engine's point of view, Rails' solution is pretty much ideal: use default_external (which defaults to UTF-8) when loading a template, but provide a hook for template engines to recognize encoding comments. If no encoding comment is found and the file isn't valid in default_external, throw an error.

rkh commented 13 years ago

@nex3, @judofyr, @nesquena, @wycats: What about https://gist.github.com/964024#file_tilt.diff (see discussion in Padrion issue).

nesquena commented 13 years ago

I will run this against the simplest failing case I outlined here: https://github.com/padrino/padrino-framework/issues/519#issuecomment-1120895 and see if it works later and report back. Thanks for whipping up this patch.

mislav commented 13 years ago

Agreed with @nesquena. Just got bit by the inability to render CoffeeScript templates which contain multibyte characters (Tilt reads them in as ASCII). Here's the monkeypatch:


Tilt::CoffeeScriptTemplate.class_eval do
  alias original_prepare prepare
  def prepare
    original_prepare
    data.force_encoding @default_encoding
  end
end
nesquena commented 13 years ago

Yeah this is still a lingering problem although admittedly we fixed it hackily by simply no longer setting the correct encodings in 1.9.2. This has unfortunate ramifications but it is better then the alternative from the padrino developers perspective until this gets resolved at the Tilt level.

rtomayko commented 13 years ago

/cc @apotonick, @brianmario

Some more eyeballs on this issue potentially.

Also wanted to make sure everyone saw this link from @nesquena earlier since it has some really good discussion around how much we should rely on Encoding.default_external:

https://github.com/padrino/padrino-framework/issues/519

@rkh How does that patch factor in? Do you feel like that addresses the main issues with template files on disk?

brianmario commented 13 years ago

I mostly agree with @nex3 on https://github.com/padrino/padrino-framework/issues/519#issuecomment-1129156 (also quoted by @nesquena above).

Unless you use an encoding detection library like charlock_holmes, the only way to reliably know what the actual encoding of the template off the filesystem is to have the coder specify it using a magic header. Encoding.default_external should be used as the fallback, and if it blows up I think it's perfectly reasonable to raise that to the user asking them to specify the encoding of the offending template using a magic header. They should only need to do that for templates that aren't compatible with Encoding.default_external.

Then again, a valid point could be made to say that we should never raise an exception but instead just do our best to display the content - even if it ends up corrupt once sent down to the browser. Maybe that could be a configurable switch? raise_encoding_errors or something? I could go either way, but adding more config options feels like it should be a last-resort.

rtomayko commented 13 years ago

Then again, a valid point could be made to say that we should never raise an exception but instead just do our best to display the content - even if it ends up corrupt once sent down to the browser

That decision can still be left up to the app, though, I think. If you wanted non-draconian encoding error handling you'd set things up to use BINARY everywhere, essentially mimicking 1.8 behavior.

This raises a really good point though. If we move to Encoding.default_external as a default for templates read from disk, which is starting to feel right to me, then shit is going to blow up on existing apps since we'll be draconian by default. I'm going to want to do a major rev on the version number if that's the case.

EDIT: s/Encoding.default_encoding/Encoding.default_external/

DAddYE commented 13 years ago

@brianmario 100% agree.

DAddYE commented 13 years ago

I don't know how much cost in terms of performances reading everything in BINARY we should check this; I really hate the magic comment & c. Btw setting Encoding.default_internal or Encoding.default_external is quite simple so the major problem for the developer is to have all files with the same encoding. I think this isn't and bad behavior almost for me. So, my vote is to use Encoding.default_external.

rtomayko commented 13 years ago

I'm leaning toward @rkh's patch with maybe an additional layer where you can override default_external specifically for tilt.

This would let you Tilt.default_external_encoding = 'utf8' if you wanted to ensure templates are treated as utf-8 but don't want to change Encoding.default_external for whatever reason. You could also set Tilt.default_external_encoding = 'BINARY' to get the current behavior.

Tilt 1.4 could ship with Tilt.default_external_encoding = 'BINARY' and Tilt 2.0 would ship with Tilt.default_external_encoding = nil.

nesquena commented 13 years ago

@rtomayko Yeah I think we are all on the same page. The Tilt.default_external_encoding which will eventually default to Encoding.default_external is a fairly safe approach. This allows for a smooth transition from the existing behavior. I am happy to test any branch with these adjustments in Padrino to try and reproduce the error as outlined in the old issue. If it works there with utf-8 encodings set, then that should be a good indicator.

rkh commented 13 years ago

After figuring out the encoding, should we leave it by that or call String#encode (with no arguments) to convert everything to the same encoding internally? It's what Rails does and @wycats recommends, and should ease dealing with templates in different encoding (imagine templates shipping with a gem and such) but it might mess with the all-binary scenario.

brianmario commented 13 years ago

FWIW the Encoding.default_external= method raises a warning with ruby -w. Not sure what ruby-core has planned for it, but that warning kinda makes me feel like they might disable the ability to set it from inside a running app (would have to be set from $LC_TYPE or $LANG I guess?)`. I don't know either way, but just wanted to make us all aware.

rtomayko commented 13 years ago

Blah. Yeah that's a whole other dimension to this I haven't considered. We may need Tilt.default_internal_encoding as well, with semantics similar to Tilt.default_external_encoding.

DAddYE commented 13 years ago

@rtomayko I suggest to keep it more strict only to Encoding.default_external, at the moment I don't know when can be useful have two separate encodings, but is true that we don't break compatibility.

Always use magic comment when found in template (guess the engine needs to support this?)

As said I don't love this, but maybe others would like to use it.

nesquena commented 13 years ago

Interesting, wasn't aware of that warning when setting default_external. We would still be able to access the default encoding type through that even if it isn't expected to be set explicitly. Relying on the default external encoding and encoding everything to that internally strikes me as the most consistent approach that is least likely to be an annoyance for the end user developer as @rkh noted above.

/cc @raggi (would be interested to hear your thoughts)

judofyr commented 13 years ago

Always use magic comment when found in template (guess the engine needs to support this?)

Instead of this I think we should just have a method on Tilt::Template (detect_encoding maybe) which returns nil by default and can be overridden by the template mapper. It's up to the template mapper to decide if it wants to use magic comments or if it just defaults to BINARY.

judofyr commented 13 years ago

After figuring out the encoding, should we leave it by that or call String#encode (with no arguments) to convert everything to the same encoding internally? It's what Rails does and @wycats recommends, and should ease dealing with templates in different encoding (imagine templates shipping with a gem and such) but it might mess with the all-binary scenario.

Well, the output from the template engine isn't guaranteed to be a string. It could be an object or an Array (think Rack-streaming) instead. Beside, it's pretty simple for anyone who uses Tilt to add an #encode at the end.

rkh commented 13 years ago

By the way: "@konstantinhaase any reason not to just extract what rails is doing? I spent a lot of time on it and don't want to deal with the bugs again." - @wycats - http://twitter.com/wycats/status/113458749854842880

rtomayko commented 13 years ago

@rkh Which is ... ?

DAddYE commented 13 years ago

Bugs?

rkh commented 13 years ago

@rtomayko: dunno, haven't looked into it yet.

-- Gesendet von meinem Palm Prē Ryan Tomayko <reply@reply.github.com> schrieb am 13.09.2011 13:02:

@rkh Which is ... ?

Reply to this email directly or view it on GitHub:

https://github.com/rtomayko/tilt/issues/75#issuecomment-2085839

judofyr commented 13 years ago

I think it's this: https://github.com/rails/rails/blob/master/actionpack/lib/action_view/template.rb#L11

brianmario commented 13 years ago

+1 to what rails is doing

rtomayko commented 13 years ago

@judofyr Thanks for running that down. Most of AV's behavior seems to make good sense with Tilt, although I think we need a way to enable / disable some of it.

To summarize AV's behavior:

Determining template source encoding:

Template source transcoding:

Template source to Ruby source conversion:

All that's left is calling the method, which should return a string encoded as default_internal.

AV seems to assume Encoding.default_internal will be set. Tilt needs to work when default_internal is nil, which complicates things somewhat.

One thing that absolutely should happen is: for anything that tilt reads from disk using an IO object it creates, do something like:

data.force_encoding(magic || default_encoding || Encoding.default_external)
data.encode!

Note that the transcode step is a noop when Encoding.default_internal is nil.

Some other things we should consider supporting:

The main issue I'm struggling with in all this is conversion to default_internal of template source strings returned from a reader block. I really hoped we could punt that responsibility to the caller but it doesn't work out. Here's my reasoning:

  1. Evaluating all templates in utf-8 / default_internal regardless of external encoding is a worthwhile feature and makes sense for many environments.
  2. You can't get that behavior from IO's normal external to internal conversion because of magic comments (especially template specific ones).
  3. The caller can't be responsible for conversion because they might not know about template specific magic comments either, nor will they want to deal with them even when they do know.
  4. The only option remaining is for Tilt to be responsible for performing external to internal conversion.

Essentially I think we should use Encoding.default_internal as a flag that says "always convert template source from the detected template source encoding to default_internal" regardless of where the source comes from. Perhaps additionally an option is needed to turn that behavior off at the Template instance level; otherwise, evaluating templates under other encodings becomes impossible when Encoding.default_internal is set.

josh commented 13 years ago

To some extent, encodings should be handled individually by each handler. Not saying tilt shouldn't do anything, but I think each handler class should opt into the magic comment syntax and automatic transcoding. Not everything needs this. The coffee handler only works with utf-8, you can't write CS in anything else. Also sass/less can handle encodings via @charset css directive.

Think the important part is defining the contact for how all handlers should be expected to behave.

Ryan's outline looks good.

rtomayko commented 13 years ago

@josh Good to know. Yeah, I think we can get some basic stuff into Template without too much work but then we're really going to need to audit each template engine individually.

apotonick commented 13 years ago

On Thu, Sep 15, 2011 at 2:50 AM, Ryan Tomayko reply@reply.github.com wrote:

@josh Good to know. Yeah, I think we can get some basic stuff into Template without too much work but then we're really going to need to audit each template engine individually.

Are you talking about Tilt::Template or AV::Template here?

Reply to this email directly or view it on GitHub: https://github.com/rtomayko/tilt/issues/75#issuecomment-2099903

rtomayko commented 13 years ago

A more concrete proposal with code: #107

argent-smith commented 12 years ago

Sorry guys that I suddenly appear from my bear lair (actually I've encountered the problem yesterday and saw that almost everybody stopped discussing it by May), but ... can anybody say how to conquer this issue when I have it with Padrino stack? I use haml/markdown/ruby_vars in UTF-8. How to settle them all?

nesquena commented 12 years ago

@argent-smith Try:

# config/boot.rb
Padrino.after_load do
  Encoding.default_internal = nil
end

does that help?

argent-smith commented 12 years ago

It helped until I used the single UTF-8 encoded variable. So now I have UTF-8 haml rendered well and "ncompatible character encodings: UTF-8 and ASCII-8BIT" when I try to do "= utf8_string_returning_method" in the haml template.

nesquena commented 12 years ago

The next step would be to try and create a simple stripped down test case of this failure, preferably in Sinatra as a reference. Then if you can, check out this tilt patch from @rtomayko: https://github.com/rtomayko/tilt/pull/107/files, and see if with that applied the issue goes away.

argent-smith commented 12 years ago

I'll finish the test suite by tomorrow.

argent-smith commented 12 years ago

See my test suite argent-smith/sinatra-test.

It looks like on clean Sinatra with stock Tilt the issue isn't reproduced: everything goes just fine. I'll make yet another test suite with Padrino and tell you what happens.

argent-smith commented 12 years ago

Well, after adding markdown to my test I've reproduced the issue. Actually markdown was in my app when I've first encountered the issue. The error doesn't depend on a particular markdown implementation: I've tried kramdown and redcarpet with the same result. For the further details see my test suite mentioned above.

nesquena commented 12 years ago

Great, I was hoping you could isolate this test with sinatra alone, since Padrino doesn't really mess with encodings. So the above repo fails with stock Tilt with Sinatra? Can you try again with the patched Tilt that Ryan has a pull request for and see if that makes the issue go away?

argent-smith commented 12 years ago

I'll try it ASAP

argent-smith commented 12 years ago

@nesquena @rtomayko I've got a conflict when tried to merge encodings to master: see /lib/tilt/template.rb in gist 1374322. Would you tell me how to resolve it?

djanowski commented 12 years ago

So... are we going to honor Encoding.default_external?

argent-smith commented 12 years ago

@djanowski doesn't help in my case (xcuse my stupidity ;-)

apohllo commented 12 years ago

Any progress with that issue?