jarrett / rbbcode

Converts BBCode to HTML. Gracefully handles invalid input. Built on Treetop.
41 stars 9 forks source link

Preserve invalid tags? #6

Closed cheald closed 11 years ago

cheald commented 14 years ago

Howdy. I'm looking for some mechanism to tell rbbcode to not strip out unrecognized markup. Currently, anything in square brackets tends to get stripped, including stuff in [code] tags and the like; I'd like it to be preserved it it's not a registered tag. Does any such option exist?

jarrett commented 14 years ago

RbbCode is very strict by design. It's pretty much a police state. I believe giving users access to any kind of markup is a very dangerous game, so you have to be conservative and use a whitelist approach.

Thus, all unrecognized tags are stripped. This cannot be turned off, as it's baked deeply into the design. RbbCode isn't a find-and-replace engine. I.e. it doesn't substitute HTML tags for BBCode tags. Instead, it parses the BBCode tags into a document tree and then generates HTML from that. Thus, tags only appear in the output if the parser knows how to handle them.

Luckily, RbbCode is customizable. You can define your own tags and write methods for converting them to HTML. If there's a particular tag you need that's not supported out of the box, you can easily add support for it.

Code tags are supposed to work out of the box, though. Can you provide a failing example for code tags?

cheald commented 14 years ago

I apologize for the terse initial report - I filed it as I was running out for the evening.

I'm aware that the technique RbbCode uses isn't quite so blunt as a search-and-replace, but it does cause problems in practical usage; a whitelist isnt needed in the specific case of BBCode. I chose BBCode specifically because allowing arbitrary BBCode isn't dangerous, as it isn't natively interpreted by browsers. I can let a user say [blazarg script="omg_xss_hax()"] and the browser isn't going to do anything dangerous with it.

I've implemented a few of my own tags already, and they work marvelously, but the problems is that I'm getting non-tag content (which looks like a bbcode tag) being stripped. For example:

"[Smith] said that there is only one way to go"

This is commonly used to provide additional context for a quote when the original quote doesn't make sense on its own; RbbCode interprets this as:

" said that there is only one way to go"

The valid (non-tag) content, [Smith], is stripped from the parsed result, because RbbCode doesn't know what to do with it. Additionally, parsing continues inside of [code] blocks:

[code]
"[Smith] said that there is only one way to go"
[/code]

Results in:

<pre rel='code'>"
 said that there is only one way to go"
</pre>

Obviously, this makes RbbCode somewhat unusable in many practical scenarios, because I can't assume that my users will only use square brackets to denote markup; they might be used in a code example (to demonstrate a Ruby array, perhaps), or in a quotation, or in many other circumstances. I really like the library, but this is a dealbreaker for me - if I can't manage to get it to preserve non-recognized tags in the output, I'll have to use a different solution.

cheald commented 14 years ago

I've worked around it in my code, somewhat, by just preprocessing the text string before parsing it with RbbCode. I'm not pleased with the solution, but it works for now. Ideally, I'd like to see an option to tell RbbCode to simply interpret any BBCode tags it doesn't recognize as text tokens.

def prepare_text(text, schema)
  tags = schema.instance_variable_get :@allowed_tags
  text.strip.gsub(/\[([^\/][^\] =]+)(.*?)\]/) do |match|
    if !tags.include?($1.downcase) then
      match.gsub(/^\[/, "&#091;")
    else
      match
    end
  end
end
jarrett commented 14 years ago

I see your point now. That's interesting. You're right that arbitrary text inside square brackets wouldn't be a security flaw.

My thought was that BBCode should work like HTML or XML: anything that looks like markup should be treated as such. And I think I want to continue approaching it that way.

But, that raises the question of how people are supposed to escape their square brackets. I hadn't given it any thoughts until now. What do you think?

"they might be used in a code example"--fortunately, RbbCode knows to treat anything inside a code tag as a literal string. So this shouldn't be a problem. Let me know if I'm wrong about that,

cheald commented 14 years ago

I think that I disagree with the "anything that looks like markup is treated like such" approach. BBCode is specifically targeted towards a fairly low common denominator, and is generally employed in places where you don't want people to have to worry about escaping input and such. There isn't any real benefit for greedily parsing everything that looks like bb markup, but there is a specific drawback to it.

RbbCode is not respecting the source of code tags as a literal string. It is stripping what it thinks is markup.

>> RbbCode::Parser.new.parse("[code]classes_to_parse = [User][/code]")
=> "<p><code>classes_to_parse = </code></p>"

As far as escaping, if you're going to go that route, I'd have it respect [[foobar]] as an escaped tag, I think. That's easier to visually parse than [foobar] or whatnot, and is a bit easier to educate users about.

jarrett commented 14 years ago

You make a good case. I think you've pretty much sold me on your point of view.

It should be possible to have the parser spit out unrecognized tags as markup. I think I'll have to modify Schema#valid_in_context? to return true for unrecognized tags, and HtmlMaker to render unrecognized tags as literal text.

Thanks for pointing out the [code] issue. I thought I had that working--see the "treats tags other than the closing tag as literals" example in parser_spec.rb. I think the issue is that it's treating recognized tags as literals, but (stupidly) dropping unrecognized tags. There's no good reason for that, so it's a bug that needs to be fixed.

I don't have a lot of time to work on this now, but I hope to be able to get to this sometime soon.

iGEL commented 13 years ago

Any news about this? I agree with cheald, invalid tags should be preserved.

jarrett commented 13 years ago

Sorry folks--it's been forever since I worked on this. I just haven't had a need for this gem lately, and I haven't had time to work on it. I can't make any promises about when this stuff will get fixed.

In the meantime, you're certainly welcome to fork it and try to implement these changes. I'm not sure what the best approach would be. I hope it won't require a major architectural change.

jarrett commented 11 years ago

A new version based on Treetop will be coming out soon. This is a total rewrite, so all current issues will be closed.

The new version will probably pass through unrecognized tags verbatim. So that should accomplish exactly what you're hoping for.