linkeddata / cimba

Client-Integrated Micro-Blogging Architecture application
MIT License

Detect Hyperlinks #3

Closed: melvincarvalho closed this issue 10 years ago

melvincarvalho commented 10 years ago

It would be nice if posts could detect hyperlinks and make them clickable.

I'll have a look to see if I can find a js library that does this.

sandhawke commented 10 years ago

This actually opens some tricky side issues related to HTML content. Hrm.

melvincarvalho commented 10 years ago

I noticed it's already implemented (this morning) :) [ Closed Issue ]

Having thumbnail previews, however, would be really cool. Not sure it's possible without a server/CDN tho ...

sandhawke commented 10 years ago

On 03/19/2014 10:58 PM, melvincarvalho wrote:

> I noticed it's already implemented (this morning) :) [ Closed Issue ]
>
> Having thumbnail previews, however, would be really cool. Not sure it's possible without a server/CDN tho ...

— Reply to this email directly or view it on GitHub https://github.com/social-webarch/cimba/issues/3#issuecomment-38105930.

Should the thumbnail be generated by the sender or the receiver? Much more efficient and appropriate to have it be the sender, but do we trust them?

deiu commented 10 years ago

There is a safe filter I've added, that just creates http links. It doesn't parse any other html.
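
A minimal sketch of the kind of filter described here, not cimba's actual code: the whole post is escaped as plain text first, and only `http://` and `https://` URLs are then wrapped in anchors. The name `linkify`, the URL pattern, and the trailing-punctuation rule are assumptions made for illustration, and Python is used only to keep the example self-contained.

```python
import html
import re

# Assumption: only http:// and https:// URLs become links; nothing else is parsed.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
# Assumption: punctuation at the very end of a URL belongs to the sentence, not the link.
TRAILING = ".,;:!?)"


def linkify(text: str) -> str:
    """Escape a plain-text post as HTML and make its http(s) URLs clickable."""
    parts = []
    pos = 0
    for match in URL_RE.finditer(text):
        url = match.group(0).rstrip(TRAILING)
        end = match.start() + len(url)
        # Everything between links is escaped, so posted markup is shown as text.
        parts.append(html.escape(text[pos:match.start()]))
        safe_url = html.escape(url, quote=True)
        parts.append(f'<a href="{safe_url}">{safe_url}</a>')
        pos = end
    parts.append(html.escape(text[pos:]))
    return "".join(parts)
```

Because escaping happens before any markup is generated, a post such as `<script>alert(1)</script>` comes out as visible text rather than as an element, and schemes other than http(s), such as `javascript:`, are never turned into links.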

sandhawke commented 10 years ago

On 03/19/2014 11:07 PM, Andrei wrote:

> There is a safe filter I've added, that just creates http links. It doesn't parse any other html.

Yeah, if the message is HTML we've got a real problem. As far as I've been able to glean, caja is the only acceptable way to handle 3rd party HTML. We could try using it, but it seems like such a poor architecture...

slorber commented 10 years ago

@sandhawke @deiu we use wysihtml5 rich text editor at Stample and it works fine.

It handles copy/paste/sanitize of HTML content inside an element. You can use it with a toolbar, or without one, like a normal textarea.

You can define custom parser rules to tell the editor which HTML elements and classes you want to keep in the sanitized HTML. https://github.com/xing/wysihtml5/blob/master/parser_rules/advanced.js

It automatically handles copy/paste of html content with links too normally, while I'm not sure you can do that with a normal textarea. Check the demo here: http://xing.github.io/wysihtml5/examples/advanced.html and try to copy some html content in it.

sandhawke commented 10 years ago

The HTML production is only a usability concern. This may address that. What I'm worried about is the consumption of HTML. What happens if someone posts some malicious HTML to their microblog and cimba tries to display it for some reader? If the malicious HTML gets control, it could do things like make cimba post some viral and malicious content on the reader's own microblog. My understanding from experts is that sanitizing HTML is extremely hard, and caja is the only software that does even a plausible job.

   -- Sandro

On 03/20/2014 11:38 AM, Sébastien Lorber wrote:

> @sandhawke @deiu we use wysihtml5 rich text editor at Stample and it works fine.
>
> It handles copy/paste/sanitize of HTML content inside an element. You can use it with a toolbar, or without one, like a normal textarea.
>
> You can define custom parser rules to tell the editor which HTML elements and classes you want to keep in the sanitized HTML. https://github.com/xing/wysihtml5/blob/master/parser_rules/advanced.js
>
> It automatically handles copy/paste of html content with links too normally, while I'm not sure you can do that with a normal textarea. Check the demo here: http://xing.github.io/wysihtml5/examples/advanced.html and try to copy some html content in it.

slorber commented 10 years ago

Wysihtml5 automatically cleans malicious JS code like script elements etc.

It is really not hard to sanitize HTML content. It is better to use a tool that blocks all HTML by default and lets you whitelist specific elements (as wysihtml5 does). Unless you whitelist script, it won't be a problem. You can whitelist only very specific elements and attributes, like span, div, img, href...

Note that Caja seems to work only with JS. Sanitizing html content for security reasons should be done on the backend server because one could modify the browser's code (and could manually remove the sanitizer). There are many tools to do that on the backend, like JSoup, JTidy or NekoHTML for the Java JVM.

The only point of sanitizing in the browser is that you could auto-sanitize the content as the user adds new HTML content. You could use the same sanitizing logic in the browser and on the server. This lets you give the user a preview of what will really be saved on the server, so that the content in the editor looks like what others will finally see once they render the HTML content.

So:

I don't really know who your experts are, but sanitizing is really not hard: it's just parsing HTML into a DOM and filtering elements, nothing more. Many tools have done that job well for years, trust me, and they also work well with malformed HTML content.

The only thing I've never seen is a high-performance sanitizer based on streaming APIs like StAX.

slorber commented 10 years ago

As it seems hard to create a distributed backend sanitizing system for RWW (unless some sanitizing protocol is added to standards?), maybe it's possible to not sanitize anything on the backend (so potentially persist insecure HTML content), and it could be the responsibility of the browser to sanitize the HTML before adding it to the browser DOM.

presbrey commented 10 years ago

If the "message" is HTML, you PUT message.html to your PDS and add a triple linking to the HTML content. Then the client shows the HTML URL in an iframe and normal CORS rules apply.

sandhawke commented 10 years ago

On 03/20/2014 07:27 PM, presbrey wrote:

> If the "message" is HTML, you PUT message.html to your PDS and add a triple linking to the HTML content. Then the client shows the HTML URL in an iframe and normal CORS rules apply.

It seems like putting every cimba message in an iframe could cause problems with the appearance of the app, but maybe it would work. I don't know how styling works with iframes. Also, the way you're suggesting it, it would have origin=your PDS, which isn't where I'd expect to find malicious code. Aren't there situations where we want to be trusting that origin?

deiu commented 10 years ago

AngularJS has mechanisms to sanitize the html content in posts. That's not an issue. I think the only question at this point is whether we want to enable this feature or not. The immediate benefit would be that we can have rich posts (including markdown support).

sandhawke commented 10 years ago

On 03/20/2014 07:52 PM, Andrei wrote:

> AngularJS has mechanisms to sanitize the html content in posts. That's not an issue. I think the only question at this point is whether we want to enable this feature or not. The immediate benefit would be that we can have rich posts (including markdown support).

Oh good, the HTML sanitizing landscape has improved since I last looked.

Yeah, so now the question is (1) what tags and elements should we allow (if any), and (2) what does it do to interoperability if we allow html? Very interesting questions. It might be good practice for the poster to include a plaintext version along with the HTML, at least.

   -- Sandro

deiu commented 10 years ago

The tags don't really matter. The sanitizer is smart enough to forbid dangerous elements (i.e. scripts). From an interop p.o.v., I think most blog websites use html in the body, so being able to display that would be nice!

sandhawke commented 10 years ago

On 03/20/2014 09:18 PM, Andrei wrote:

> The tags don't really matter. The sanitizer is smart enough to forbid dangerous elements (i.e. scripts). From an interop p.o.v., I think most blog websites use html in the body, so being able to display that would be nice!

The tags do matter for interoperability. If some clients allow different elements than others, authoring content will be very hard. And there are lots of hard decisions to make. Can people authoring content use iframes? Can they use frames? Can they use images? Can they use forms? RDFa? Microdata? Headers and footers? keygen? Okay, probably not keygen. Can they set the font size to 144 point, and blinking?

Perhaps we could start with a minimal set like b, i, p, and blockquote, all with no attributes. And whatever might be needed for proper bidi text. And an outer div/span element with a lang attribute, so we know the language of the message.
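
To make the proposal concrete, here is a hedged sketch of an allowlist filter for that minimal set (b, i, p, blockquote, no attributes, wrapped in an outer div carrying a lang attribute). It is written in Python with the standard-library `html.parser` purely for illustration. The names `AllowlistSanitizer` and `sanitize`, the default language, and the choice to discard script and style content entirely are assumptions; bidi handling is left out. This is not what cimba or AngularJS actually does.

```python
import html
from html.parser import HTMLParser

# Assumption: the minimal element set proposed above; every attribute is dropped.
ALLOWED_TAGS = {"b", "i", "p", "blockquote"}
# Assumption: for these elements the text content is discarded, not just the tags.
DROP_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []   # allowed elements currently open in the output
        self.skip_depth = 0   # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attrs are deliberately ignored
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags and self.skip_depth == 0:
            # Close back to the matching element so the output stays well nested.
            while self.open_tags:
                closed = self.open_tags.pop()
                self.out.append(f"</{closed}>")
                if closed == tag:
                    break

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(html.escape(data))


def sanitize(fragment: str, lang: str = "en") -> str:
    """Keep only allowlisted elements and wrap the result in <div lang=...>."""
    parser = AllowlistSanitizer()
    parser.feed(fragment)
    parser.close()
    # Close anything the author left open.
    tail = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    body = "".join(parser.out) + tail
    return f'<div lang="{html.escape(lang, quote=True)}">{body}</div>'
```

With this sketch, anything outside the allowlist (images, forms, iframes, event-handler attributes) simply disappears while its text content is kept as escaped text, so widening the set later means adding names to `ALLOWED_TAGS` rather than enumerating what to forbid.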

There will be a lot of call for images, but images let the sender track the reader more than is often expected or desired.

deiu commented 10 years ago

The sanitizer takes care of most (all?) scenarios. For example, it will display images (though we may want to set a max width/height in the CSS file to avoid unexpected results such as pictures going outside the post div), but it will NOT display embedded JavaScript, iframes, form elements, blinking text, huge font sizes, etc.

I could probably test more cases but so far everything seems to be OK.