Strange protocol cut - Githubissues

puzrin commented 9 years ago

 > require('autolinker').link('qqqqqqqqqqqq://hello.world')
'qqq<a href="qqqqqqqqq://hello.world" target="_blank">qqqqqqqqq://hello.world</a>'

Seems several errors at once.

I don't know, if autolinker should support alternate protocol schemas (skype, ftp, news, irc ...). If not - it should not make false positives on example above.
If yes, it should not break protocol name.

gregjacobs commented 9 years ago

Hey, do agree. Autolinker does look for protocols between 3 and 9 characters, so that's why it's getting cut off. Thinking maybe it should just take as many characters as are there

puzrin commented 9 years ago

https://github.com/jonschlinkert/remarkable/blob/dev/lib/common/url_schemas.js

Schema can be longer, and without //

gregjacobs commented 9 years ago

Wow, yeah. I'm also not currently including dashes or dots. Any other characters that are accepted that you happen to know of off-hand?

puzrin commented 9 years ago

Take a look at IANA spec and CommonMark discussion. IANA has a lot of garbage, i'm not sure that all needed. But it has examples for each protocol.

Also, if you plan to rewrite - consider moving text operations to separate module. Here is full list of "questions", we encounted while use with remarkable.

gregjacobs commented 9 years ago

Good to know. Seems like this is the spec on the scheme name: http://tools.ietf.org/html/rfc3986#section-3.1

When you say "text operations," are you looking for just the find/replace part? Judging from https://github.com/jonschlinkert/remarkable/issues/108, I gather that you don't need the html parsing part for your purposes?

Also, if I understand correctly from your use of Autolinker in your source code, are you just using it for its parsing capabilities? Would having a separate Parser class help? (Not sure how quickly I'd be able to implement something like this, but if that's your intention then I can try to move in that direction when working on the project :))

puzrin commented 9 years ago

Yes, we apply autolinker on AST. For remarkable it will be best to have text scanner only (even without replace) + url generator. That will reduce output on browserification. The most convenient would be to have it in separate package. But having a separate "requireable" file will be good too, if you keep stable interface/filename on build version change (..X).

Also, you have suspicious bug about 100% CPU load. That can ddos server if it really happens.

I don't have a bit desire to create one more package - already have a tons to develop & maintain. IMHO, your one is the best of all i seen, but needs some care.

gregjacobs commented 9 years ago

Thanks, but yeah, it def needs some work. I only get to work on it every so often, so there are def some things that have fallen by the wayside. The 100% cpu bug is def an interesting one. Freezes in the regex engine itself. You're totally right about the possibility of a dos attack though, I'll def look into that asap (have a feeling it has to do with the part I added to handle <!doctype> tags)

Would love to help you out in general though, and save you from writing your own. I'll try to devote some time this week to fixing a few of these. The one thing I'm unfamiliar with though is non-english chars in urls. Do you have any experience with this?

gregjacobs commented 9 years ago

Hey, FYI, just published v0.14.0, which should match the full scheme names. Give that a try and let me know.

puzrin commented 9 years ago

The one thing I'm unfamiliar with though is non-english chars in urls. Do you have any experience with this?

Full support with surrogate chars is not easy, but possible.

use xregexp will solve problem, but significantly increase size for browser.
it's possible to extract expression for Letter class (~2kb), and do/generate other things manually.
see more generic validation expression with international chars support https://github.com/flatiron/revalidator/blob/master/lib/revalidator.js#L152 (will not be ok here but give idea what happens)
search "javascript regexp unicode" to find info about all caveats.

Personally, i don't like recursive expressions, because those are too easy to ddos. For example, with long patterns like ((((((((((((((((((((((((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))))))))))))). In pair with unicode support it can be more easy to write scanner manually.

https://github.com/ljosa/urlize.js - one more package, trying to solve the same problem in correct way, but with similar problems in unicode support. May be useful for url scanner redesign.

puzrin commented 9 years ago

Hey, FYI, just published v0.14.0, which should match the full scheme names. Give that a try and let me know.

will it match javasctipt/vbscript ? that would be a problem (not for remarkable but for your users). Also keep in mind, that those can be masked by ulr-encoding https://github.com/jonschlinkert/remarkable/blob/dev/test/fixtures/remarkable/xss.txt#L73

puzrin commented 9 years ago

https://gist.github.com/puzrin/ce95a25581a4d069e173

That's only my attempt to separate data abstraction levels and summary of known caveats. Done for personal memos, but can be useful for you too.

Reason why separation can be useful:

Unicode deals will be kept separate. No need to care at html parse phase
I doubt, that it's possible to implement secure html parser easy.It would be more simple to use DOM in browser, and cheerio for node
- minimal total size for browser
- no additional problems with security (parse quality)
possibility to provide additional rules & options (like @twitter names replace) without complicating matcher interface and defaults.

gregjacobs commented 9 years ago

will it match javasctipt/vbscript ?

Yes :( I guess I should explicitly ignore those.

Interesting info otherwise though. Some thoughts:

I'm thinking that I wouldn't want to introduce the overhead of a DOM implementation just for html parsing. Really all that Autolinker is interested in from an HTML perspective is to not auto-link within html tags' attributes, and to not auto-link inside any descendant of an <a> tag. One could use an html purifier after the fact for security purposes, if he/she would like to.
XRegExp looks awesome, thanks for the link. Of course, it is extra size.. I did create an issue a couple of weeks ago to use Google Closure Compiler for minification though, which may reduce the size of Autolinker significantly enough to warrant its use. Otherwise, perhaps extracting just the unicode part might be the way to go.

What I'm thinking is that I'll work on fixing the current bugs, and toward extracting a separate module that performs the search/replace. That part should definitely be able to be abstracted. Might make it easy to introduce the unicode char support at that time too.

puzrin commented 9 years ago

May be, as temporary solution, option to skip html parse? I guess, regexp bug is not in matcher.

gregjacobs commented 9 years ago

Hey, so I fixed the initial issue of this ticket in v0.14.1. Going to close this ticket for now, and will open new ones for the other items.

For skipping the HTML parse, I'm thinking that I would prefer not to add another option, but give me a day or two, I might be able to figure out this regex bug. Otherwise, I definitely want to give you a straight interface to the search/replace functionality that will be independent of the html parsing.

puzrin commented 9 years ago

No prob.

Just for info, sindresorhus adviced this package https://github.com/webmodules/urlregexp . It's a bit big (5kb gzipped), but very useful to understand IDL support and logic.

This can be rewritten to parse without regexp, to optimize speed.

gregjacobs commented 9 years ago

Wow, yeah, quite a regex. Def could help though, thanks.

Kagami commented 9 years ago

Hi. I see you guys here trying to filter danger protocols like "javascript" and "vbscript". Currently this is done by this code: return ( uriScheme !== 'javascript:' && uriScheme !== 'vbscript:' );. I don't think this is a mature solution, I believe the only way to deal with scheme filtering is a whitelist (maybe customized by user). E.g. look at this scary big list: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet (all this case-insensitive and inserting meaningless characters technics). I guess many of them don't work in modern browsers but it's better to be safe than sorry.

gregjacobs / Autolinker.js

Strange protocol cut #67