gregjacobs / Autolinker.js

Utility to Automatically Link URLs, Email Addresses, Phone Numbers, Twitter handles, and Hashtags in a given block of text/HTML
MIT License
1.48k stars 238 forks source link

Strange protocol cut #67

Closed puzrin closed 9 years ago

puzrin commented 9 years ago
 > require('autolinker').link('qqqqqqqqqqqq://hello.world')
'qqq<a href="qqqqqqqqq://hello.world" target="_blank">qqqqqqqqq://hello.world</a>'

Seems several errors at once.

  1. I don't know, if autolinker should support alternate protocol schemas (skype, ftp, news, irc ...). If not - it should not make false positives on example above.
  2. If yes, it should not break protocol name.
gregjacobs commented 9 years ago

Hey, do agree. Autolinker does look for protocols between 3 and 9 characters, so that's why it's getting cut off. Thinking maybe it should just take as many characters as are there

puzrin commented 9 years ago

https://github.com/jonschlinkert/remarkable/blob/dev/lib/common/url_schemas.js

Schema can be longer, and without //

gregjacobs commented 9 years ago

Wow, yeah. I'm also not currently including dashes or dots. Any other characters that are accepted that you happen to know of off-hand?

puzrin commented 9 years ago

Take a look at IANA spec and CommonMark discussion. IANA has a lot of garbage, i'm not sure that all needed. But it has examples for each protocol.

Also, if you plan to rewrite - consider moving text operations to separate module. Here is full list of "questions", we encounted while use with remarkable.

gregjacobs commented 9 years ago

Good to know. Seems like this is the spec on the scheme name: http://tools.ietf.org/html/rfc3986#section-3.1

When you say "text operations," are you looking for just the find/replace part? Judging from https://github.com/jonschlinkert/remarkable/issues/108, I gather that you don't need the html parsing part for your purposes?

Also, if I understand correctly from your use of Autolinker in your source code, are you just using it for its parsing capabilities? Would having a separate Parser class help? (Not sure how quickly I'd be able to implement something like this, but if that's your intention then I can try to move in that direction when working on the project :))

puzrin commented 9 years ago

Yes, we apply autolinker on AST. For remarkable it will be best to have text scanner only (even without replace) + url generator. That will reduce output on browserification. The most convenient would be to have it in separate package. But having a separate "requireable" file will be good too, if you keep stable interface/filename on build version change (..X).

Also, you have suspicious bug about 100% CPU load. That can ddos server if it really happens.

I don't have a bit desire to create one more package - already have a tons to develop & maintain. IMHO, your one is the best of all i seen, but needs some care.

gregjacobs commented 9 years ago

Thanks, but yeah, it def needs some work. I only get to work on it every so often, so there are def some things that have fallen by the wayside. The 100% cpu bug is def an interesting one. Freezes in the regex engine itself. You're totally right about the possibility of a dos attack though, I'll def look into that asap (have a feeling it has to do with the part I added to handle <!doctype> tags)

Would love to help you out in general though, and save you from writing your own. I'll try to devote some time this week to fixing a few of these. The one thing I'm unfamiliar with though is non-english chars in urls. Do you have any experience with this?

gregjacobs commented 9 years ago

Hey, FYI, just published v0.14.0, which should match the full scheme names. Give that a try and let me know.

puzrin commented 9 years ago

The one thing I'm unfamiliar with though is non-english chars in urls. Do you have any experience with this?

Full support with surrogate chars is not easy, but possible.

Personally, i don't like recursive expressions, because those are too easy to ddos. For example, with long patterns like ((((((((((((((((((((((((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))))))))))))). In pair with unicode support it can be more easy to write scanner manually.

https://github.com/ljosa/urlize.js - one more package, trying to solve the same problem in correct way, but with similar problems in unicode support. May be useful for url scanner redesign.

puzrin commented 9 years ago

Hey, FYI, just published v0.14.0, which should match the full scheme names. Give that a try and let me know.

will it match javasctipt/vbscript ? that would be a problem (not for remarkable but for your users). Also keep in mind, that those can be masked by ulr-encoding https://github.com/jonschlinkert/remarkable/blob/dev/test/fixtures/remarkable/xss.txt#L73

puzrin commented 9 years ago

https://gist.github.com/puzrin/ce95a25581a4d069e173

That's only my attempt to separate data abstraction levels and summary of known caveats. Done for personal memos, but can be useful for you too.

Reason why separation can be useful:

gregjacobs commented 9 years ago

will it match javasctipt/vbscript ?

Yes :( I guess I should explicitly ignore those.

Interesting info otherwise though. Some thoughts:

What I'm thinking is that I'll work on fixing the current bugs, and toward extracting a separate module that performs the search/replace. That part should definitely be able to be abstracted. Might make it easy to introduce the unicode char support at that time too.

puzrin commented 9 years ago

May be, as temporary solution, option to skip html parse? I guess, regexp bug is not in matcher.

gregjacobs commented 9 years ago

Hey, so I fixed the initial issue of this ticket in v0.14.1. Going to close this ticket for now, and will open new ones for the other items.

For skipping the HTML parse, I'm thinking that I would prefer not to add another option, but give me a day or two, I might be able to figure out this regex bug. Otherwise, I definitely want to give you a straight interface to the search/replace functionality that will be independent of the html parsing.

puzrin commented 9 years ago

No prob.

Just for info, sindresorhus adviced this package https://github.com/webmodules/urlregexp . It's a bit big (5kb gzipped), but very useful to understand IDL support and logic.

This can be rewritten to parse without regexp, to optimize speed.

gregjacobs commented 9 years ago

Wow, yeah, quite a regex. Def could help though, thanks.

Kagami commented 9 years ago

Hi. I see you guys here trying to filter danger protocols like "javascript" and "vbscript". Currently this is done by this code: return ( uriScheme !== 'javascript:' && uriScheme !== 'vbscript:' );. I don't think this is a mature solution, I believe the only way to deal with scheme filtering is a whitelist (maybe customized by user). E.g. look at this scary big list: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet (all this case-insensitive and inserting meaningless characters technics). I guess many of them don't work in modern browsers but it's better to be safe than sorry.