markdown-it / linkify-it

Links recognition library with full unicode support
http://markdown-it.github.io/linkify-it/
MIT License

Handling of simple hostname with digit at end #34

Closed jugglingcats closed 8 years ago

jugglingcats commented 8 years ago

http://markdown-it.github.io/linkify-it/#t1=http%3A%2F%2Ftest1%2Ftest.pdf

puzrin commented 8 years ago

https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_host_names

Such hostnames are invalid, according to the RFC.

puzrin commented 8 years ago

I see no practical reason to add such a case to the linkifier defaults.

jugglingcats commented 8 years ago

Thanks for the quick response! I don't see anything in the RFC restrictions that prohibits a name like "test1" as a hostname. It says a hostname must consist of letters and digits, and talks about dot delimiters, but gives examples of "unqualified hostname such as csail or wikipedia".

It is quite common to use unqualified hostnames within internal networks and therefore within internal documentation (intranet), and often these hosts will have digits, eg. nexus3, intranet2, etc.

Is there any way to override/customise linkify-it to handle this case (in the context of vanilla markdown-it -- bower version)?

Many thanks for a great library.

puzrin commented 8 years ago

Sure, you can modify everything you wish. All rules and partials are exposed on the class instance, and you can override them with your own. It's best to dig into the code for details.

jugglingcats commented 8 years ago

Any tips on where to look? Can it be done on the linkify-it object, or do I need to implement completely custom linkification? The code is well structured but complex, so the best place to intercept/override is not always clear. I have only been using markdown-it for 24 hours... ;)

puzrin commented 8 years ago

You are right. It seems I've misunderstood the hostname requirements, and this bug should be fixed.

Do I understand correctly that even pure digits without letters are ok? (http://123/foo, http://123.local/foo, http://123.example.com)

> Any tips on where to look?

If you ever decide to customize the rules, look at the existing ones: https://github.com/markdown-it/linkify-it/blob/1.2.1/index.js#L50.

Regex stubs are available in the .re property of the linkifier instance. See what happens in .compile().
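To illustrate the general idea of overriding a regex stub and recompiling, here is a minimal standalone sketch. It does not use linkify-it: `buildMatcher`, the `host` and `path` partials, and their pattern strings are all hypothetical and only model a single-label host after `http://` or `https://`.

```javascript
// Hypothetical partials; these are NOT linkify-it's real pattern strings.
const defaults = {
  // Single-label host whose last character must be a letter (the strict case).
  host: '(?:[a-z0-9][a-z0-9-]*)?[a-z]',
  // Optional path: a slash followed by any non-whitespace characters.
  path: '/[^\\s]*',
};

// Compose the final pattern from the named string fragments ("compile" step).
function buildMatcher(partials) {
  const source = '^https?://(' + partials.host + ')(' + partials.path + ')?$';
  return new RegExp(source, 'i');
}

// Strict matcher built from the defaults.
const strict = buildMatcher(defaults);

// Override only the host partial so that a trailing digit is allowed,
// then rebuild the matcher from the modified set of partials.
const relaxed = buildMatcher({
  ...defaults,
  host: '[a-z0-9](?:[a-z0-9-]*[a-z0-9])?',
});

console.log(strict.test('http://test1/test.pdf'));
console.log(relaxed.test('http://test1/test.pdf'));
```

The point of the design is that each fragment stays a plain string until the final pattern is built, so a single fragment can be swapped without rewriting the rest.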

jugglingcats commented 8 years ago

I don't know if just '123' as a hostname is valid according to the RFC, but I assume someone could put it in their private DNS or even in a hosts file entry.

There are definitely pure 'digit' hostnames out on the web: 999.com is resolvable.

And hostnames starting with digits are quite common, eg: https://123-reg.co.uk.
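The hostnames mentioned in this thread can be checked against the usual label syntax (letters, digits and hyphens, no leading or trailing hyphen, 1 to 63 characters per label, as relaxed by RFC 1123 to allow a leading digit). This is a minimal sketch, not linkify-it code; `isHostnameSyntax` is a hypothetical helper, and it checks syntax only. Whether an all-digit name such as 123 should be linkified is a separate policy question that it does not settle.

```javascript
// One label: letters, digits and hyphens, no leading or trailing hyphen,
// between 1 and 63 characters long.
const LABEL = /^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/i;

// Hypothetical helper: checks label syntax only, not DNS resolvability.
function isHostnameSyntax(name) {
  if (name.length === 0 || name.length > 253) return false;
  // Every dot-separated label must match; an empty label (as in "a..b") fails.
  return name.split('.').every((label) => LABEL.test(label));
}

for (const name of ['test1', 'nexus3', '999.com', '123-reg.co.uk', '-bad', 'a..b']) {
  console.log(name, isHostnameSyntax(name));
}
```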

Thanks for the code pointers. Will take a look...

puzrin commented 8 years ago

As far as I understand, pending changes are:

Thoughts? Could this give false positives?

jugglingcats commented 8 years ago

Your call, but I'm not sure why you wouldn't treat http://999/index.html as a link, or even just http://999 for that matter. They look like clear links to me...

As for // - I wouldn't use it myself, especially if there are different linkify rules for it.

puzrin commented 8 years ago

Ok, reasonable.

jugglingcats commented 8 years ago

Looks good to me!

It should work for all allowed protocols (for example, we are using linkify.add("rest:") to support custom link types).

puzrin commented 8 years ago

If you have time, please post here any links that are not working now and should be fixed.

jugglingcats commented 8 years ago

So far the only one I have found is the trailing digit in the hostname. The linkifier even works with REST paths with variables, such as http://markdown-it.github.io/linkify-it/#t1=http%3A%2F%2Ftest%2Ftest%2F%7Bname%7D.

But I will post any more I find. Thank you.

puzrin commented 8 years ago

There is a problem with a link like this one if we allow domain parts to consist entirely of digits:

https://www.google.ru/maps/@59.9393895,30.3165389,15z?hl=ru

The part before @ is treated as the username and 59.9393895 as the domain. Then the "," terminates the scan because the rest is not a valid path.

We need to fix the situation with @ somehow.
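For comparison, in a full URL the user info can only appear in the authority part, before the first "/", so an @ inside the path is not a username separator. The sketch below shows this with the WHATWG `URL` class built into Node.js and browsers. It is not linkify-it code and says nothing about how the linkifier scans text; `splitUrl` is a hypothetical helper, and example.com with the user name `user` is made up for the contrast case.

```javascript
// Hypothetical helper: split a full URL into the parts relevant here,
// using the standard WHATWG URL parser (a global in Node.js and browsers).
function splitUrl(text) {
  const u = new URL(text);
  return { username: u.username, hostname: u.hostname, pathname: u.pathname };
}

// The @ is in the path, so no username is reported.
const maps = splitUrl('https://www.google.ru/maps/@59.9393895,30.3165389,15z?hl=ru');
console.log(maps);

// Here the @ is in the authority, so the part before it is the username.
const withUser = splitUrl('https://user@example.com/a@b');
console.log(withUser);
```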