markdown-it / linkify-it

Links recognition library with full unicode support
http://markdown-it.github.io/linkify-it/
MIT License

Links with "_" in the domain name are not regarded as links #95

Open ZibanPirate opened 3 years ago

ZibanPirate commented 3 years ago

what is the issue?

Links with "_" in the domain name, e.g.:

are not regarded as links, which is not correct; see: https://stackoverflow.com/a/2183140/8113942

The same goes for fuzzy links, e.g.:

rlidwka commented 2 years ago

As far as I've been able to research, api_stage.dzcode.io is an alias for api-stage.dzcode.io, and dz_code.io simply isn't a thing.

Please provide an example of widely used domains with underscores in them.

Underscores in domain names are very rare because:

Linkify-it isn't meant to find every single link (which is impossible), so we have to restrict ourselves to the most common cases. I'm not sure domains with underscores are worth supporting, especially given the false positives they could introduce in fuzzy links.

domakas commented 7 months ago

Is it possible to get this resolved already? It seems like we are discussing whether this is a valid case or not, but it's obvious that there are cases like this around the web. This library has 100% test coverage, so it should be safe to add this change without worrying that it would break something. We've heard "false-positive potential" mentioned before, but what exactly are the cases that could be false positives?

Another option that gets suggested is to use onCompile to override the src_domain regexp. However, since most of the regexps depend on one another, this simple change has to be applied like this:

LinkifyIt.prototype.onCompile = function onCompile() {
  const re = this.re;
  const text_separators = '[><\uff5c]';

  re.src_domain =
    '(?:' +
    re.src_xn +
    '|' +
    '(?:' + re.src_pseudo_letter + ')' +
    '|' +
    '(?:' + re.src_pseudo_letter + '(?:-|_|' + re.src_pseudo_letter + '){0,61}' + re.src_pseudo_letter + ')' +
    ')';

  re.src_host =
    '(?:' +
    '(?:(?:(?:' + re.src_domain + ')\\.)*' + re.src_domain/* _root */ + ')' +
    ')';

  re.tpl_host_fuzzy =
    '(?:' +
    re.src_ip4 +
    '|' +
    '(?:(?:(?:' + re.src_domain + ')\\.)+(?:%TLDS%))' +
    ')';

  re.src_host_strict =
    re.src_host + re.src_host_terminator;

  re.tpl_host_fuzzy_strict =
    re.tpl_host_fuzzy + re.src_host_terminator;

  re.src_host_port_strict =
    re.src_host + re.src_port + re.src_host_terminator;

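  re.tpl_host_port_fuzzy_strict =
    re.tpl_host_fuzzy + re.src_port + re.src_host_terminator;

  // Rebuild the no-IP variants as well; otherwise tpl_link_no_ip_fuzzy
  // below would still be built from the original src_domain.
  re.tpl_host_no_ip_fuzzy =
    '(?:(?:(?:' + re.src_domain + ')\\.)+(?:%TLDS%))';

  re.tpl_host_port_no_ip_fuzzy_strict =
    re.tpl_host_no_ip_fuzzy + re.src_port + re.src_host_terminator;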

  re.tpl_email_fuzzy =
    '(^|' + text_separators + '|"|\\(|' + re.src_ZCc + ')' +
    '(' + re.src_email_name + '@' + re.tpl_host_fuzzy_strict + ')';

  re.tpl_link_fuzzy =
    '(^|(?![.:/\\-_@])(?:[$+<=>^`|\uff5c]|' + re.src_ZPCc + '))' +
    '((?![$+<=>^`|\uff5c])' + re.tpl_host_port_fuzzy_strict + re.src_path + ')';

  re.tpl_link_no_ip_fuzzy =
    '(^|(?![.:/\\-_@])(?:[$+<=>^`|\uff5c]|' + re.src_ZPCc + '))' +
    '((?![$+<=>^`|\uff5c])' + re.tpl_host_port_no_ip_fuzzy_strict + re.src_path + ')';

};
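
To show the core of that override in isolation, here is a minimal, self-contained sketch of the label change. It is not linkify-it's actual code: `pseudoLetter` is an ASCII-only stand-in I'm assuming for `re.src_pseudo_letter` (the real pattern covers Unicode), and `buildLabel` / `buildHost` are hypothetical helper names used only for illustration.

```javascript
// ASCII-only stand-in for linkify-it's re.src_pseudo_letter.
// Assumption for illustration: the real pattern also covers Unicode letters.
const pseudoLetter = '[a-z0-9]';

// Build the pattern for one domain label. `extra` is an optional additional
// character allowed in the middle of a label, besides "-".
// (hypothetical helper, mirrors the shape of src_domain without the xn-- branch)
function buildLabel(extra) {
  const inner = '(?:-|' + (extra ? extra + '|' : '') + pseudoLetter + ')';
  return '(?:' + pseudoLetter + '|' +
    '(?:' + pseudoLetter + inner + '{0,61}' + pseudoLetter + '))';
}

// Build an anchored host pattern: one or more "label." followed by a label.
// (hypothetical helper, much simpler than the real src_host / tpl_host_fuzzy)
function buildHost(extra) {
  const label = buildLabel(extra);
  return new RegExp('^(?:' + label + '\\.)+' + label + '$', 'i');
}

const strictHost = buildHost('');    // current behaviour: "-" only
const relaxedHost = buildHost('_');  // proposed behaviour: "-" or "_"

const samples = [
  'api-stage.dzcode.io',
  'api_stage.dzcode.io',
  '_leading.example.com',
];

for (const s of samples) {
  console.log(s, 'strict:', strictHost.test(s), 'relaxed:', relaxedHost.test(s));
}
```

In this sketch only the inner character class of a label changes, so a label still has to start and end with a letter or digit. Everything else in the snippet above exists just to re-derive the templates that were already built from the old `src_domain` string.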

I don't think that's maintainable on our codebase.

I actually see a couple of options here:

  1. Merge https://github.com/markdown-it/linkify-it/pull/96 which adds test coverage for these cases and fixes the issue.
  2. Make this library extendable/configurable in a better way, one that doesn't require keeping half of the regexp codebase on the consumer side, while maintaining backwards compatibility.

Please make some kind of decision, as doing nothing and ignoring open-source community issues for years is not a valid solution.