tsutsu3 / linkify-it-py

Links recognition library with full unicode support

Linkify http:foobar.com, http:/foobar.com, http:///foobar.com, etc. #15

Closed Yoric closed 2 years ago

Yoric commented 3 years ago

It would be nice if we could support all these typos.

Context: I'm using linkify-it as part of a spam/malicious link checker, and I see these typos escaping our watchful gaze :)

tsutsu3 commented 3 years ago

Hi @Yoric, thanks for using this library 😄

This project aims to be a pure port of linkify-it to Python, so there are no plans to address this issue at this time.

However, you can support these typos by writing a custom handler:

import re

from linkify_it import LinkifyIt

linkify = LinkifyIt()

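def validate(self_cls, text, pos):
    # `pos` is the index just after the "http:" prefix; only the tail is checked.
    tail = text[pos:]

    # Rebuild the "http" tail pattern so it accepts zero or more slashes.
    self_cls.re["http"] = (
        "^(/*)"  # Allow slash typos.
        + self_cls.re["src_auth"]
        + self_cls.re["src_host_port_strict"]
        + self_cls.re["src_path"]
    )

    founds = re.search(self_cls.re["http"], tail, flags=re.IGNORECASE)
    if founds:
        # Return the length of the matched tail.
        return len(founds.group())

    # 0 means no link was found at this position.
    return 0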

def normalize(self_cls, match):
    if not match.schema:
        match.url = "http://" + match.url

    if match.schema == "http:":
        # Force replace typos
        match.url = re.sub("^(http:/*)", "http://", match.url)

    if match.schema == "mailto:" and not re.search(
        "^mailto:", match.url, flags=re.IGNORECASE
    ):
        match.url = "mailto:" + match.url

linkify.add("http:", {"validate": validate, "normalize": normalize})

print(linkify.test("test.com"))
print(linkify.match("test.com"))
print("-------")

print(linkify.test("http:/test.com"))  # typos
print(linkify.match("http:/test.com"))  # typos
print("-------")

print(linkify.test("http://test.com"))
print(linkify.match("http://test.com"))
print("-------")

print(linkify.test("http:///test.com"))  # typos
print(linkify.match("http:///test.com"))  # typos
print("-------")

output:

True
[linkify_it.main.Match({'schema': '', 'index': 0, 'last_index': 8, 'raw': 'test.com', 'text': 'test.com', 'url': 'http://test.com'})]
-------
True
[linkify_it.main.Match({'schema': 'http:', 'index': 0, 'last_index': 14, 'raw': 'http:/test.com', 'text': 'http:/test.com', 'url': 'http://test.com'})]
-------
True
[linkify_it.main.Match({'schema': 'http:', 'index': 0, 'last_index': 15, 'raw': 'http://test.com', 'text': 'http://test.com', 'url': 'http://test.com'})]
-------
True
[linkify_it.main.Match({'schema': 'http:', 'index': 0, 'last_index': 16, 'raw': 'http:///test.com', 'text': 'http:///test.com', 'url': 'http://test.com'})]
-------
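
The run above does not cover the zero-slash case from the issue title (http:foobar.com). As a rough, standard-library-only sketch of the same slash-collapsing idea used in normalize, the snippet below rewrites any number of slashes after the scheme to exactly two. The helper name fix_scheme_slashes is hypothetical and not part of linkify-it-py, and accepting https as well as http is an assumption that goes beyond the handler above.

```python
import re

# Hypothetical helper for illustration; not part of linkify-it-py.
# Assumption: "https:" should get the same treatment as "http:".
_SCHEME_TYPO = re.compile(r"^(https?):/*", flags=re.IGNORECASE)


def fix_scheme_slashes(url: str) -> str:
    """Rewrite 'http:', 'http:/', 'http:///', ... prefixes to 'http://'."""
    # Lowercase the scheme and emit exactly two slashes after it.
    return _SCHEME_TYPO.sub(lambda m: m.group(1).lower() + "://", url, count=1)


if __name__ == "__main__":
    samples = [
        "http:foobar.com",
        "http:/foobar.com",
        "http://foobar.com",
        "http:///foobar.com",
    ]
    for raw in samples:
        print(raw, "->", fix_scheme_slashes(raw))
```

Strings without an http or https prefix are returned unchanged, so this could run on match.url after matching without touching other schemas.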
tsutsu3 commented 2 years ago

This issue was closed because it is old.