urlstechie / urlchecker-python

:snake: :link: Python module and client for checking URLs
https://urlchecker-python.readthedocs.io
MIT License
20 stars 13 forks source link

Improve urls regex #70

Open SuperKogito opened 2 years ago

SuperKogito commented 2 years ago

@vsoch Once again, here is another attempt at improving our long and forgiving nice regex :smile:

A little background, the current regex is something I found online and after testing it along with other links I deemed it to be good enough. However, I was never comfortable with how long it was.

Complexity, simplicity and regex visualizations

Here is a simplified (domain extensions are replaced with ... except .com and .org) graph of what we have at the moment: image

So after hacking and tweaking for a couple of days, I think I came up with an improved regex, that is shorter which means faster and simpler. Here how it looks: image(2)

Comparing efficiency and speed

Here is a small idea on how it performs: https://regex101.com/r/zvnFp6/1 Unfortunately I couldn't run the same thing for our current regex cuz it is too long. However, I did run a the following comparison locally:

import re 
import time 

domain_extensions = "".join(
    (
        "com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|",
        "jobs|mobi|museum|name|post|pro|tel|travel|xxx|",
        "ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|",
        "ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|",
        "ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|",
        "dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|",
        "fi|fj|fk|fm|fo|fr|",
        "ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|",
        "hk|hm|hn|hr|ht|hu|",
        "id|ie|il|im|in|io|iq|ir|is|it|",
        "je|jm|jo|jp|ke|kg|kh|ki|",
        "km|kn|kp|kr|kw|ky|kz|",
        "la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|",
        "ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|",
        "na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|",
        "om|",
        "pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|",
        "qa|",
        "re|ro|rs|ru|rw|",
        "sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|",
        "tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|",
        "ua|ug|uk|us|uy|uz|",
        "va|vc|ve|vg|vi|vn|vu|",
        "wf|ws|",
        "ye|yt|yu|",
        "za|zm|zw",
    )
)

URL_REGEX1 = "".join(
    (
        "(?i)\\b(",
        "(?:",
        "https?:(?:htt/{1,3}|[a-z0-9%]",
        ")",
        "|[a-z0-9.\\-]+[.](?:%s)/)" % domain_extensions,
        "(?:",
        "[^\\s()<>\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        ")",
        "+",
        "(?:",
        "\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        "|[^\\s`!()\\[\\];:'\".,<>?«»“”‘’]",
        ")",
        "|",
        "(?:",
        "(?<!@)[a-z0-9]",
        "+(?:[.\\-][a-z0-9]+)*[.]",
        "(?:%s)\\b/?(?!@)" % domain_extensions,
        "))",
    )
)
CURRENT_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
NEW_REGEX = "(http[s]?:\/\/)(www\.)?([a-zA-Z0-9$-_@&+!*\(\),\/\.]+[\.])([a-zA-Z]+)([\/\.\-\_\=?#a-zA-Z0-9@&_=:%+~\(\)]+)"

# read file content
file_path = "links.txt"
with open(file_path, "r") as file:
    content = file.read()

links  = [ l for l in content.split("\n") if "http" in l ]

# 1st regex
t01 = time.time()
for i in range(1000):
    urls0 = re.findall(URL_REGEX1, content)
t02 = time.time()
print("DT0  =", t02-t01)
print("LEN0 = ", len(urls0))

# final regex
t11 = time.time()
for i in range(1000):
    urls1 = re.findall(CURRENT_REGEX, content)
t12 = time.time()
print("DT1  =", t12-t11)
print("LEN1 = ", len(urls1))

# 2nd regex
t21   = time.time()
for i in range(1000):
    urls2 = ["".join(x) for x in re.findall(NEW_REGEX, content)]
t22   = time.time()
print("DT2  =", t22-t21)
print("LEN2 = ", len(urls2))

links.txt is a file with 755 urls, each on a seperate line. These urls are collected from the logs of buildtest and us-rse. The results of the previous comparison are the following:

DT0  = 2.3765275478363037
LEN0 =  748

DT1  = 0.7541322708129883
LEN1 =  755

DT2  = 0.6342747211456299
LEN2 =  755

As you can see the long beautifully formatted regex takes a lot of time and is worse than the others. The newest regex is the fastest and it returns urls that for sure has http or https in them.

So what's next?

I suggest you take a look at all this, and maybe test the regex too with different urls and different ideas to check its robustness and if your results are positive too then I can submit a PR :wink: This blog post: In search of the perfect URL validation regex is a good inspiration. I think we rank somewhat third according to their test.

vsoch commented 2 years ago

@SuperKogito I agree - and the proof would be in the pudding! If you want to open a PR for this, I can release a development action and test on USRSE.

SuperKogito commented 2 years ago

okay sounds good to me. Let us first merge #69.