@vsoch Here is another attempt at improving our long and forgiving regex :smile:
A little background: the current regex is something I found online, and after testing it against various links I deemed it good enough. However, I was never comfortable with how long it is.
Complexity, simplicity and regex visualizations
Here is a simplified graph of what we have at the moment (domain extensions are replaced with ... except .com and .org):
So after hacking and tweaking for a couple of days, I think I came up with an improved regex that is shorter and simpler, and faster as well in my tests. Here is how it looks:
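For reference, this is the new pattern (it appears as the NEW_REGEX variable in the benchmark below):

(http[s]?:\/\/)(www\.)?([a-zA-Z0-9$-_@&+!*\(\),\/\.]+[\.])([a-zA-Z]+)([\/\.\-\_\=?#a-zA-Z0-9@&_=:%+~\(\)]+)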
Comparing efficiency and speed
Here is a small demo of how it performs: https://regex101.com/r/zvnFp6/1
Unfortunately I couldn't run the same thing for our current regex because it is too long for regex101. However, I did run the following comparison locally:
import re
import time
domain_extensions = "".join(
    (
        "com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|",
        "jobs|mobi|museum|name|post|pro|tel|travel|xxx|",
        "ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|",
        "ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|",
        "ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|",
        "dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|",
        "fi|fj|fk|fm|fo|fr|",
        "ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|",
        "hk|hm|hn|hr|ht|hu|",
        "id|ie|il|im|in|io|iq|ir|is|it|",
        "je|jm|jo|jp|ke|kg|kh|ki|",
        "km|kn|kp|kr|kw|ky|kz|",
        "la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|",
        "ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|",
        "na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|",
        "om|",
        "pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|",
        "qa|",
        "re|ro|rs|ru|rw|",
        "sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|",
        "tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|",
        "ua|ug|uk|us|uy|uz|",
        "va|vc|ve|vg|vi|vn|vu|",
        "wf|ws|",
        "ye|yt|yu|",
        "za|zm|zw",
    )
)
# URL_REGEX1 is a programmatic rebuild of CURRENT_REGEX (below) from domain_extensions
URL_REGEX1 = "".join(
    (
        "(?i)\\b(",
        "(?:",
        "https?:(?:/{1,3}|[a-z0-9%]",
        ")",
        "|[a-z0-9.\\-]+[.](?:%s)/)" % domain_extensions,
        "(?:",
        "[^\\s()<>{}\\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        ")",
        "+",
        "(?:",
        "\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        "|[^\\s`!()\\[\\]{};:'\".,<>?«»“”‘’]",
        ")",
        "|",
        "(?:",
        "(?<!@)[a-z0-9]",
        "+(?:[.\\-][a-z0-9]+)*[.]",
        "(?:%s)\\b/?(?!@)" % domain_extensions,
        "))",
    )
)
CURRENT_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
NEW_REGEX = r"(http[s]?:\/\/)(www\.)?([a-zA-Z0-9$-_@&+!*\(\),\/\.]+[\.])([a-zA-Z]+)([\/\.\-\_\=?#a-zA-Z0-9@&_=:%+~\(\)]+)"
# read the file content
file_path = "links.txt"
with open(file_path, "r") as file:
    content = file.read()

links = [l for l in content.split("\n") if "http" in l]  # lines that actually contain a link

# 1st regex: the programmatic rebuild (URL_REGEX1)
t01 = time.time()
for i in range(1000):
    urls0 = re.findall(URL_REGEX1, content)
t02 = time.time()
print("DT0 =", t02 - t01)
print("LEN0 =", len(urls0))

# current regex (CURRENT_REGEX, the single long string)
t11 = time.time()
for i in range(1000):
    urls1 = re.findall(CURRENT_REGEX, content)
t12 = time.time()
print("DT1 =", t12 - t11)
print("LEN1 =", len(urls1))

# 2nd regex: the new one (NEW_REGEX); findall returns group tuples, so join them
t21 = time.time()
for i in range(1000):
    urls2 = ["".join(x) for x in re.findall(NEW_REGEX, content)]
t22 = time.time()
print("DT2 =", t22 - t21)
print("LEN2 =", len(urls2))
links.txt is a file with 755 URLs, each on a separate line. These URLs were collected from the logs of buildtest and us-rse. The results of the comparison are the following:
As you can see, the long, beautifully formatted regex takes the most time and performs worse than the others. The new regex is the fastest, and every URL it returns is guaranteed to contain http or https.
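To make that last point concrete, here is a quick sanity check; the sample string is my own, not taken from links.txt:

sample = "see https://example.com/a(b) and the bare domain example.org"

# the current regex matches both, including the bare domain
print(re.findall(CURRENT_REGEX, sample))
# the new regex only matches the URL with an explicit scheme
print(["".join(m) for m in re.findall(NEW_REGEX, sample)])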
So what's next?
I suggest you take a look at all this and maybe test the regex with different URLs and different ideas to check its robustness; if your results are positive too, then I can submit a PR :wink: The blog post In search of the perfect URL validation regex is a good inspiration. I think we would rank roughly third in their test.
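If it helps, here is a minimal harness in the spirit of that post's test table; the cases below are illustrative picks of mine, not the post's full list:

import re

NEW_REGEX = r"(http[s]?:\/\/)(www\.)?([a-zA-Z0-9$-_@&+!*\(\),\/\.]+[\.])([a-zA-Z]+)([\/\.\-\_\=?#a-zA-Z0-9@&_=:%+~\(\)]+)"

should_match = [
    "http://foo.com/blah_blah",
    "https://www.example.com/foo/?bar=baz&inga=42&quux",
]
should_not_match = [
    "http://",   # scheme only
    "foo.com",   # the new regex intentionally requires a scheme
]

for url in should_match:
    assert re.search(NEW_REGEX, url), "expected a match: " + url
for url in should_not_match:
    assert not re.search(NEW_REGEX, url), "expected no match: " + url
print("all cases passed")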