Fixes RE for IPv4 addresses

lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.

MIT License

242 stars, 61 forks

Fixes RE for IPv4 addresses #86

Closed: kak-bo-che closed 3 years ago

kak-bo-che commented 3 years ago

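When the range of IPv4 octets is joined into one alternation and passed to re.compile, the resulting regex isn't greedy enough. Alternatives are tried left to right and the first one that matches wins, so in ascending order '.8' matches before '.81' is ever tried. Reversing the range puts the longer alternatives first: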

>>> import re
>>> # Descending order (255 ... 0): longer alternatives are tried first
>>> _ipv4_tld = ['.{}'.format(ip) for ip in reversed(range(256))]
>>> foo = re.compile('|'.join(_ipv4_tld))
>>> foo.findall('.81')
['.81']
>>> # Ascending order (0 ... 255): '.8' wins before '.81' is tried
>>> _ipv4_tld = ['.{}'.format(ip) for ip in range(256)]
>>> foo = re.compile('|'.join(_ipv4_tld))
>>> foo.findall('.81')
['.8']
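
The same ordering effect can be reproduced outside the REPL. The following is only a minimal sketch, not the code from this PR or from URLExtract: `build_ipv4_suffix_pattern` is a hypothetical helper name, and it additionally escapes the dot with `re.escape`, which the session above does not do.

```python
import re


def build_ipv4_suffix_pattern(descending=True):
    """Compile an alternation of '.0' ... '.255'.

    Hypothetical helper for illustration; not part of URLExtract.
    """
    octets = range(256)
    if descending:
        # 255, 254, ..., 0: every longer suffix precedes its own prefix
        # (for example '.81' comes before '.8').
        octets = reversed(octets)
    # Assumption: escape the dot so it matches a literal '.' only.
    return re.compile('|'.join(re.escape('.{}'.format(n)) for n in octets))


ascending = build_ipv4_suffix_pattern(descending=False)
descending = build_ipv4_suffix_pattern(descending=True)

# Alternation picks the first branch that matches, not the longest one.
print(ascending.findall('.81'))   # ['.8']
print(descending.findall('.81'))  # ['.81']
```

Sorting the alternatives by length in descending order would have the same effect; reversing the range is simply the shortest way to get an order in which no alternative is shadowed by one of its own prefixes.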

lipoja commented 3 years ago

Thank you @kak-bo-che for this PR and the time you spent on it! I really appreciate it.

And if you have any other ideas or improvements, feel free to create an issue or PR! :) Have a nice day. Jan