lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from given text, based on locating the TLD.
MIT License
242 stars 61 forks

check dns concurrently to speed up lookup #91

Open nicolasassi opened 3 years ago

nicolasassi commented 3 years ago

Is there any room to check DNS concurrently, since it can be a time-consuming task? I've found some hacky ways to do that, but maybe it could be a flag on find_urls. I'm aware this would increase the complexity of that function a lot.

lipoja commented 3 years ago

I would like to invite @jayvdb to this conversation, since he is the one who implemented DNS checking.

nicolasassi commented 3 years ago

for reference, this is the hacky way I was referring to:

from urlextract import URLExtract
from app.infrastructure.analyser.finder import Finder  # project-specific base class
from pebble import concurrent
from concurrent.futures import TimeoutError

class Hyperlinks(Finder):

    def __init__(self):
        self.extractor = URLExtract()

    @concurrent.process(timeout=2)
    def _check_dns(self, url: str):
        # runs in a separate process so a slow lookup can be timed out
        return self.extractor.has_urls(url, check_dns=True)

    def find(self, text: str) -> list:
        urls = self.extractor.find_urls(text, only_unique=True, check_dns=False)
        # start one DNS-checking process per URL, then collect the results
        futures = [(url, self._check_dns(url)) for url in urls]
        checked = []
        for url, future in futures:
            try:
                if future.result():
                    checked.append(url)
            except TimeoutError:
                pass  # treat a DNS timeout as a failed check
        return checked

f = Hyperlinks()
f.find("https://github.com/lipoja/URLExtract/issues/91 https://github.com/lipoja/URLExtract/issues/90")
jayvdb commented 3 years ago

Sounds like a great idea, and should complement caching nicely. Also pebble seems like a good choice.

nicolasassi commented 3 years ago

I was studying the code to find a way to achieve concurrency in the DNS lookup. But I kind of got stuck, because it seems that there is a cursor in _complete_url which depends on the result of each iteration before moving on to the next possible URL. So it would need to be a synchronous process. What do you guys think?

lipoja commented 3 years ago

I believe a synchronous process for checking domains is already in place. That is the current behavior, right?

I am thinking about keeping the DNS check the way it is now for gen_urls(). I mean: when the user calls this directly with check_dns enabled, we do not modify this part, and every domain is checked right after it is discovered (completed).

We could modify find_urls(). When the user calls this method, it would collect all URLs first and do the concurrent DNS check as the last step of the method (just before return). At that point we should have all URLs ready for this kind of check.

lipoja commented 3 years ago

The second thing is that this brings complexity to the user settings. We should allow users to set the maximum number of concurrent processes/threads, probably also a timeout value, and maybe other things that I am not aware of right now.

nicolasassi commented 3 years ago

Great! I was thinking of adding concurrency inside gen_urls() and giving the user options from the functions above it. Achieving concurrency in the way you mentioned would be a lot easier. I'm going to work on that.

lipoja commented 3 years ago

OK, could you also think about how to do it so the code does not get tangled? For example: can we separate all the DNS checking parts into their own class/module? This is just an idea; I did not look at the code to see how tricky it will be. Also it would be nice to give users a method that they can call whenever they want, for example when they use gen_urls() or do some post-process filtering of URLs.
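One way that separation might look: a standalone checker class holding the concurrency settings (max workers, timeout) that users can also call directly on their own URL lists. DNSChecker and its methods are hypothetical names for illustration, not existing URLExtract API, and socket.getaddrinfo stands in for the library's actual DNS check.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

class DNSChecker:
    """Hypothetical standalone DNS-checking module, decoupled from extraction."""

    def __init__(self, max_workers: int = 10, timeout: float = 2.0):
        self.max_workers = max_workers
        socket.setdefaulttimeout(timeout)  # crude per-lookup timeout

    def host_of(self, url: str) -> str:
        # netloc may contain credentials and a port; keep only the hostname
        return urlsplit(url).netloc.split("@")[-1].split(":")[0]

    def resolves(self, url: str) -> bool:
        try:
            socket.getaddrinfo(self.host_of(url), None)
            return True
        except OSError:
            return False

    def filter(self, urls):
        """Return only URLs whose hosts resolve, checked concurrently."""
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            results = pool.map(self.resolves, urls)
        return [url for url, ok in zip(urls, results) if ok]

# Usable for post-process filtering, independent of find_urls():
checker = DNSChecker(max_workers=4)
print(checker.filter(["http://localhost/"]))
```

Keeping the settings on the class means find_urls() would only need to accept (or construct) a checker, rather than growing new keyword arguments for every tuning knob.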