lipoja / URLExtract

URLExtract is python class for collecting (extracting) URLs from given text based on locating TLD.
MIT License
244 stars 61 forks source link

move dns checking to dedicated class and add concurrency #92

Open nicolasassi opened 3 years ago

nicolasassi commented 3 years ago

implemented ideas discussed in #91.

Also moved all dns checking for find_urls and has_urls so all found urls could be check concurrently if the user needs. I kept all intances of dns checking in the abstract methods for backwards compatibility but marked them as DEPRECATED and removed its effects. Maybe we could remove it all together in the future.

nicolasassi commented 3 years ago

I'm not really sure how depedencies work on this project, but I tried to add Pebble both to setup.py and to requirements.txt. Hope it fix the problem

lipoja commented 3 years ago

You should be able to run tox locally to test if everything works. requirements.txt should be used to install all dependencies for development setup.py should contain all basic dependencies necessary to run urlextract

lipoja commented 3 years ago

Thank you for your time and effort working on this issue. I really appreciate that! Let's have a discussion about this code and I believe that withing few iteration we can merge the code.

I did not had chance to review it all, I want to go deeper once I have more time.

FYI: Do you think we could also add some test for it? I did not thought about it much yet. But I think there should be some.

nicolasassi commented 3 years ago

@lipoja I fixed most of what was asked. As some changes are still in discussion (like i find_urls should have some many args etc...) there are still changes to be made. Also, the tests are not passing.

For some reason with my alterations now the results of dns checking are not being saved in the cache (that's way the tests are not passing now) and I can't figure out why. Could you please give it a look?

Thanks!

lipoja commented 3 years ago

@nicolasassi Sorry for the delay, lack of time it is ... family comes first these days.

What I was able to determine is that your second commit is breaking the tests. So maybe we can dig deeper around that. It is about moving cache to one function. But at the first glance I do not see anything wrong. It needs more debugging.

I will look on when I save some time (might be on weekend, but no promises).

lipoja commented 3 years ago

@nicolasassi OK I found it and fixed it: Method socket.gethostbyname(host) accept hosts ... unfortunately there should be some pre-processing since you are working with URLs and not with hosts only. You have to modify _get_hots() to this:

    def _get_host(self, host: str):
        """
         Get the IP address from a given host
        :param str host: the host to get IP from
        :return: A tuple with the given host and its IP address (a string of the form '255.255.255.255') if found
        (e.g: host.com, '255.255.255.255')
        :rtype: tuple
        """
        tmp_url = host
        scheme_pos = host.find('://')
        if scheme_pos == -1:
            tmp_url = 'http://' + host

        url_parts = uritools.urisplit(tmp_url)
        tmp_host = url_parts.gethost()

        if isinstance(tmp_host, ipaddress.IPv4Address):
            return host, tmp_host

        try:
            return host, socket.gethostbyname(tmp_host)
        ...

Thank you for your work, it looks like the parallel processing might work well... I want to review and test it more once this fix is in place.

nicolasassi commented 3 years ago

@lipoja sorry for the delay I'm also kind busy these days. As soon as I have some time I'm gonna implement the changes you've suggested and add more tests. Thank you for your time!

lipoja commented 3 years ago

@nicolasassi Do you think I can take this over and finish that PR in case I have some time?

nicolasassi commented 3 years ago

@lipoja sure! Life has been crazy and unfortunatelly time is sort... I still hope I can take some time to focus on this project and finish it, but just in case, if you have some time, feel free to take over.

Hope my contribuition already helped somehow and feel free to @ me if you need some help on this or anything in the future