Open nicolasassi opened 3 years ago
I'm not really sure how depedencies work on this project, but I tried to add Pebble both to setup.py and to requirements.txt. Hope it fix the problem
You should be able to run tox
locally to test if everything works.
requirements.txt
should be used to install all dependencies for development
setup.py
should contain all basic dependencies necessary to run urlextract
Thank you for your time and effort working on this issue. I really appreciate that! Let's have a discussion about this code and I believe that withing few iteration we can merge the code.
I did not had chance to review it all, I want to go deeper once I have more time.
FYI: Do you think we could also add some test for it? I did not thought about it much yet. But I think there should be some.
@lipoja I fixed most of what was asked. As some changes are still in discussion (like i find_urls
should have some many args etc...) there are still changes to be made. Also, the tests are not passing.
For some reason with my alterations now the results of dns checking are not being saved in the cache (that's way the tests are not passing now) and I can't figure out why. Could you please give it a look?
Thanks!
@nicolasassi Sorry for the delay, lack of time it is ... family comes first these days.
What I was able to determine is that your second commit is breaking the tests. So maybe we can dig deeper around that. It is about moving cache to one function. But at the first glance I do not see anything wrong. It needs more debugging.
I will look on when I save some time (might be on weekend, but no promises).
@nicolasassi OK I found it and fixed it:
Method socket.gethostbyname(host)
accept hosts ... unfortunately there should be some pre-processing since you are working with URLs and not with hosts only. You have to modify _get_hots()
to this:
def _get_host(self, host: str):
"""
Get the IP address from a given host
:param str host: the host to get IP from
:return: A tuple with the given host and its IP address (a string of the form '255.255.255.255') if found
(e.g: host.com, '255.255.255.255')
:rtype: tuple
"""
tmp_url = host
scheme_pos = host.find('://')
if scheme_pos == -1:
tmp_url = 'http://' + host
url_parts = uritools.urisplit(tmp_url)
tmp_host = url_parts.gethost()
if isinstance(tmp_host, ipaddress.IPv4Address):
return host, tmp_host
try:
return host, socket.gethostbyname(tmp_host)
...
Thank you for your work, it looks like the parallel processing might work well... I want to review and test it more once this fix is in place.
@lipoja sorry for the delay I'm also kind busy these days. As soon as I have some time I'm gonna implement the changes you've suggested and add more tests. Thank you for your time!
@nicolasassi Do you think I can take this over and finish that PR in case I have some time?
@lipoja sure! Life has been crazy and unfortunatelly time is sort... I still hope I can take some time to focus on this project and finish it, but just in case, if you have some time, feel free to take over.
Hope my contribuition already helped somehow and feel free to @ me if you need some help on this or anything in the future
implemented ideas discussed in #91.
Also moved all dns checking for
find_urls
andhas_urls
so all found urls could be check concurrently if the user needs. I kept all intances of dns checking in the abstract methods for backwards compatibility but marked them as DEPRECATED and removed its effects. Maybe we could remove it all together in the future.