Thanks for your contribution, I can see that the work you've done here is very impressive. I like the idea of asynchronous requests, but the changes you've made are quite significant and I have to study them carefully. And since I'm not at all familiar with aiohttp, please give me some time to think about your pull request.
So, I took a peek at your version and I already have several questions.
First, I ran into an "AttributeError: 'ClientSession' object has no attribute 'headers'" error at https://github.com/soxoj/Search-Engines-Scraper/blob/6cd086e845c34c0568e45e7457297476e46f32de/search_engines/http_client.py#L19. As a temporary fix, I'm passing headers to the constructor in HttpClient() and setting ClientSession._default_headers = CIMultiDict(headers) in SearchEngine.set_headers(). Apparently, .headers doesn't exist or isn't accessible in some versions of aiohttp - I'm using aiohttp 3.5.1, Python 3.7, Windows.
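Roughly what that hack looks like, just for illustration - the HttpClient/SearchEngine wiring below is simplified, and _http_client is a placeholder name:

import aiohttp
from multidict import CIMultiDict

class HttpClient(object):
    def __init__(self, headers=None):
        # stopgap: hand the headers to the session at construction time
        self.session = aiohttp.ClientSession(headers=headers)

class SearchEngine(object):
    def __init__(self, headers=None):
        self._http_client = HttpClient(headers)

    def set_headers(self, headers):
        # stopgap: overwrite the session's private default-headers mapping
        self._http_client.session._default_headers = CIMultiDict(headers)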
Next, I noticed you're catching Exception in HttpClient, which is generally not a good idea. For example, I forgot to comment out self.session.headers['Referer'] = page in HttpClient.get(), and the only error indication I got was "ERROR Attribute not found.". I think it would be best to use ClientConnectionError, ClientConnectorError, and possibly a couple more. I'm not familiar with aiohttp - what do you think?
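Something along these lines is what I have in mind - just a sketch of HttpClient.get(), and the exact exception list is open:

try:
    req = await self.session.get(page, headers=self.headers, timeout=self.timeout)
    text = await req.text()
except (aiohttp.ClientConnectionError, aiohttp.ClientConnectorError) as e:
    # expected network failures - report them via the response tuple
    return self.response(http=0, html=str(e))
# anything else (like that AttributeError) should propagate loudly
return self.response(http=req.status, html=text)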
I noticed that you left this: https://github.com/soxoj/Search-Engines-Scraper/blob/6cd086e845c34c0568e45e7457297476e46f32de/search_engines/engine.py#L182. I believe await and time.sleep() don't go together - shouldn't we use asyncio.sleep() instead? And while we're on this topic, why would you set the delay to zero by default? That would get us banned by most search engines.
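i.e. something like this inside the paging loop (the _delay name is just a placeholder):

# non-blocking pause between requests; time.sleep() would block the whole event loop
await asyncio.sleep(self._delay)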
Finally, I believe aiohttp >= 3.0 doesn't support Python 2.7 and some early Python 3 versions. I know Python 2.7 is deprecated, but it's still being used and I'd like to offer maximum compatibility and usability.
Then there are some mostly aesthetic issues. I don't like the SearchEngine.close() method, it makes no sense. I'd prefer ClientSession.close() to be called internally, somehow. First I thought of __del__(), but it's usually a bad idea. I think a better option is to implement __enter__() and __exit__() and have something like

with SearchEngine() as eng:
    results = eng.search('query')

But that's not a real solution, we'd still have to call .close() (maybe rename to "finish" or something similar) when not using a context manager. Any ideas?
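One more thought: since ClientSession.close() is a coroutine in aiohttp 3.x, it would probably have to be the async flavour of the protocol anyway. A rough sketch, with illustrative names:

class SearchEngine(object):
    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.close()  # or "finish", whatever it ends up being called

# usage, from inside a coroutine
async with SearchEngine() as eng:
    results = await eng.search('query')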
In SearchEngine() you added a default print_func argument, but in its children you added args+kwargs. Why is that? Why not have print_func= or *args, **kwargs in both parent and children? Also, I don't really see the need for that, since we could just "monkey-patch" output.console(), like

from search_engines import output
output.console = lambda *args, **kwargs: None

Or a more formal option would be to add a boolean .suppress_console_output attribute, if the goal is to remove any prints.
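If we go the attribute route, it could stay very small - a sketch, assuming all prints go through output.console() and using a hypothetical _print() helper:

from search_engines import output

class SearchEngine(object):
    def __init__(self):
        self.suppress_console_output = False

    def _print(self, *args, **kwargs):
        # single choke point for all console output
        if not self.suppress_console_output:
            output.console(*args, **kwargs)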
That's all for now, let me dig a little deeper and I'll come back to talk more.
First of all, sorry - I made some new commits to my master branch that I didn't intend to show up in this PR (forgot how it works, huh). But let's discuss those updates too, because I believe they can be useful not only for me.
First, I ran into an "AttributeError: 'ClientSession' object has no attribute 'headers'" error
You are right, my bad. Of course, this behaviour should be changed.
Next, I noticed you're catching Exception
I tried to catch a general ClientError at first, but later I got some other errors with proxies that are not descendants of the basic aiohttp request error. We could add a separate except block for each type, but I'm afraid some errors would still slip through uncaught.
shouldn't we use asyncio.sleep() instead
You are totally right - I forgot to fix that.
why would you set the delay to zero by default?
Just for debugging purposes - sorry once more about those later commits. I think it would be better to add a separate optional timeout argument in the constructor.
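e.g. something like this - whether it's called delay or timeout is open, and the constructor signature below is only a sketch:

class SearchEngine(object):
    def __init__(self, proxy=PROXY, timeout=TIMEOUT, delay=1):
        self._http_client = HttpClient(timeout, proxy)
        self._delay = delay  # seconds to pause between paginated requests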
I don't like the SearchEngine.close() method
Well, as far as I know, it's a usual thing in Python async code. But, of course, it would be better to implement __enter__() and __exit__() as well, to let users decide which code style to use.
Why not have print_func= or *args, **kwargs in both parent and children?
I agree, the second way is cleaner. I believe there should be an easy way to change the default status function for complex processes like a real-time search with pagination, but a suppress_console_output argument is also a good idea.
Alright, I think that covers about everything. To start, let's focus on the first two issues, which are the most important.
I want to stress the importance of specific exception handling - we could get really strange bugs otherwise. We can catch multiple exceptions in one except block if we handle them the same way (i.e. return a 0 code for all types of connection errors). If something we didn't account for happens, just let it crash - we wouldn't know how to handle it anyway.
The headers thing is a little more complex, at least for me. I'm not sure if the ClientSession._default_headers approach I used is a good option; I have to do some research and debugging here. Maybe we could re-instantiate the HTTP client in .set_headers(), but that's not an option in the client's .get()/.post() methods, where we set "Referer". Also, it would be best to update existing headers, not replace them completely.
I've made the updates. Please review when you have a moment.
Your updates look great, but I haven't tested them because I'm working on the HTTP client at the moment. I think I found a better solution to the headers problem: replace self.session.headers with a ._headers dict and add a .set_headers() that updates it, then pass headers=self._headers to self.session's get()/post() methods. I still haven't figured out the best exceptions to catch, though. ClientError is the best so far, but it doesn't catch proxy connection errors.
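In code, roughly (._headers and .set_headers() are tentative names):

class HttpClient(object):
    def __init__(self):
        self.session = aiohttp.ClientSession()
        self._headers = {'User-Agent': USER_AGENT, 'Accept-Language': 'en-GB,en;q=0.5'}

    def set_headers(self, headers):
        # merge into the existing headers instead of replacing the whole dict
        self._headers.update(headers)

    async def get(self, page):
        # headers passed per request, so "Referer" updates stay inside this client
        req = await self.session.get(page, headers=self._headers)
        return await req.text()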
Well then, I'll be waiting for your final list of things that need to be fixed.
I see you updated HttpClient and fixed the headers issue. It's very close to what I had in mind; the only difference is that I was thinking of a private ._headers with a .set_headers(). But this is perfectly fine, no need to complicate things.
However, aiohttp.client_exception.ClientError gives me "AttributeError: module 'aiohttp' has no attribute 'client_exception'". In case it's just a typo: aiohttp.ClientError catches most aiohttp exceptions, since they're based on it, but it doesn't catch proxy errors. I was able to catch proxy errors with aiohttp_socks's exceptions, and this is what I have so far:
import aiohttp
from collections import namedtuple
from aiohttp_socks import ProxyConnector, ProxyError, ProxyConnectionError
from asyncio import TimeoutError

from .config import TIMEOUT, PROXY, USER_AGENT
from . import utils as utl


class HttpClient(object):
    '''Performs HTTP requests. An `aiohttp` wrapper, essentially.'''
    def __init__(self, timeout=TIMEOUT, proxy=PROXY):
        if proxy:
            # route requests through the proxy via aiohttp_socks
            connector = ProxyConnector.from_url(proxy)
            self.session = aiohttp.ClientSession(connector=connector)
        else:
            self.session = aiohttp.ClientSession()
        self.headers = {
            'User-Agent': USER_AGENT,
            'Accept-Language': 'en-GB,en;q=0.5',
        }
        self.timeout = timeout
        self.response = namedtuple('response', ['http', 'html'])

    async def close(self):
        await self.session.close()

    async def get(self, page):
        '''Submits an HTTP GET request.'''
        page = self._quote(page)
        try:
            req = await self.session.get(page, headers=self.headers, timeout=self.timeout)
            text = await req.text()
            self.headers['Referer'] = page
        except (aiohttp.ClientError, ProxyError, ProxyConnectionError) as e:
            # connection and proxy failures are reported via the response tuple
            return self.response(http=getattr(e, 'error_code', 0), html=str(e))
        except TimeoutError:
            return self.response(http=0, html='timeout')
        return self.response(http=req.status, html=text)

    async def post(self, page, data):
        '''Submits an HTTP POST request.'''
        page = self._quote(page)
        try:
            req = await self.session.post(page, data=data, headers=self.headers, timeout=self.timeout)
            text = await req.text()
            self.headers['Referer'] = page
        except (aiohttp.ClientError, ProxyError, ProxyConnectionError) as e:
            return self.response(http=getattr(e, 'error_code', 0), html=str(e))
        except TimeoutError:
            return self.response(http=0, html='timeout')
        return self.response(http=req.status, html=text)

    def _quote(self, url):
        '''URL-encodes URLs.'''
        if utl.unquote_url(url) == url:
            url = utl.quote_url(url)
        return url
I've tested it in some scenarios (connection errors, timeout errors and proxy errors in general) and it seems to work well. Could you take a look, in case I missed something?
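For reference, I've been exercising it with a throwaway script along these lines (asyncio.run() needs Python 3.7+):

import asyncio

async def main():
    client = HttpClient(timeout=10)
    try:
        resp = await client.get('https://www.bing.com/search?q=test')
        print(resp.http, len(resp.html))
    finally:
        await client.close()

asyncio.run(main())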
Looks great! 👍 Will you update this part of the code by yourself?
I'm thinking about it. I don't know if merging is the best option; it may be better to keep the two versions separate. My main concern is limited compatibility: my version requires Python >= 2.7 and I think your version requires Python >= 3.6. Another big issue for me is that I'm not that familiar with aiohttp, and I don't know if I'll be able to properly maintain and update this project.
I think it would be best to have two versions, one sync and one async, and let the users choose. I can put a link in my readme and point them to your async version. And you can change your fork however you want, of course. I'll be here if you need any help with bs4 or with any parts of the code I wrote. Sounds good?
Well, I understand your concerns. I'll try to support the async version on my own. Should I rename the fork to something like async-search-engines-scraper, what do you think?
Yes, that's fine. Or you can choose something more creative if you want, it's up to you really. You could also set your own __version__ and __author__ and make any other required changes. Just let me know when it's ready, so I can link it in my readme.
Hey, I renamed my fork to async-search-scraper and will update the package information soon. So you can put a link in your readme and point to the async version.
Thanks, I added it to my readme. Just let me know if you want to change the URL in the future.
Hey! Thank you for your work!
I want to use your library to search multiple engines simultaneously, but requests calls are blocking and cannot be executed concurrently with the rest of my code. I have made a relatively simple update of the code - changed requests calls to aiohttp calls and made some minor necessary updates. Hope these changes are not too big and they don't contradict your vision of the tool's development.