tasos-py / Search-Engines-Scraper

Search google, bing, yahoo, and other search engines with python
MIT License

Requests => Aiohttp #37

Closed soxoj closed 2 years ago

soxoj commented 2 years ago

Hey! Thank you for your work!

I want to use your library to search multiple engines simultaneously, but requests' calls are blocking and cannot be executed concurrently with the rest of my code.

I have made a relatively simple update to the code - changed the requests calls to aiohttp calls and made some minor necessary adjustments.

I hope these changes are not too big and that they do not contradict your vision for the tool's development.

tasos-py commented 2 years ago

Thanks for your contribution, I can see that the work you've done here is very impressive. I like the idea of asynchronous requests, but the changes you've made are quite significant and I have to study them carefully. And since I'm not at all familiar with aiohttp, please give me some time to think about your pull request.

tasos-py commented 2 years ago

So, I took a peek at your version and I already have several questions.

First, I ran into an "AttributeError: 'ClientSession' object has no attribute 'headers'" error at https://github.com/soxoj/Search-Engines-Scraper/blob/6cd086e845c34c0568e45e7457297476e46f32de/search_engines/http_client.py#L19. As a temporary fix, I'm passing headers to the constructor in HttpClient() and using ClientSession._default_headers = CIMultiDict(headers) in SearchEngine.set_headers(). Apparently, .headers doesn't exist or isn't accessible in some versions of aiohttp - I'm using aiohttp 3.5.1, Python 3.7, Windows.
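
For reference, that temporary workaround looks roughly like this (a minimal sketch only, not the repo's actual code; the header values are placeholders and _default_headers is a private aiohttp attribute):

import asyncio
import aiohttp
from multidict import CIMultiDict

HEADERS = {'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-GB,en;q=0.5'}

async def main():
    # option 1: pass the headers to the ClientSession constructor
    session = aiohttp.ClientSession(headers=HEADERS)
    # option 2 (stopgap): overwrite the private default-headers mapping
    session._default_headers = CIMultiDict(HEADERS)
    await session.close()

asyncio.run(main())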

Next, I noticed you're catching Exception in HttpClient, which is generally not a good idea. For example, I forgot to comment out self.session.headers['Referer'] = page in HttpClient.get() and the only error indication I got was "ERROR Attribute not found.". I think it would be best to use ClientConnectionError, ClientConnectorError and possibly a couple more. I'm not familiar with aiohttp, what do you think?

I noticed that you left this https://github.com/soxoj/Search-Engines-Scraper/blob/6cd086e845c34c0568e45e7457297476e46f32de/search_engines/engine.py#L182. I believe await and time.sleep() don't go together, shouldn't we use asyncio.sleep() instead? And while we're on this topic, why would you set the delay to zero by default? This would produce a lot of bans by most search engines.
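
For illustration, the non-blocking form would look something like this (a rough sketch with hypothetical names, not code from either branch):

import asyncio

async def throttled_get(client, page, delay=1.0):
    # asyncio.sleep() suspends only this coroutine and lets the event loop
    # keep running; time.sleep() would block the whole thread
    await asyncio.sleep(delay)
    return await client.get(page)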

Finally, I believe aiohttp >= 3.0 doesn't support Python 2.7 and some early Python 3 versions. I know Python 2.7 is deprecated, but it's still being used and I'd like to offer maximum compatibility and usability.

Then there are some mostly aesthetic issues. I don't like the SearchEngine.close() method, it makes no sense to me. I'd prefer ClientSession.close() to be called internally, somehow. First I thought of __del__(), but that's usually a bad idea. I think a better option is to implement __enter__() and __exit__() and have something like

with SearchEngine() as eng:
    results = eng.search('query')

But that's not a real solution, we'd still have to call .close() (maybe rename to "finish" or something similar) when not using a context manager. Any ideas?
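
A rough sketch of what that could look like (not code from either branch; since ClientSession.close() has to be awaited, it would probably end up as the async variant, __aenter__()/__aexit__() with async with):

import asyncio
import aiohttp

class SearchEngine:
    '''Skeleton only - the real class lives in search_engines/engine.py.'''
    def __init__(self):
        self._session = aiohttp.ClientSession()

    async def search(self, query):
        return []  # placeholder for the real search logic

    async def close(self):
        await self._session.close()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.close()

async def main():
    async with SearchEngine() as eng:
        results = await eng.search('query')

asyncio.run(main())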

In SearchEngine() you added a default print_func argument, but in its children you added *args and **kwargs. Why is that? Why not have print_func= or *args, **kwargs in both the parent and the children? Also, I don't really see the need for it, since we could just "monkey-patch" output.console(), like

from search_engines import output
output.console = lambda *args, **kwargs: None

Or a more formal option would be to add a boolean .suppress_console_output attribute, if the goal is to remove any prints.
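
Something along these lines, say (purely a sketch; the _print() helper is hypothetical and not part of the current code):

from search_engines import output

class SearchEngine:
    suppress_console_output = False

    def _print(self, *args, **kwargs):
        # funnel all status messages through one place, so a single
        # flag can silence them
        if not self.suppress_console_output:
            output.console(*args, **kwargs)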

That's all for now, let me dig a little deeper and I'll come back to talk more.

soxoj commented 2 years ago

First of all, sorry: I made some new commits to my master branch that I didn't intend to appear in the PR (forgot how it works, huh). But let's discuss these updates too, because I believe they can be useful to more people than just me.

First, I run into a "AttributeError: 'ClientSession' object has no attribute 'headers'" error

You are right, my bad. Of course, this behaviour should be changed.

Next, I noticed you're catching Exception

I tried to catch a general ClientError the first time, but later I got some other errors with proxies that are not descendants of the basic aiohttp request error. We could add a separate except block for each type, but I am afraid some errors still won't be caught.

shouldn't we use asyncio.sleep() instead

You are totally right, I forgot to fix that.

why would you set the delay to zero by default?

Just for debugging purposes, sorry once more about the later commits. I think it would be better to make the delay a separate optional argument in the constructor.

I don't like the SearchEngine.close() method

Well, as far as I know, it's a usual thing in Python async code. But, of course, it would be better to implement __enter__() and __exit__() and let users decide which style to use.

Why not have print_func= or *args, **kwargs in both parent and children?

I agree, the second way would be cleaner. I believe there should be an easy way to change the default status function for complex processes such as a real-time search with pagination, but a suppress_console_output argument is also a good idea.

tasos-py commented 2 years ago

Alright, I think you covered just about everything. For a start, let's focus on the first two issues, which are the most important.

I want to stress the importance of specific exception handling, we could get really strange bugs otherwise. We can catch multiple exceptions in one except block if we are to handle them the same way (i.e. return a 0 status code for all types of connection errors). If something we didn't account for happens, just let it crash - we wouldn't know how to handle it anyway.

The headers thing is a little more complex, at least for me. I'm not sure if the ClientSession._default_headers approach I used is a good option, I have to do some research and debugging here. Maybe we could re-instantiate the HTTP client in .set_headers(), but that's not an option in the client's .get()/.post() methods, where we set "Referer". Also, it would be best to update existing headers, not replace them completely.

soxoj commented 2 years ago

I've made the updates. Please review when you have a moment.

tasos-py commented 2 years ago

Your updates look great, but I haven't tested them because I'm working on the HTTP client at the moment. I think I found a better solution to the headers problem: replace self.session.headers with a ._headers dict and add a .set_headers() that updates it, then pass headers=self._headers to self.session's get()/post() methods. I still haven't found the best exceptions to catch, though. ClientError is the best so far, but it doesn't catch proxy connection errors.
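
That would look roughly like this (a minimal sketch only, with a placeholder User-Agent; the actual code posted further down differs slightly):

class HttpClient:
    '''Sketch of the ._headers / .set_headers() idea, not the final code.'''
    def __init__(self, session):
        self.session = session  # an existing aiohttp.ClientSession
        self._headers = {'User-Agent': 'Mozilla/5.0'}

    def set_headers(self, headers):
        # merge new headers into the existing ones instead of replacing them
        self._headers.update(headers)

    async def get(self, page):
        # per-request headers, so ClientSession's own headers are never touched
        resp = await self.session.get(page, headers=self._headers)
        return await resp.text()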

soxoj commented 2 years ago

Well, I'll be waiting for your final list of things that need to be fixed, then.

tasos-py commented 2 years ago

I see you updated HttpClient and fixed the headers issue. It's very close to what I had in mind; the only difference is that I was thinking of a private ._headers with a .set_headers(). But this is perfectly fine, no need to complicate things.

However, aiohttp.client_exception.ClientError gives me "AttributeError: module 'aiohttp' has no attribute 'client_exception'". In case it's just a typo, aiohttp.ClientError catches most aiohttp exceptions, since they're based on it, but it doesn't catch proxy errors. I was able to catch proxy errors with aiohttp_socks's exceptions, and this is what I have so far,

import aiohttp
from collections import namedtuple
from aiohttp_socks import ProxyConnector, ProxyError, ProxyConnectionError
from asyncio import TimeoutError

from .config import TIMEOUT, PROXY, USER_AGENT
from . import utils as utl

class HttpClient(object):
    '''Performs HTTP requests. An `aiohttp` wrapper, essentially'''
    def __init__(self, timeout=TIMEOUT, proxy=PROXY):
        if proxy:
            connector = ProxyConnector.from_url(proxy)
            self.session = aiohttp.ClientSession(connector=connector)
        else:
            self.session = aiohttp.ClientSession()

        self.headers = {
            'User-Agent': USER_AGENT,
            'Accept-Language': 'en-GB,en;q=0.5',
        }

        self.timeout = timeout
        self.response = namedtuple('response', ['http', 'html'])

    async def close(self):
        await self.session.close()

    async def get(self, page):
        '''Submits an HTTP GET request.'''
        page = self._quote(page)
        try:
            req = await self.session.get(page, headers=self.headers, timeout=self.timeout)
            text = await req.text()
            self.headers['Referer'] = page
        except (aiohttp.ClientError, ProxyError, ProxyConnectionError) as e:
            return self.response(http=getattr(e, 'error_code', 0), html=str(e))
        except TimeoutError as e:
            return self.response(http=0, html='timeout')
        return self.response(http=req.status, html=text)

    async def post(self, page, data):
        '''Submits an HTTP POST request.'''
        page = self._quote(page)
        try:
            req = await self.session.post(page, data=data, headers=self.headers, timeout=self.timeout)
            text = await req.text()
            self.headers['Referer'] = page
        except (aiohttp.ClientError, ProxyError, ProxyConnectionError) as e:
            return self.response(http=getattr(e, 'error_code', 0), html=str(e))
        except TimeoutError as e:
            return self.response(http=0, html='timeout')
        return self.response(http=req.status, html=text)

    def _quote(self, url):
        '''URL-encodes URLs.'''
        if utl.unquote_url(url) == url:
            url = utl.quote_url(url)
        return url

I've tested it in some scenarios (connection errors, timeout errors and proxy errors in general) and it seems to work well. Could you take a look, in case I missed something?

soxoj commented 2 years ago

Looks great! 👍 Will you update this part of the code by yourself?

tasos-py commented 2 years ago

I'm thinking about it. I don't know if merging is the best option, it may be better to keep the two versions separate. My main concern is limited compatibility, my version requires Python >= 2.7 and I think your version requires Python >= 3.6. Another big issue for me is that I'm not that familiar with aiohttp, and I don't know if I'll be able to properly maintain and update this project.

I think it would be best to have two versions, one sync and one async, and let the users choose. I can put a link in my readme and point them to your async version. And you can change your fork however you want, of course. I'll be here if you need any help with bs4 or with any parts of the code I wrote. Sounds good?

soxoj commented 2 years ago

Well, I understand your concerns. I'll try to support the async version on my own. Should I rename the fork to something like async-search-engines-scraper, what do you think?

tasos-py commented 2 years ago

Yes, that's fine. Or you can choose something more creative if you want, it's up to you really. You could also set your own __version__ and __author__ and make any other required changes. Just let me know when it's ready, so I can link it in my readme.

soxoj commented 2 years ago

Hey, I renamed my fork to async-search-scraper and will update the package information soon. So you can now put a link in your readme pointing to the async version.

tasos-py commented 2 years ago

Thanks, I added it to my readme. Just let me know if you want to change the URL in the future.