officialpm / scrape-amazon

🤩 Python Package for Scraping Amazon Product Reviews ✨
https://pypi.org/project/scrape-amazon
MIT License

OSError: [Errno 101] Network is unreachable #10

Open devteam1-spark6 opened 1 year ago

devteam1-spark6 commented 1 year ago

My code:

from scrape_amazon import get_reviews

reviews = get_reviews('in','B09RMG1M98')

Error log:

[INFO] Scraping Reviews of Amazon ProductID - B09RMG1M98
[scrape-amazon] - Amazon.in:Customer reviews: realme narzo 50 (Speed Black, 4GB RAM+64GB Storage) Helio G96 Processor | 50MP AI Triple Camera | 120Hz Ultra Smooth Display
[scrape-amazon] Total Pages - 78
[scrape-amazon] Total Reviews - 773

71%|██████████████████████████████████████████████████ | 55/78 [02:21<00:58, 2.56s/it]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f7c8b9c1780>: Failed to establish a new connection: [Errno 101] Network is unreachable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/requests/adapters.py", line 450, in send
    timeout=timeout
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/connectionpool.py", line 788, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.amazon.in', port=443): Max retries exceeded with url: /dp/product-reviews/B09RMG1M98?pageNumber=56 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f7c8b9c1780>: Failed to establish a new connection: [Errno 101] Network is unreachable',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/multiprocess/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/scrape_amazon/util/scrape.py", line 27, in extractPage
    r = get_URL(url)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/scrape_amazon/util/urlFunctions.py", line 30, in get_URL
    content: str = requests.get(url, headers={"User-Agent": user_agent})
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.amazon.in', port=443): Max retries exceeded with url: /dp/product-reviews/B09RMG1M98?pageNumber=56 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f7c8b9c1780>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tops/tp/scrapping.py", line 3, in <module>
    reviews = get_reviews('in','B09RMG1M98')
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/scrape_amazon/scraper.py", line 17, in get_reviews
    return scrape_reviews(all_reviews_url, domain)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/scrape_amazon/util/scrape.py", line 132, in scrape_reviews
    results = p_map(extractPage, urlsToFetch)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/p_tqdm/p_tqdm.py", line 65, in p_map
    result = list(generator)
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/p_tqdm/p_tqdm.py", line 54, in _parallel
    for item in tqdm_func(map_func(function, *iterables), total=length, **kwargs):
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/tops/environments/tp_env/lib/python3.6/site-packages/multiprocess/pool.py", line 735, in next
    raise value
requests.exceptions.ConnectionError: None: Max retries exceeded with url: /dp/product-reviews/B09RMG1M98?pageNumber=56 (Caused by None)

My guess is that the requests are getting blocked after a certain number of consecutive attempts. Please let me know if there is a solution.
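For reference, one possible workaround is a retry-with-backoff wrapper around the page request. This is only a sketch with a hypothetical helper name (get_with_retries is not part of scrape-amazon); it retries a fetch a few times with an increasing delay instead of failing the whole run on the first dropped connection:

import time
import requests

def get_with_retries(url, retries=5, base_delay=2.0, user_agent="Mozilla/5.0"):
    # Hypothetical helper, not part of scrape-amazon: fetch a URL and retry
    # with an increasing delay when the connection is dropped or blocked.
    for attempt in range(retries):
        try:
            return requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
        except requests.exceptions.ConnectionError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (attempt + 1))  # back off before retrying

# Example: fetch one review page directly
resp = get_with_retries("https://www.amazon.in/dp/product-reviews/B09RMG1M98?pageNumber=56")
print(resp.status_code)

With the current release this has to be done outside get_reviews, since the per-page fetching happens inside the package.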

batmanscode commented 1 year ago

Something like an optional time=10 parameter (a delay, in seconds, between consecutive requests) could be useful for this.
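If something like that were added, the sketch below shows the idea: the proposed usage as a comment, plus one way the delay could be applied around the package's per-page fetch (get_URL, the helper visible in the traceback above). Both parts are illustrative only and not the current API:

import time
from scrape_amazon.util.urlFunctions import get_URL  # per-page fetch seen in the traceback

# Proposed (not yet implemented) usage:
#   reviews = get_reviews('in', 'B09RMG1M98', time=10)  # wait 10 s between pages

def get_URL_spaced(url, delay=10):
    # Illustrative sketch: space out consecutive page requests by `delay`
    # seconds before delegating to the existing fetch helper.
    time.sleep(delay)
    return get_URL(url)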