jakevdp / PythonDataScienceHandbook

Python Data Science Handbook: full text in Jupyter Notebooks
http://jakevdp.github.io/PythonDataScienceHandbook
MIT License

Python for Data Science #366

Open Diamond-Ruby opened 1 year ago

Diamond-Ruby commented 1 year ago

Hello there! I'm trying to scrape data from the web for an analysis, but my code raises an error that I can't fix. I've pasted the code and the error below. Can anyone help, please?

import requests
from bs4 import BeautifulSoup

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

TimeoutError                              Traceback (most recent call last)
~\anaconda3\lib\site-packages\urllib3\connection.py in _new_conn(self)
    173 try:
--> 174 conn = connection.create_connection(
    175 (self._dns_host, self.port), self.timeout, **extra_kw

~\anaconda3\lib\site-packages\urllib3\util\connection.py in create_connection(address, timeout, source_address, socket_options)
     94 if err is not None:
---> 95 raise err
     96

~\anaconda3\lib\site-packages\urllib3\util\connection.py in create_connection(address, timeout, source_address, socket_options)
     84 sock.bind(source_address)
---> 85 sock.connect(sa)
     86 return sock

TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
~\anaconda3\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704 conn,

~\anaconda3\lib\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    385 try:
--> 386 self._validate_conn(conn)
    387 except (SocketTimeout, BaseSSLError) as e:

~\anaconda3\lib\site-packages\urllib3\connectionpool.py in _validate_conn(self, conn)
   1041 if not getattr(conn, "sock", None): # AppEngine might not have .sock
-> 1042 conn.connect()
   1043

~\anaconda3\lib\site-packages\urllib3\connection.py in connect(self)
    357 # Add certificate verification
--> 358 self.sock = conn = self._new_conn()
    359 hostname = self.host

~\anaconda3\lib\site-packages\urllib3\connection.py in _new_conn(self)
    185 except SocketError as e:
--> 186 raise NewConnectionError(
    187 self, "Failed to establish a new connection: %s" % e

NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000002095A7CD550>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
~\anaconda3\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    488 if not chunked:
--> 489 resp = conn.urlopen(
    490 method=request.method,

~\anaconda3\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    786
--> 787 retries = retries.increment(
    788 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]

~\anaconda3\lib\site-packages\urllib3\util\retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    591 if new_retry.is_exhausted():
--> 592 raise MaxRetryError(_pool, url, error or ResponseError(cause))
    593

MaxRetryError: HTTPSConnectionPool(host='www.airlinequality.com', port=443): Max retries exceeded with url: /airline-reviews/british-airways/page/1/?sortby=post_date%3ADesc&pagesize=100 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002095A7CD550>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_7652\3242930068.py in <module>
     14
     15 # Collect HTML data from this page
---> 16 response = requests.get(url)
     17
     18 # Parse content

~\anaconda3\lib\site-packages\requests\api.py in get(url, params, **kwargs)
     71 """
     72
---> 73 return request("get", url, params=params, **kwargs)
     74
     75

~\anaconda3\lib\site-packages\requests\api.py in request(method, url, **kwargs)
     57 # cases, and look like a memory leak in others.
     58 with sessions.Session() as session:
---> 59 return session.request(method=method, url=url, **kwargs)
     60
     61

~\anaconda3\lib\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    585 }
    586 send_kwargs.update(settings)
--> 587 resp = self.send(prep, **send_kwargs)
    588
    589 return resp

~\anaconda3\lib\site-packages\requests\sessions.py in send(self, request, **kwargs)
    699
    700 # Send the request
--> 701 r = adapter.send(request, **kwargs)
    702
    703 # Total elapsed time of the request (approximately)

~\anaconda3\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    563 raise SSLError(e, request=request)
    564
--> 565 raise ConnectionError(e, request=request)
    566
    567 except ClosedPoolError as e:

ConnectionError: HTTPSConnectionPool(host='www.airlinequality.com', port=443): Max retries exceeded with url: /airline-reviews/british-airways/page/1/?sortby=post_date%3ADesc&pagesize=100 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002095A7CD550>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

nivid26 commented 1 year ago

Hi, it looks like your code is fine, but there is a problem establishing a connection between your computer and the website. Check your internet connection and any firewall settings.
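One way to tell a network-level problem apart from a bug in the scraping code is to try opening a plain TCP connection to the host, without requests or BeautifulSoup involved. The sketch below uses only the standard library; the helper name `can_connect` and the 5-second timeout are assumptions made for illustration, not something from the original post.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # OSError covers timeouts, refused connections, and DNS lookup failures.
        print(f"Could not reach {host}:{port}: {exc}")
        return False


# Example call (needs network access):
# can_connect("www.airlinequality.com")
```

If this returns False for the site but True for other HTTPS hosts, the cause is more likely a firewall, proxy, or the site itself than the scraping loop.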

pushpitkamboj commented 1 year ago

Hey, is your problem solved, or do you still need help?

SudhanAnnamalai commented 1 year ago

It is probably an error on the server side.

Hope this helps!

Chirag529 commented 1 year ago

Hello, you are getting a TimeoutError because the connection attempt didn't receive a response within the allowed time. To resolve it, you can:

  1. Double check the URL you are trying to access.
  2. Check your internet connection.
  3. Check for any firewall or proxy server as they might block the requests.
  4. Set a timeout on the request and catch the exception you are getting.
  5. Add a User-Agent header, as some websites treat requests without one as suspicious and block them.
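Points 4 and 5 could be combined roughly as follows. This is a minimal sketch, not a guaranteed fix: the helper name `fetch_page`, the User-Agent string, the 10-second timeout, and the retry and backoff values are all assumptions chosen for illustration.

```python
import time

import requests

# Hypothetical header value; any realistic browser-like string could be used.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; review-scraper-example/0.1)"}


def fetch_page(url, session=None, timeout=10, retries=3, backoff=2.0):
    """Return the page body as bytes, or None if every attempt fails."""
    if session is None:
        session = requests.Session()
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, headers=HEADERS, timeout=timeout)
            response.raise_for_status()  # turn HTTP 4xx/5xx into exceptions
            return response.content
        except requests.exceptions.RequestException as exc:
            # RequestException covers ConnectionError, Timeout, and HTTPError.
            print(f"Attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                time.sleep(backoff * attempt)  # wait a little longer each time
    return None
```

In the loop from the original post, `response = requests.get(url)` would become `content = fetch_page(url)`, and a page whose `content` is `None` can be skipped with `continue` instead of stopping the whole run.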