What code are you using to initiate this process? Several factors can cause hanging. One could be server delay or a slow server. Another could be a threading problem, which can't be solved from the command line. You will have to manually join all of the threads to unfreeze them.
How can I manually unfreeze them? My code:
```python
import logging

import pywebcopy
import requests
from retry import retry  # the "retry" package from PyPI

# ExceptionDecorator is a project-local decorator (defined elsewhere, not shown)
# that catches the listed exceptions.


class ScrapingWrapper:
    def __init__(self, url, scraping_urls_path: str, scraping_url_directory: str, proxy: str = None):
        """
        This class is responsible for scraping a given website or URL and saving
        the website's content. The content can be HTML, CSS, JS etc. The class
        can work through a proxy server.

        :param url: given website to scrape
        :param scraping_urls_path: main or root directory in which to save the website's content
        :param scraping_url_directory: sub-directory under the main one in which to save the content
        :param proxy: proxy server which can be used for scraping
        """
        logging.info(f"Processing request to scrape url {url}")
        self.url = url
        self.scraping_urls_path = scraping_urls_path
        self.scraping_url_directory = scraping_url_directory
        self.artifact_directory = self.scraping_urls_path + self.scraping_url_directory
        self.scraping_status = False
        self._config = self.__setup_config()
        if type(self._config) == pywebcopy.configs.ConfigHandler:
            self._web_page = pywebcopy.WebPage()
            self._session_status = self.__open_scarping_session(proxy)
            if self._session_status:
                self.scraping_status = self.__perform_scraping_all_files()
        logging.info(f"Scraping has been performed for '{self.url}' and ended with result '{self.scraping_status}'")

    @ExceptionDecorator(exceptions=[requests.exceptions.MissingSchema])
    def __setup_config(self) -> pywebcopy.configs.ConfigHandler:
        """
        This method is responsible for configuring "pywebcopy".
        :return:
        """
        return pywebcopy.config.setup_config(self.url,
                                             self.scraping_urls_path,
                                             self.scraping_url_directory,
                                             bypass_robots=True,
                                             over_write=True)

    @retry(exceptions=(requests.exceptions.HTTPError, requests.exceptions.RequestException), tries=3, delay=2, jitter=2)
    @ExceptionDecorator(exceptions=[requests.exceptions.HTTPError, requests.exceptions.RequestException, pywebcopy.exceptions.AccessError])
    def __open_scarping_session(self, proxy: str = None) -> bool:
        """
        This method is responsible for creating the connection to the requested URL with "pywebcopy".
        :param proxy: proxy server
        :return: bool - True OR exception from "ExceptionDecorator"
        """
        self._web_page.get(self.url, proxies={"http": proxy, "https": proxy}, verify=False, timeout=30)
        return True

    @retry(exceptions=(requests.exceptions.HTTPError, requests.exceptions.RequestException), tries=3, delay=2, jitter=2)
    @ExceptionDecorator(exceptions=[requests.exceptions.HTTPError, requests.exceptions.RequestException, ValueError])
    def __perform_scraping_all_files(self) -> bool:
        """
        This method is responsible for scraping the requested website.
        :return: bool - True OR exception from "ExceptionDecorator"
        """
        self._web_page.save_html()
        self._web_page.save_complete()
        return True
```
I'm just creating an object with a URL, for example https://walla.co.il/.
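For context, instantiating the wrapper above might look like this (a minimal sketch; the directory paths are hypothetical placeholders):

```python
# Hypothetical usage of the ScrapingWrapper class above;
# the paths are placeholders and must exist on the machine.
wrapper = ScrapingWrapper(url="https://walla.co.il/",
                          scraping_urls_path="/tmp/scraped/",
                          scraping_url_directory="walla")
print(wrapper.scraping_status)
```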
The threads are stored in the `_threads` attribute of the WebPage object, so just iterate over them and join them like any other thread.
Also, use only one save method (the example below keeps just `save_complete`):
```python
@retry(exceptions=(requests.exceptions.HTTPError, requests.exceptions.RequestException), tries=3, delay=2, jitter=2)
@ExceptionDecorator(exceptions=[requests.exceptions.HTTPError, requests.exceptions.RequestException, ValueError])
def __perform_scraping_all_files(self) -> bool:
    """
    This method is responsible for scraping the requested website.
    :return: bool - True OR exception from "ExceptionDecorator"
    """
    return self._web_page.save_complete()
```
If you want an object-oriented interface, then you should definitely try the pywebcopy 7 beta version here: http://github.com/rajatomar788/pywebcopy7. It has a WebPage object implemented for this use case.
@rajatomar788 Great, I will check it out. Thanks.
Here is a simple hack. For everyone searching for a solution in the future, it should do the trick.
```python
# start the saving process
self._web_page.save_complete()

# join the sub-threads
for t in self._web_page._threads:
    if t.is_alive():
        t.join(timeout=1)

# location of the html file written
return self._web_page.file_path
```
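For readers who want to try this hack outside the wrapper class above, a self-contained sketch might look like the following. It assumes a pywebcopy 6.x version in which `WebPage` exposes the internal `_threads` list, and the URL and folder names are placeholders:

```python
import pywebcopy


def save_page_and_wait(url: str, project_folder: str, project_name: str) -> str:
    """Save a page with pywebcopy, then join its download threads before returning."""
    pywebcopy.config.setup_config(url, project_folder, project_name,
                                  bypass_robots=True, over_write=True)
    web_page = pywebcopy.WebPage()
    web_page.get(url)

    # start the saving process (this spawns the asset-download threads)
    web_page.save_complete()

    # join the sub-threads so the process does not appear to hang
    for t in web_page._threads:
        if t.is_alive():
            t.join(timeout=1)

    # location of the html file written
    return web_page.file_path


# Hypothetical call; the folder and name are placeholders:
# html_path = save_page_and_wait("https://walla.co.il/", "/tmp/scraped/", "walla")
```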
Hi, the scraping process gets stuck every time after several minutes. It gets stuck on "webpage - Level 100 - Queueing download of <21> asset files."
Can someone please assist me?