rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Scraping process gets stuck #35

Closed TonySchneider closed 4 years ago

TonySchneider commented 4 years ago

    pywebcopy.configs - INFO - Got response 200 from http://fonts.gstatic.com/s/roboto/v15/NdF9MtnOpLzo-noMoG0miPesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/7e4c377bu0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/0e250b22PwZc-YbIL414wB9rB1IAPRJtnKITppOI_IvcXXDNrsc.woff2
    elements - INFO - [0] CSS linked files are found in file [/home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/cc07cc68NdF9MtnOpLzo-noMoG0miPesZW2xOQ-xsNqO47m55DA.woff2]
    elements - INFO - Writing file at location /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/cc07cc68NdF9MtnOpLzo-noMoG0miPesZW2xOQ-xsNqO47m55DA.woff2
    pywebcopy.configs - INFO - Got response 200 from http://fonts.gstatic.com/s/roboto/v15/gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    pywebcopy.configs - INFO - Got response 200 from http://fonts.gstatic.com/s/roboto/v15/u0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/cc07cc68NdF9MtnOpLzo-noMoG0miPesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - Writing file at location /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120__gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    elements - INFO - [0] CSS linked files are found in file [/home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/7e4c377bu0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2]
    elements - INFO - Writing file at location /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/7e4c377bu0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120__gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/7e4c377bu0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2
    pywebcopy.configs - INFO - Got response 200 from http://fonts.gstatic.com/s/roboto/v15/gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    elements - INFO - [0] CSS linked files are found in file [/home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2]
    elements - INFO - Writing file at location /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120__gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    root - INFO - Processing http://www.remixpr.in/Folder/index.php for certificate X.509 retrieval
    root - INFO - Fetching certificates from http://www.remixpr.in/Folder/index.php ended.
    certificate one line - MIIFpzCCBI+gAwIBAgISBFpnn7qN+tzenpqVs++XHHJpMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNVBAYTAlVTMRYwFAYDVQQKEw1MZXQncyBFbmNyeXB0MSMwIQYDVQQDExpMZXQncyBFbmNyeXB0IEF1dGhvcml0eSBYMzAeFw0xOTA5MjIyMTI3NDdaFw0xOTEyMjEyMTI3NDdaMBUxEzARBgNVBAMTCnJlbWl4cHIuaW4wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDL02a0T2ikBCOmO28cOgcw8HO7WGJAogcB4xW9PWleU/SlTbm7/nB1V7pL8BVdB+RJAgjNw81973s7mFx1UsULM+iaP+TwzoXAWNSW0uxCwg8/Psqz9oqw2DA5vIpwGM07CMTZ2LVupgu/HSL7FrSsRPOAr37XPer5zOmoCcg1V0eg7D3ild8xFY2XITn9ZIBr5uhTipnRJE5jkBBdvx3aAYOs4mdSToKfDPVauisSw44c3ngYYekx0kLN4NZBF2A7RYGTugZy6Cjz6eumxExKSphCOkPMpR8wGSvd+NtVAIwhL49V5P5XpTBbmpPUMAF63OujHPL5QIp01Vh6x3rDAgMBAAGjggK6MIICtjAOBgNVHQ8BAf8EBAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMAwGA1UdEwEB/wQCMAAwHQYDVR0OBBYEFGkGo178Qrb8WXpvhl8WVkg/tvH/MB8GA1UdIwQYMBaAFKhKamMEfd265tE5t6ZFZe/zqOyhMG8GCCsGAQUFBwEBBGMwYTAuBggrBgEFBQcwAYYiaHR0cDovL29jc3AuaW50LXgzLmxldHNlbmNyeXB0Lm9yZzAvBggrBgEFBQcwAoYjaHR0cDovL2NlcnQuaW50LXgzLmxldHNlbmNyeXB0Lm9yZy8wcQYDVR0RBGowaIIRY3BhbmVsLnJlbWl4cHIuaW6CD21haWwucmVtaXhwci5pboIKcmVtaXhwci5pboISd2ViZGlzay5yZW1peHByLmlughJ3ZWJtYWlsLnJlbWl4cHIuaW6CDnd3dy5yZW1peHByLmluMEwGA1UdIARFMEMwCAYGZ4EMAQIBMDcGCysGAQQBgt8TAQEBMCgwJgYIKwYBBQUHAgEWGmh0dHA6Ly9jcHMubGV0c2VuY3J5cHQub3JnMIIBAwYKKwYBBAHWeQIEAgSB9ASB8QDvAHUAKTxRllTIOWW6qlD8WAfUt2+/WHopctykwwz05UVH9HgAAAFtWxaNbQAABAMARjBEAiBpPidecYxpa8eG2BHncbS1y+dQ0EhqL1obDRhWvZWRvAIgDVxJid4sS4zBmingqLtsqDe43IsaKade0py8jjK0OjQAdgBvU3asMfAxGdiZAKRRFf93FRwR2QLBACkGjbIImjfZEwAAAW1bFo25AAAEAwBHMEUCIQCsJKuz2iJpwYQgoiHLB0KenepU1ce7hfgbabPjv5wtMgIgd16nc5T1eZUrhaVYWVyOMCUvx+q7Yca+lGvYim/DumswDQYJKoZIhvcNAQELBQADggEBAA4kkTiDPhcMrmAgg1xCb3eZyb/endaWVooO+TTgFoSNju9KPkhyCCkzB3SF3M1VbaZI+O/1lip5WV8JjoNxTKKt0eMo5PpyxPebBGoJ/XpetdJT8e/EjRa61CaiJnfY3rs5u9iH9wDD8M7CmrqkK5qD4S68TYcgCb4tXB4bPFklDZ37OkKrShzWN7gDKlkGk8XSUdYMuRn9M2RmLObeKbuZmjBrp2yyGuVTOlPjczFnmsx+21UbDKrqbW8AHPzg5YEbJDr1i7nFvF4ME73/BUc4NNH2s29fyxxKFF5H2GVp86OhI36G9f7OK1ZgxXTbRWiRUZIE0975Pi0GHaWgllE=
    root - INFO - Processing http://www.remixpr.in/Folder/index.php ended.
    root - INFO - Processing request to scarping url http://www.remixpr.in/Folder/index.php
    pywebcopy.configs - INFO - Got response 301 from http://www.remixpr.in/robots.txt
    pywebcopy.configs - INFO - Got response 200 from http://www.remixpr.in/Folder/index.php
    webpage - INFO - Starting save_html Action on url: 'http://www.remixpr.in/Folder/index.php'
    parsers - INFO - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7fe00489b690>> encoding and parser <<lxml.etree.HTMLParser object at 0x7fe0079bf910>>
    webpage - INFO - WebPage saved successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/www.remixpr.in/Folder/892302da__index.html
    webpage - INFO - Starting save_complete Action on url: ['http://www.remixpr.in/Folder/index.php']
    webpage - INFO - Starting save_assets Action on url: 'http://www.remixpr.in/Folder/index.php'
    webpage - Level 100 - Queueing download of <21> asset files.

Hi, the scraping process gets stuck every time after several minutes. It hangs on "webpage - Level 100 - Queueing download of <21> asset files."

Can someone assist me please?

rajatomar788 commented 4 years ago

What code are you using to initiate this process? There are several factors that can cause hanging. One could be server delay or a slow server. The second could be a threading problem, which can't be solved from the command line; you will have to manually join all of the threads to unfreeze them.

TonySchneider commented 4 years ago

How can I manually unfreeze them? My code:

import logging

import pywebcopy
import requests
from retry import retry  # assumption: the PyPI "retry" package, which matches the tries/delay/jitter kwargs used below
# ExceptionDecorator is the poster's own helper; its import is omitted here.


class ScrapingWrapper:
    def __init__(self, url, scraping_urls_path: str, scraping_url_directory: str, proxy: str = None):
        """
        This class is responsible for scraping a given website or URL and saving the
        website's content. The content can be HTML, CSS, JS, etc. The class can work
        with a proxy server.

        :param url: given website to scrape
        :param scraping_urls_path: main (root) directory in which to save the website's content
        :param scraping_url_directory: subdirectory under the main directory in which to save the content
        :param proxy: proxy server which can be used for scraping
        """
        logging.info(f"Processing request to scrape url {url}")
        self.url = url
        self.scraping_urls_path = scraping_urls_path
        self.scraping_url_directory = scraping_url_directory
        self.artifact_directory = self.scraping_urls_path + self.scraping_url_directory

        self.scraping_status = False
        self._config = self.__setup_config()

        if isinstance(self._config, pywebcopy.configs.ConfigHandler):
            self._web_page = pywebcopy.WebPage()
            self._session_status = self.__open_scraping_session(proxy)

            if self._session_status:
                self.scraping_status = self.__perform_scraping_all_files()

        logging.info(f"Scraping has been performed for '{self.url}' and ended with result '{self.scraping_status}'")

    @ExceptionDecorator(exceptions=[requests.exceptions.MissingSchema])
    def __setup_config(self) -> pywebcopy.configs.ConfigHandler:
        """
        This method is responsible for configuring "pywebcopy".
        :return: the pywebcopy ConfigHandler
        """
        return pywebcopy.config.setup_config(self.url, self.scraping_urls_path, self.scraping_url_directory, bypass_robots=True, over_write=True)

    @retry(exceptions=(requests.exceptions.HTTPError, requests.exceptions.RequestException), tries=3, delay=2, jitter=2)
    @ExceptionDecorator(exceptions=[requests.exceptions.HTTPError, requests.exceptions.RequestException, pywebcopy.exceptions.AccessError])
    def __open_scraping_session(self, proxy: str = None) -> bool:
        """
        This method is responsible for opening the connection to the requested url with "pywebcopy".

        :param proxy: proxy server
        :return: bool - True OR exception from "ExceptionDecorator"
        """
        self._web_page.get(self.url, proxies={"http": proxy, "https": proxy}, verify=False, timeout=30)
        return True

    @retry(exceptions=(requests.exceptions.HTTPError, requests.exceptions.RequestException), tries=3, delay=2, jitter=2)
    @ExceptionDecorator(exceptions=[requests.exceptions.HTTPError, requests.exceptions.RequestException, ValueError])
    def __perform_scraping_all_files(self) -> bool:
        """
        This method is responsible for scraping the requested website.

        :return: bool - True OR exception from "ExceptionDecorator"
        """
        self._web_page.save_html()
        self._web_page.save_complete()
        return True

I'm just creating an object with a URL, for example https://walla.co.il/.
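Roughly like this (the directory arguments below are illustrative, echoing the paths in the log above, not taken from a real config):

    wrapper = ScrapingWrapper(
        url="https://walla.co.il/",
        scraping_urls_path="/home/jenkins/phishing_consistency/output/",
        scraping_url_directory="DocuSign",
    )
    print(wrapper.scraping_status)  # True only if the whole save finished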

rajatomar788 commented 4 years ago

The threads are stored in the _threads attribute of the WebPage object, so just iterate over them and join them like any other thread.

rajatomar788 commented 4 years ago

Also, use only one save method (save_webpage, or save_complete as below) instead of calling save_html followed by save_complete:

@retry(exceptions=(requests.exceptions.HTTPError, requests.exceptions.RequestException), tries=3, delay=2, jitter=2)
@ExceptionDecorator(exceptions=[requests.exceptions.HTTPError, requests.exceptions.RequestException, ValueError])
def __perform_scraping_all_files(self) -> bool:
    """
    This method is responsible for scraping the requested website.

    :return: bool - True OR exception from "ExceptionDecorator"
    """
    return self._web_page.save_complete()
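For reference, save_webpage is the module-level helper documented in the pywebcopy 6.x README; it wraps config setup, fetching, and saving in a single call. A minimal sketch (the project folder is illustrative, and the extra kwargs are the same ones the wrapper above passes to setup_config):

    from pywebcopy import save_webpage

    save_webpage(
        url='http://www.remixpr.in/Folder/index.php',
        project_folder='/home/jenkins/phishing_consistency/output',  # illustrative path from the log
        bypass_robots=True,
        over_write=True,
    )
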
rajatomar788 commented 4 years ago

If you want an object-oriented interface, then you should definitely try the pywebcopy 7 beta version here: http://github.com/rajatomar788/pywebcopy7. It has a WebPage object implemented for this use case.
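As a rough sketch only (treat the exact signature as an assumption and verify it against the pywebcopy7 README): the one-shot helper in the 7.x beta reportedly also accepts a threaded flag, and disabling threading sidesteps the hanging sub-threads entirely.

    from pywebcopy import save_webpage

    # Assumed pywebcopy 7 beta API; check the pywebcopy7 README before relying on it.
    save_webpage(
        url='https://walla.co.il/',
        project_folder='/home/jenkins/phishing_consistency/output',  # illustrative
        bypass_robots=True,
        threaded=False,  # assumption: disables the background download threads in the 7.x beta
    )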

TonySchneider commented 4 years ago

@rajatomar788 Great, will check it. Thanks.

rajatomar788 commented 4 years ago

This could be a simple hack. For everyone searching for a solution in the future, it should do the trick.


# start the saving process
self._web_page.save_complete()

# join the sub-threads
for t in self._web_page._threads:
    if t.is_alive():
        t.join(timeout=1)

# location of the html file written
return self._web_page.file_path
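One caveat: t.join(timeout=1) returns after at most a second even if the thread is still downloading, so the loop above bounds the wait rather than guaranteeing completion. A hypothetical deadline-based variant (join_all and the 60-second default are illustrative helpers, not part of pywebcopy):

    import time

    def join_all(threads, deadline: float = 60.0) -> None:
        # Join worker threads, giving up after `deadline` seconds in total.
        end = time.monotonic() + deadline
        for t in threads:
            remaining = end - time.monotonic()
            if remaining <= 0:
                break  # overall deadline exhausted; leave the stragglers running
            if t.is_alive():
                t.join(timeout=remaining)

    # usage: join_all(self._web_page._threads)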