rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Scraping process gets stuck #35

Closed TonySchneider closed 4 years ago

TonySchneider commented 4 years ago

    pywebcopy.configs - INFO - Got response 200 from http://fonts.gstatic.com/s/roboto/v15/NdF9MtnOpLzo-noMoG0miPesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/7e4c377bu0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/0e250b22PwZc-YbIL414wB9rB1IAPRJtnKITppOI_IvcXXDNrsc.woff2
    elements - INFO - [0] CSS linked files are found in file [/home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/cc07cc68NdF9MtnOpLzo-noMoG0miPesZW2xOQ-xsNqO47m55DA.woff2]
    elements - INFO - Writing file at location /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/cc07cc68NdF9MtnOpLzo-noMoG0miPesZW2xOQ-xsNqO47m55DA.woff2
    pywebcopy.configs - INFO - Got response 200 from http://fonts.gstatic.com/s/roboto/v15/gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    pywebcopy.configs - INFO - Got response 200 from http://fonts.gstatic.com/s/roboto/v15/u0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/cc07cc68NdF9MtnOpLzo-noMoG0miPesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - Writing file at location /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120__gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    elements - INFO - [0] CSS linked files are found in file [/home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/7e4c377bu0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2]
    elements - INFO - Writing file at location /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/7e4c377bu0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120__gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/7e4c377bu0TOpm082MNkS5K0Q4rhqvesZW2xOQ-xsNqO47m55DA.woff2
    pywebcopy.configs - INFO - Got response 200 from http://fonts.gstatic.com/s/roboto/v15/gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    elements - INFO - [0] CSS linked files are found in file [/home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2]
    elements - INFO - Writing file at location /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120__gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    elements - INFO - File of type .woff2 written successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/fonts.gstatic.com/s/roboto/v15/bdef5120gwVJDERN2Amz39wrSoZ7FxTbgVql8nDJpwnrE27mub0.woff2
    root - INFO - Processing http://www.remixpr.in/Folder/index.php for certificate X.509 retrieval
    root - INFO - Fetching certificates from http://www.remixpr.in/Folder/index.php ended.
    certificate one line - MIIFpzCCBI+gAwIBAgISBFpnn7qN+tzenpqVs++XHHJpMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNVBAYTAlVTMRYwFAYDVQQKEw1MZXQncyBFbmNyeXB0MSMwIQYDVQQDExpMZXQncyBFbmNyeXB0IEF1dGhvcml0eSBYMzAeFw0xOTA5MjIyMTI3NDdaFw0xOTEyMjEyMTI3NDdaMBUxEzARBgNVBAMTCnJlbWl4cHIuaW4wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDL02a0T2ikBCOmO28cOgcw8HO7WGJAogcB4xW9PWleU/SlTbm7/nB1V7pL8BVdB+RJAgjNw81973s7mFx1UsULM+iaP+TwzoXAWNSW0uxCwg8/Psqz9oqw2DA5vIpwGM07CMTZ2LVupgu/HSL7FrSsRPOAr37XPer5zOmoCcg1V0eg7D3ild8xFY2XITn9ZIBr5uhTipnRJE5jkBBdvx3aAYOs4mdSToKfDPVauisSw44c3ngYYekx0kLN4NZBF2A7RYGTugZy6Cjz6eumxExKSphCOkPMpR8wGSvd+NtVAIwhL49V5P5XpTBbmpPUMAF63OujHPL5QIp01Vh6x3rDAgMBAAGjggK6MIICtjAOBgNVHQ8BAf8EBAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMAwGA1UdEwEB/wQCMAAwHQYDVR0OBBYEFGkGo178Qrb8WXpvhl8WVkg/tvH/MB8GA1UdIwQYMBaAFKhKamMEfd265tE5t6ZFZe/zqOyhMG8GCCsGAQUFBwEBBGMwYTAuBggrBgEFBQcwAYYiaHR0cDovL29jc3AuaW50LXgzLmxldHNlbmNyeXB0Lm9yZzAvBggrBgEFBQcwAoYjaHR0cDovL2NlcnQuaW50LXgzLmxldHNlbmNyeXB0Lm9yZy8wcQYDVR0RBGowaIIRY3BhbmVsLnJlbWl4cHIuaW6CD21haWwucmVtaXhwci5pboIKcmVtaXhwci5pboISd2ViZGlzay5yZW1peHByLmlughJ3ZWJtYWlsLnJlbWl4cHIuaW6CDnd3dy5yZW1peHByLmluMEwGA1UdIARFMEMwCAYGZ4EMAQIBMDcGCysGAQQBgt8TAQEBMCgwJgYIKwYBBQUHAgEWGmh0dHA6Ly9jcHMubGV0c2VuY3J5cHQub3JnMIIBAwYKKwYBBAHWeQIEAgSB9ASB8QDvAHUAKTxRllTIOWW6qlD8WAfUt2+/WHopctykwwz05UVH9HgAAAFtWxaNbQAABAMARjBEAiBpPidecYxpa8eG2BHncbS1y+dQ0EhqL1obDRhWvZWRvAIgDVxJid4sS4zBmingqLtsqDe43IsaKade0py8jjK0OjQAdgBvU3asMfAxGdiZAKRRFf93FRwR2QLBACkGjbIImjfZEwAAAW1bFo25AAAEAwBHMEUCIQCsJKuz2iJpwYQgoiHLB0KenepU1ce7hfgbabPjv5wtMgIgd16nc5T1eZUrhaVYWVyOMCUvx+q7Yca+lGvYim/DumswDQYJKoZIhvcNAQELBQADggEBAA4kkTiDPhcMrmAgg1xCb3eZyb/endaWVooO+TTgFoSNju9KPkhyCCkzB3SF3M1VbaZI+O/1lip5WV8JjoNxTKKt0eMo5PpyxPebBGoJ/XpetdJT8e/EjRa61CaiJnfY3rs5u9iH9wDD8M7CmrqkK5qD4S68TYcgCb4tXB4bPFklDZ37OkKrShzWN7gDKlkGk8XSUdYMuRn9M2RmLObeKbuZmjBrp2yyGuVTOlPjczFnmsx+21UbDKrqbW8AHPzg5YEbJDr1i7nFvF4ME73/BUc4NNH2s29fyxxKFF5H2GVp86OhI36G9f7OK1ZgxXTbRWiRUZIE0975Pi0GHaWgllE=
    root - INFO - Processing http://www.remixpr.in/Folder/index.php ended.
    root - INFO - Processing request to scarping url http://www.remixpr.in/Folder/index.php
    pywebcopy.configs - INFO - Got response 301 from http://www.remixpr.in/robots.txt
    pywebcopy.configs - INFO - Got response 200 from http://www.remixpr.in/Folder/index.php
    webpage - INFO - Starting save_html Action on url: 'http://www.remixpr.in/Folder/index.php'
    parsers - INFO - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7fe00489b690>> encoding and parser <<lxml.etree.HTMLParser object at 0x7fe0079bf910>>
    webpage - INFO - WebPage saved successfully to /home/jenkins/phishing_consistency/output/DocuSign/08-03-2020/AS12876/www.remixpr.in/Folder/892302da__index.html
    webpage - INFO - Starting save_complete Action on url: ['http://www.remixpr.in/Folder/index.php']
    webpage - INFO - Starting save_assets Action on url: 'http://www.remixpr.in/Folder/index.php'
    webpage - Level 100 - Queueing download of <21> asset files.

Hi, the scraping process gets stuck every time after several minutes. It hangs on "webpage - Level 100 - Queueing download of <21> asset files."

Can someone assist me please?

rajatomar788 commented 4 years ago

What code are you using to initiate this process? There are several factors that can cause hanging. One could be server delay or a slow server. The second could be a threading problem, which can't be solved from the command line; you will have to manually join all of the threads to unfreeze them.

TonySchneider commented 4 years ago

How can I manually unfreeze them? My code:

import logging

import pywebcopy
import requests
from retry import retry  # assumption: the PyPI "retry" package, which matches the tries/delay/jitter kwargs used below
# ExceptionDecorator is the poster's own helper; its import is omitted here.


class ScrapingWrapper:
    def __init__(self, url, scraping_urls_path: str, scraping_url_directory: str, proxy: str = None):
        """
        This class is responsible for scraping a given website or URL and saving the
        website's content. The content can be HTML, CSS, JS, etc. The class can work
        with a proxy server.

        :param url: given website to scrape
        :param scraping_urls_path: main (root) directory in which to save the website's content
        :param scraping_url_directory: subdirectory under the main directory in which to save the content
        :param proxy: proxy server which can be used for scraping
        """
        logging.info(f"Processing request to scrape url {url}")
        self.url = url
        self.scraping_urls_path = scraping_urls_path
        self.scraping_url_directory = scraping_url_directory
        self.artifact_directory = self.scraping_urls_path + self.scraping_url_directory

        self.scraping_status = False
        self._config = self.__setup_config()

        if isinstance(self._config, pywebcopy.configs.ConfigHandler):
            self._web_page = pywebcopy.WebPage()
            self._session_status = self.__open_scraping_session(proxy)

            if self._session_status:
                self.scraping_status = self.__perform_scraping_all_files()

        logging.info(f"Scraping has been performed for '{self.url}' and ended with result '{self.scraping_status}'")

    @ExceptionDecorator(exceptions=[requests.exceptions.MissingSchema])
    def __setup_config(self) -> pywebcopy.configs.ConfigHandler:
        """
        This method is responsible for configuring "pywebcopy".
        :return: the pywebcopy ConfigHandler
        """
        return pywebcopy.config.setup_config(self.url, self.scraping_urls_path, self.scraping_url_directory, bypass_robots=True, over_write=True)

    @retry(exceptions=(requests.exceptions.HTTPError, requests.exceptions.RequestException), tries=3, delay=2, jitter=2)
    @ExceptionDecorator(exceptions=[requests.exceptions.HTTPError, requests.exceptions.RequestException, pywebcopy.exceptions.AccessError])
    def __open_scraping_session(self, proxy: str = None) -> bool:
        """
        This method is responsible for opening the connection to the requested url with "pywebcopy".

        :param proxy: proxy server
        :return: bool - True OR exception from "ExceptionDecorator"
        """
        self._web_page.get(self.url, proxies={"http": proxy, "https": proxy}, verify=False, timeout=30)
        return True

    @retry(exceptions=(requests.exceptions.HTTPError, requests.exceptions.RequestException), tries=3, delay=2, jitter=2)
    @ExceptionDecorator(exceptions=[requests.exceptions.HTTPError, requests.exceptions.RequestException, ValueError])
    def __perform_scraping_all_files(self) -> bool:
        """
        This method is responsible for scraping the requested website.

        :return: bool - True OR exception from "ExceptionDecorator"
        """
        self._web_page.save_html()
        self._web_page.save_complete()
        return True

I'm just creating an object with a URL, for example https://walla.co.il/.
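Roughly like this (the directory arguments below are illustrative, echoing the paths in the log above, not taken from a real config):

    wrapper = ScrapingWrapper(
        url="https://walla.co.il/",
        scraping_urls_path="/home/jenkins/phishing_consistency/output/",
        scraping_url_directory="DocuSign",
    )
    print(wrapper.scraping_status)  # True only if the whole save finished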

rajatomar788 commented 4 years ago

The threads are stored in the _threads attribute of the WebPage object, so just iterate over them and join them like any other thread.

rajatomar788 commented 4 years ago

Also, use only one save method (save_webpage, or save_complete as below) instead of calling save_html followed by save_complete:

@retry(exceptions=(requests.exceptions.HTTPError, requests.exceptions.RequestException), tries=3, delay=2, jitter=2)
@ExceptionDecorator(exceptions=[requests.exceptions.HTTPError, requests.exceptions.RequestException, ValueError])
def __perform_scraping_all_files(self) -> bool:
    """
    This method is responsible for scraping the requested website.

    :return: bool - True OR exception from "ExceptionDecorator"
    """
    return self._web_page.save_complete()
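For reference, save_webpage is the module-level helper documented in the pywebcopy 6.x README; it wraps config setup, fetching, and saving in a single call. A minimal sketch (the project folder is illustrative, and the extra kwargs are the same ones the wrapper above passes to setup_config):

    from pywebcopy import save_webpage

    save_webpage(
        url='http://www.remixpr.in/Folder/index.php',
        project_folder='/home/jenkins/phishing_consistency/output',  # illustrative path from the log
        bypass_robots=True,
        over_write=True,
    )
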
rajatomar788 commented 4 years ago

If you want an object-oriented interface, then you should definitely try the pywebcopy 7 beta version here: http://github.com/rajatomar788/pywebcopy7. It has a WebPage object implemented for this use case.
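As a rough sketch only (treat the exact signature as an assumption and verify it against the pywebcopy7 README): the one-shot helper in the 7.x beta reportedly also accepts a threaded flag, and disabling threading sidesteps the hanging sub-threads entirely.

    from pywebcopy import save_webpage

    # Assumed pywebcopy 7 beta API; check the pywebcopy7 README before relying on it.
    save_webpage(
        url='https://walla.co.il/',
        project_folder='/home/jenkins/phishing_consistency/output',  # illustrative
        bypass_robots=True,
        threaded=False,  # assumption: disables the background download threads in the 7.x beta
    )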

TonySchneider commented 4 years ago

@rajatomar788 Great, will check it. Thanks.

rajatomar788 commented 4 years ago

This could be a simple hack. For everyone searching for a solution in the future, it should do the trick.


# start the saving process
self._web_page.save_complete()

# join the sub-threads
for t in self._web_page._threads:
    if t.is_alive():
        t.join(timeout=1)

# location of the html file written
return self._web_page.file_path
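One caveat: t.join(timeout=1) returns after at most a second even if the thread is still downloading, so the loop above bounds the wait rather than guaranteeing completion. A hypothetical deadline-based variant (join_all and the 60-second default are illustrative helpers, not part of pywebcopy):

    import time

    def join_all(threads, deadline: float = 60.0) -> None:
        # Join worker threads, giving up after `deadline` seconds in total.
        end = time.monotonic() + deadline
        for t in threads:
            remaining = end - time.monotonic()
            if remaining <= 0:
                break  # overall deadline exhausted; leave the stragglers running
            if t.is_alive():
                t.join(timeout=remaining)

    # usage: join_all(self._web_page._threads)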