rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Processing of a NodeBB Forum post does not terminate #69

Closed bm765 closed 3 years ago

bm765 commented 3 years ago

Hi,

Thanks for creating this software and keeping it alive!

When I run the following script, it never finishes; it seems to get stuck partway through.

Is there anything I can do about it?

It seems to me that this could affect NodeBB forums in general, not just this site.

Thanks in advance for looking into this!

import sys
import pywebcopy
from pywebcopy import save_webpage

print("pywebcopy.__version__=", pywebcopy.__version__)
print("sys.version=", sys.version)
print("sys.version_info=", sys.version_info)

url = r'https://bethesda.net/community/topic/208707/changing-the-language-of-the-game/3?language%5B%5D=en'

kwargs = {
    'project_name': 'some-fancy-name',
    'bypass_robots': True,  # fetch pages even if robots.txt restricts them
    'over_write': True,     # overwrite files from a previous run
}

save_webpage(
    url=url,
    project_folder='tmp',
    **kwargs
)

# 'fin' never appears in the log below, so this point is not reached.
print('fin')
sys.exit(1)

The log:

/home/user/venv/qtpylib/bin/python3.6 /home/user/PycharmProjects/webscraping/attic01.py
pywebcopy.__version__= 6.3.0
sys.version= 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
sys.version_info= sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
requests.packages.urllib3.connectionpool - INFO     - Starting new HTTPS connection (1): bethesda.net
pywebcopy.configs - INFO     - Got response 200 from https://bethesda.net/robots.txt
/home/user/venv/qtpylib/lib/python3.6/site-packages/pywebcopy/webpage.py:84: UserWarning: Global Configuration is not setup. You can ignore this if you are going manual.This is just one time warning regarding some unexpected behavior.
  "Global Configuration is not setup. You can ignore this if you are going manual."
pywebcopy.configs - WARNING  - Forcefully Accessing restricted website part https://bethesda.net/community/topic/208707/changing-the-language-of-the-game/3?language%5B%5D=en
pywebcopy.configs - INFO     - Got response 302 from https://bethesda.net/community/topic/208707/changing-the-language-of-the-game/3?language%5B%5D=en
pywebcopy.configs - INFO     - Got response 200 from https://bethesda.net/community/login
webpage    - INFO     - Starting save_complete Action on url: ['https://bethesda.net/community/login']
parsers    - INFO     - Parsing tree with source: <<requests.packages.urllib3.response.HTTPResponse object at 0x7f2392bcbc18>> encoding <utf-8> and parser <<lxml.etree.HTMLParser object at 0x7f2392ba4470>>
webpage    - INFO     - Starting save_assets Action on url: 'https://bethesda.net/community/login'
webpage    - Level 100 - Queueing download of <34> asset files.
pywebcopy.configs - WARNING  - Forcefully Accessing restricted website part https://bethesda.net/community/assets/client.css?v=ep3knp8q5oc
requests.packages.urllib3.connectionpool - INFO     - Starting new HTTPS connection (2): bethesda.net
webpage    - INFO     - Starting save_html Action on url: 'https://bethesda.net/community/login'
webpage    - INFO     - WebPage saved successfully to /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/44ba5b07__login.html
pywebcopy.configs - INFO     - Got response 200 from https://bethesda.net/shared/core/3/global.css
pywebcopy.configs - INFO     - Got response 200 from https://bethesda.net/community/assets/client.css?v=ep3knp8q5oc
elements   - INFO     - Writing file at location /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/eeeb7ede__client.css
elements   - INFO     - [133] CSS linked files are found in file [/home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/shared/core/3/80b7bf38__global.css]
elements   - INFO     - Writing file at location /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/shared/core/3/80b7bf38__global.css
elements   - INFO     - File of type .css written successfully to /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/shared/core/3/80b7bf38__global.css
elements   - INFO     - File of type .css written successfully to /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/eeeb7ede__client.css
pywebcopy.configs - WARNING  - Forcefully Accessing restricted website part https://bethesda.net/community/assets/client.css?v=ep3knp8q5oc
pywebcopy.configs - WARNING  - Forcefully Accessing restricted website part https://bethesda.net/community/assets/uploads/system/favicon.ico?v=ep3knp8q5oc
pywebcopy.configs - INFO     - Got response 200 from https://bethesda.net/community/assets/uploads/system/favicon.ico?v=ep3knp8q5oc
elements   - INFO     - Writing file at location /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/uploads/system/025318a4__favicon.ico
elements   - INFO     - File of type .ico written successfully to /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/uploads/system/025318a4__favicon.ico
pywebcopy.configs - WARNING  - Forcefully Accessing restricted website part https://bethesda.net/community/assets/uploads/system/favicon.ico?v=ep3knp8q5oc
pywebcopy.configs - INFO     - Got response 200 from https://bethesda.net/community/assets/client.css?v=ep3knp8q5oc
pywebcopy.configs - INFO     - Got response 200 from https://bethesda.net/community/assets/uploads/system/favicon.ico?v=ep3knp8q5oc
elements   - INFO     - [0] CSS linked files are found in file [/home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/uploads/system/025318a4__favicon.ico]
elements   - INFO     - Writing file at location /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/uploads/system/025318a4__favicon.ico
elements   - INFO     - File of type .ico written successfully to /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/uploads/system/025318a4__favicon.ico
elements   - INFO     - [98] CSS linked files are found in file [/home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/eeeb7ede__client.css]
elements   - INFO     - Writing file at location /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/eeeb7ede__client.css
elements   - INFO     - File of type .css written successfully to /home/user/PycharmProjects/webscraping/tmp/some-fancy-name/bethesda.net/community/assets/eeeb7ede__client.css

rajatomar788 commented 3 years ago

Use the single-threaded version of pywebcopy that @davidwgrossman made:

https://github.com/davidwgrossman/pywebcopy/commit/33f8e808bfb3e4816314357db2f5f0e9d384f2fe
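
To see where a run is stuck before switching versions, a small watchdog can help. The following is only a minimal sketch using the standard library; `run_with_watchdog` and `thread_stacks` are hypothetical helper names, not part of pywebcopy. It runs a callable in a daemon thread and, if the callable has not returned within the timeout, reports the current stack of every live thread.

```python
import sys
import threading
import traceback


def thread_stacks():
    """Return a text report with the current stack of every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread {} ({}):".format(names.get(ident, "?"), ident))
        lines.extend(entry.rstrip() for entry in traceback.format_stack(frame))
    return "\n".join(lines)


def run_with_watchdog(task, timeout_s):
    """Run task() in a daemon thread and wait at most timeout_s seconds.

    Returns (True, "") if task() returned in time, otherwise
    (False, report) where report shows what each thread is doing.
    The task itself is NOT cancelled; it keeps running in the background.
    """
    worker = threading.Thread(target=task, name="watched-task", daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return False, thread_stacks()
    return True, ""
```

With the script from the report, this could be called as `run_with_watchdog(lambda: save_webpage(url=url, project_folder='tmp', **kwargs), 120)`, where the 120-second timeout is an arbitrary choice. Note that the watchdog only reports the hang: any non-daemon threads a library starts can still keep the interpreter from exiting.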