rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

load_css/images/javascript arguments not working #47

Closed · youngblood closed this issue 4 years ago

youngblood commented 4 years ago

Using the load_css, load_images, and load_javascript arguments with config.setup_config() and save_webpage() doesn't seem to restrict the types of files downloaded. Setting all of them to False still resulted in CSS, image, and JavaScript files being downloaded.

That said, they do appear to have some effect. When I set those arguments to False via config.setup_config() alone, the code below still hangs while saving the first URL in the list. But when I also pass those parameters to save_webpage(), it still downloads all of those file types (so it doesn't work as I would expect), yet it does let the program run to completion. It's unclear why passing those arguments directly to save_webpage() allows the program to finish.

Next, I tried to set the allowed_file_ext argument for both config.setup_config() and save_webpage() but neither accepts that argument.

So finally I set config['allowed_file_ext'] = ['.html','.css','svg','.js','.jpg','.png','.htm','jpeg'] directly, and that did restrict the downloaded file types for the most part, although it still downloads some other types such as '.pwc'.

Code:

```
# -*- coding: utf-8 -*-

import os
import time
import threading

import pywebcopy

preferred_clock = time.time

project_folder = '/Users/reed/Downloads/scraped_content'
project_name = 'example_project'

urls = [
    'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df',
    'https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html',
    'http://www.history.com/topics/cold-war/hollywood-ten'
]

pywebcopy.config.setup_config(
    project_url=urls[0],
    project_folder=project_folder,
    project_name=project_name,
    over_write=True,
    bypass_robots=True,
    debug=False,
    log_file='/Users/reed/Downloads/scraped_content/pwc_log.log',
    join_timeout=5,
    load_css=False,
    load_images=False,
    load_javascript=False
)

start = preferred_clock()

# pywebcopy.config['allowed_file_ext'] = ['.html','.css','svg','.js','.jpg','.png','.htm','jpeg']

# method_1
for url in urls:
    pywebcopy.save_webpage(url=url,
                           project_folder=project_folder,
                           project_name=project_name,
                           join_timeout=5)#,
                           #load_css=False,
                           #load_images=False,
                           #load_javascript=False)

for thread in threading.enumerate():
    if thread == threading.main_thread():
        continue
    else:
        thread.join()

print("Execution time : ", preferred_clock() - start)```
rajatomar788 commented 4 years ago

The load_css, load_javascript and load_images arguments are only accepted by the api functions save_webpage and save_website.

If you want to see the implementation for yourself, it is here:

https://github.com/rajatomar788/pywebcopy/blob/0741bb75aafca63152d68a4de11a234afd91f913/pywebcopy/api.py#L66

Yes, there shouldn't be a circular config implementation; it should be refactored in the next release.
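
For example, adapting the loop from the snippet above, the switches would be passed straight to save_webpage() (a minimal sketch; only where the flags are passed changes):

```
import pywebcopy

# Values reused from the snippet in this issue.
project_folder = '/Users/reed/Downloads/scraped_content'
project_name = 'example_project'
url = 'http://www.history.com/topics/cold-war/hollywood-ten'

# Pass the load_* switches to the api function itself rather than to setup_config().
pywebcopy.save_webpage(
    url=url,
    project_folder=project_folder,
    project_name=project_name,
    join_timeout=5,
    load_css=False,
    load_images=False,
    load_javascript=False,
)
```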

youngblood commented 4 years ago

That makes sense. Is it true that setting pywebcopy.config['allowed_file_ext'] directly should work for both the API and the WebPage class? Is there a preferred way to set those restrictions when working with the WebPage class? Thanks!

rajatomar788 commented 4 years ago

Yes. Setting pywebcopy.config['allowed_file_ext'] directly should work for both the API and the WebPage class.
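
A rough sketch of that with the WebPage class, assuming the 6.x workflow of setup_config(), WebPage().get() and save_complete() (treat those method names as an assumption if you are on a different version):

```
import pywebcopy
from pywebcopy import WebPage, config

url = 'http://www.history.com/topics/cold-war/hollywood-ten'

config.setup_config(
    project_url=url,
    project_folder='/Users/reed/Downloads/scraped_content',
    project_name='example_project',
    bypass_robots=True,
)

# Restrict which asset extensions get downloaded; the api functions and
# the WebPage class both read this shared config.
config['allowed_file_ext'] = ['.html', '.htm', '.css', '.js']

wp = WebPage()
wp.get(url)         # fetch and parse the page
wp.save_complete()  # save the page along with its allowed assets
```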

rajatomar788 commented 4 years ago

I am closing it as it is resolved.