rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

inconsistent handling of filetypes #48

Closed youngblood closed 4 years ago

youngblood commented 4 years ago

Using the 'WebPage' class and WebPage.save_assets(), with pywebcopy.config['allowed_file_ext'] = ['.html','.css'] set explicitly, I'm seeing inconsistent handling of some filetypes. Specifically, it seems to misinterpret filetypes at times:

image

From what I can tell, the same issue does not happen when using pywebcopy.save_webpage().

Code:


```python
# -*- coding: utf-8 -*-

import time
import threading

import pywebcopy

preferred_clock = time.time

project_folder = '/Users/reed/Downloads/scraped_content'
project_name = 'example_project'

urls = [
    'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df',
    'https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html',
    'http://www.history.com/topics/cold-war/hollywood-ten'
]

# NOTE: project_url was undefined in the snippet as posted (NameError).
# Using the first URL here is an assumption made only so the script runs.
project_url = urls[0]

pywebcopy.config.setup_config(
    project_url=project_url,
    project_folder=project_folder,
    project_name=project_name,
    over_write=True,
    bypass_robots=True,
    debug=False,
    log_file='/Users/reed/Downloads/scraped_content/pwc_log.log',
    join_timeout=1,
    load_css=False,
    load_images=False,
    load_javascript=False
)

# Restrict downloads to HTML and CSS files only.
pywebcopy.config['allowed_file_ext'] = ['.html', '.css']  # ,'svg','.js','.jpg','.png','.htm','jpeg']

start = preferred_clock()

# method_1
for url in urls:
    pywebcopy.save_webpage(url=url,
                           project_folder=project_folder,
                           project_name=project_name,
                           join_timeout=1,
                           load_css=False,
                           load_images=False,
                           load_javascript=False)
    # Wait for any remaining worker threads before moving to the next URL.
    for thread in threading.enumerate():
        if thread is not threading.main_thread():
            thread.join()

print("Execution time : ", preferred_clock() - start)
```

youngblood commented 4 years ago

I have actually since seen some instances of this with the save_webpage() method. I also just saw this in a run where '.svg' is explicitly allowed via

pywebcopy.config['allowed_file_ext'] = ['.html','.css','svg','.js',
                                        '.jpg','.png','.htm','jpeg',
                                        '.php','.asp','.aspx','xhtml',
                                        '.xml','.gif','.pdf','.json']

image
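
A side note on the list above: it mixes entries with and without a leading dot ('svg', 'jpeg', 'xhtml' versus '.png'). Nothing in this thread confirms whether pywebcopy normalizes these. Assuming extensions are compared against something like `os.path.splitext` output, which always includes the dot, a small hypothetical helper (`normalize_exts` is not part of pywebcopy) could make the list consistent before assigning it:

```python
import os


def normalize_exts(exts):
    """Return extensions lowercased with exactly one leading dot.

    Hypothetical helper for illustration; not a pywebcopy function.
    """
    return ['.' + e.strip().lstrip('.').lower() for e in exts if e.strip()]


allowed = normalize_exts(['.html', '.css', 'svg', '.js', 'jpeg', 'xhtml'])

# os.path.splitext returns the extension including its leading dot,
# so a bare 'svg' entry would never equal it.
ext = os.path.splitext('logo.svg')[1]
print(ext in allowed)
```

The normalized list could then be assigned to pywebcopy.config['allowed_file_ext'] as in the snippet above.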

rajatomar788 commented 4 years ago

It is working as expected. The .png files that are downloaded despite the config restrictions are files linked from inside CSS. For example, if a CSS rule sets a background property to an image, the file is downloaded irrespective of the config restrictions, and you would see a message like

.css file type is allowed for image.jpeg

So it is expected behaviour for any kind of file that is found inside css rules.
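
The behaviour described above can be illustrated with a minimal sketch. This is not pywebcopy's actual implementation: the regex, the `css_linked_urls` and `should_download` names, and the decision rule are all assumptions, chosen only to show why a .png can be saved when just .html and .css are allowed.

```python
import os
import re

# Matches url(...) references in CSS, with or without quotes.
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)", re.IGNORECASE)


def css_linked_urls(css_text):
    """Return every URL referenced through url(...) in a stylesheet."""
    return CSS_URL_RE.findall(css_text)


def should_download(url, allowed_exts, found_in_css=False):
    """Hypothetical rule: files found inside CSS bypass the extension filter."""
    if found_in_css:
        return True
    ext = os.path.splitext(url)[1].lower()
    return ext in allowed_exts


css = "body { background: url('img/bg.png'); } .logo { background-image: url(logo.jpeg) }"
allowed = ['.html', '.css']

linked = css_linked_urls(css)
for link in linked:
    # Both images are fetched because they were discovered inside a CSS file.
    print(link, should_download(link, allowed, found_in_css=True))
```

Under this assumed rule, the same image referenced directly from an `<img>` tag would be filtered out, while the copy referenced from the stylesheet would still be saved.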

youngblood commented 4 years ago

That makes perfect sense - thank you for replying so quickly!

rajatomar788 commented 4 years ago

This issue is resolved.