rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

pywebcopy/configs.py, setup_paths method changes the working directory #33

Closed TonySchneider closed 4 years ago

TonySchneider commented 4 years ago

Hello, I encountered an issue with the scraping tool. The setup_paths method runs os.chdir(norm_p), which changes the working directory. The problem appears when I send several scraping requests with the same config: the files get written into recursively nested folders. For example, if I scrape into an output directory with the folder name 'test', the second request will try to scrape into /output/test/output/test/.. To work around it I added the following lines after each request: for n in range(3): os.chdir("..")

Please advise
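To illustrate the effect, here is a minimal sketch of what happens (the URLs and folder names are placeholders, and it assumes the pywebcopy.save_webpage call and config usage shown later in this thread):

    import os
    import pywebcopy

    pywebcopy.config['bypass_robots'] = True

    for url in ['https://example.com/', 'https://example.com/about']:
        print('cwd before save:', os.getcwd())
        # setup_paths() calls os.chdir(), so after the first save the process
        # is left inside output/test/..., and the relative 'output' folder of
        # the next call nests one level deeper: output/test/output/test/...
        pywebcopy.save_webpage(url=url, project_folder='output', project_name='test')
        print('cwd after save: ', os.getcwd())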

rajatomar788 commented 4 years ago

If you are worried about the working directory change, then you can always set up the config again before issuing a save_complete command on a webpage. The config is just a dictionary, and the key for the path is project_folder.
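For instance, something along these lines (a sketch, not tested; it assumes the config keys and save_webpage signature used elsewhere in this thread, and urls_to_save is a placeholder):

    import os
    import pywebcopy

    # Use an absolute path so a leftover chdir() cannot compound it.
    project_folder = os.path.abspath('output')

    for url in urls_to_save:  # urls_to_save: placeholder list of URLs to fetch
        # Re-apply the path key before each save, as suggested above.
        pywebcopy.config['project_folder'] = project_folder
        pywebcopy.save_webpage(url=url, project_folder=project_folder, project_name='test')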

kaavik commented 4 years ago

I also ran into this problem using pywebcopy.save_webpage(). Using pywebcopy in a loop is problematic because it seems you change the working directory to the project_folder path but never revert it. As a result, the project_folder becomes nested deeper and deeper on each iteration until the path gets too long and the call fails with the message below:

ExceptionType: <class 'OSError'>, Exception: OSError(63, 'File name too long')

I worked around it in my own code, per below, but I would consider this a bug in pywebcopy. The workaround captures the current directory up front and restores it after every pywebcopy call.

# Note: this method relies on os, time, urllib.parse and pywebcopy being imported.
def write_html_to_file(self, pfxlist):
    '''
    Given a list of dicts that each contain a 'CIDR' key (e.g. '104.207.85.233/32'),
    use pywebcopy to make an offline copy of the lookup results from https://somesite.com.
    :param pfxlist:
        List of dicts with a 'CIDR' key
    :return:
        The same list of dicts with an 'EMAIL' key added for each entry; its value is
        the name of the saved copy, or 'N/A' if the download failed.
        Ex: [{'CIDR': '104.207.85.233/32', 'EMAIL': '104.207.85.233_32.zip'}]
    '''
    pywebcopy.config['bypass_robots'] = True
    pywebcopy.config['over_write'] = True
    pywebcopy.config['http_headers']['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'

    # Preserve current directory
    current_dir = os.getcwd()
    # Loop through each 'CIDR' key in pfxlist
    for i in range(len(pfxlist)):
        cidr = pfxlist[i]['CIDR']
        url_safe_cidr = urllib.parse.quote_plus(cidr)
        ip, bits = cidr.split(sep='/')
        project_name = '{ip}_{bits}'.format(ip=ip, bits=bits)
        print("Preserving 'EMAIL' from somesite.com for {cidr}... ".format(cidr=cidr), end='')
        time.sleep(5)
        webpage = 'https://somesite.com/lookup?search={cidr}'.format(cidr=url_safe_cidr)
        pdir = '{dir}/'.format(dir=self.dir)
        try:
            pywebcopy.save_webpage(url=webpage, project_folder=pdir, project_name=project_name)
            pfxlist[i]['EMAIL'] = '{project}.zip'.format(project=project_name)
            print('Done')
        except Exception as e:
            print('ExceptionType: {t}, Exception: {r}'.format(t=type(e), r=repr(e)))
            pfxlist[i]['EMAIL'] = 'N/A'
        # Revert to original directory
        os.chdir(current_dir)

    return pfxlist
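A slightly more reusable variant of the same workaround (plain Python, not part of pywebcopy) is to wrap the save call in a small context manager that always restores the working directory, even when save_webpage raises:

    import os
    from contextlib import contextmanager

    @contextmanager
    def preserve_cwd():
        """Restore the working directory even if the wrapped call raises."""
        original = os.getcwd()
        try:
            yield
        finally:
            os.chdir(original)

    # Usage, with the variables from the snippet above:
    # with preserve_cwd():
    #     pywebcopy.save_webpage(url=webpage, project_folder=pdir, project_name=project_name)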
TonySchneider commented 4 years ago

Thanks @kaavik !