TonySchneider closed this issue 4 years ago
If you are worried about the working-directory change, then you can always set up the config again before issuing a save_webpage command.
The config is just a dictionary, and the key for the path is project_folder.
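Since the config is just a dictionary, re-applying it before each call can be sketched like this (a plain dict stands in for pywebcopy.config here, so the sketch runs without the library; the key names come from the comments above):

```python
import os

# Stand-in for pywebcopy.config, which per the note above is just a dict.
config = {}

def setup_config(project_folder):
    # Store an absolute path so re-applying the config before each
    # save_webpage call stays correct even after the library has
    # chdir'd into a subfolder.
    config['project_folder'] = os.path.abspath(project_folder)
    config['bypass_robots'] = True
    config['over_write'] = True
```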
I also ran into this problem using pywebcopy.save_webpage(). Using pywebcopy in a loop is problematic because it seems the library changes directories to the project_folder path but does not revert afterwards. Thus, if you call pywebcopy in a loop, the project_folder becomes nested deeper and deeper until the path becomes too long and fails with the message below:
ExceptionType: <class 'OSError'>, Exception: OSError(63, 'File name too long')
I worked around it in my own code per below but would consider it a bug in pywebcopy. As a workaround, I reset the current path back to whatever it was originally before using pywebcopy.
import os
import time
import urllib.parse

import pywebcopy

def write_html_to_file(self, pfxlist):
    '''
    Given a list of dicts with k:v of {'IPS': [{'DOMAIN': 'somedomain.com'}, {'DOMAIN': 'otherdomain.com'}]},
    use pywebcopy to make an offline copy of web results from http://somesite.com
    :param pfxlist:
        List of dicts with 'IPS' key and 'DOMAIN' subkey
    :return:
        List of dicts with an 'EMAIL' key added to each existing dict.
        Value of k = 'EMAIL' is a string with the path to the downloaded HTML page.
        Ex: [{'CIDR': '104.207.85.233/32', 'EMAIL': '104.207.85.233_32.zip'}]
    '''
    pywebcopy.config['bypass_robots'] = True
    pywebcopy.config['over_write'] = True
    pywebcopy.config['http_headers']['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    # Preserve the current directory before pywebcopy chdirs away from it
    current_dir = os.getcwd()
    # Loop through each 'CIDR' key in pfxlist
    for entry in pfxlist:
        cidr = entry['CIDR']
        url_safe_cidr = urllib.parse.quote_plus(cidr)
        ip, bits = cidr.split(sep='/')
        project_name = '{ip}_{bits}'.format(ip=ip, bits=bits)
        print("Preserving 'EMAIL' from somesite.com for {cidr}... ".format(cidr=cidr), end='')
        time.sleep(5)
        webpage = 'https://somesite.com/lookup?search={cidr}'.format(cidr=url_safe_cidr)
        pdir = '{dir}/'.format(dir=self.dir)
        try:
            pywebcopy.save_webpage(url=webpage, project_folder=pdir, project_name=project_name)
            entry['EMAIL'] = '{project}.zip'.format(project=project_name)
            print('Done')
        except Exception as e:
            print('ExceptionType: {t}, Exception: {r}'.format(t=type(e), r=repr(e)))
            entry['EMAIL'] = 'N/A'
        finally:
            # Revert to the original directory after every save, so the
            # next iteration does not nest project folders ever deeper
            os.chdir(current_dir)
    return pfxlist
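The save-and-restore pattern above can be factored into a small context manager (a stdlib-only sketch; pywebcopy itself is not needed to demonstrate it):

```python
import os
from contextlib import contextmanager

@contextmanager
def preserve_cwd():
    # Record the working directory on entry and restore it on exit,
    # regardless of what the wrapped code (e.g. a save_webpage call)
    # does with os.chdir() internally.
    original = os.getcwd()
    try:
        yield original
    finally:
        os.chdir(original)
```

Each save_webpage call can then be wrapped in `with preserve_cwd():` inside the loop, which keeps the restore logic in one place even if the call raises.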
Thanks @kaavik !
Hello, I encountered an issue with the scraping tool. There is a call, os.chdir(norm_p), in the setup_paths method that changes the working directory. The issue appears when I send several scraping requests with the same config: it tries to write files to recursively nested folders. For example, if I scrape to an output directory with folder name 'test', the second request will try to scrape to /output/test/output/test/.. To fix it I wrote the following lines:

for n in range(3):
    os.chdir("..")
Please advise
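The chdir("..") loop is fragile because it assumes the library always descends exactly three levels. A more robust sketch records the working directory once and restores it after every request (save_fn below is a hypothetical stand-in for pywebcopy.save_webpage; only the chdir bookkeeping matters):

```python
import os

def save_many(urls, save_fn):
    # save_fn is a hypothetical stand-in for pywebcopy.save_webpage.
    original = os.getcwd()
    for url in urls:
        try:
            save_fn(url)
        finally:
            # Undo any directory change the library made, so the next
            # request starts from the same place instead of nesting
            # output/test/output/test/... ever deeper.
            os.chdir(original)
```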