s-rah / onionscan

OnionScan is a free and open source tool for investigating the Dark Web.
https://twitter.com/OnionScan

Is there any alternative for 'snapshot'? #151

Open powerfulTrouser opened 6 years ago

powerfulTrouser commented 6 years ago

I'm a student, and I'm trying to follow this tutorial

http://www.automatingosint.com/blog/2016/09/dark-web-osint-part-four-using-scikit-learn-to-find-hidden-service-clones/

to analyze the dark web with machine learning. However, I found that 'snapshot' is no longer available. I then found an issue saying that this functionality had been moved to dat_0. My dat_0 file is about 10 GB. I tried to parse it with Python and Kaitai Struct, but failed (attachments: onions.py.txt, parsedat.py.txt). Is there any way to at least reproduce the analysis from the tutorial, either with an older version of OnionScan or with some tutorial on achieving the same goal with the current OnionScan?

Thanks!

powerfulTrouser commented 6 years ago

In the end, I used Python to split dat_0 into a large number of JSON files:

```python
# coding: utf-8
import json

knife = '{"Page":{"Status":'  # marker that starts each record
out_dir = '/Home/parse dat-1/'
i = 0  # counter for fallback file names


def is_json(myjson):
    """Return myjson, or a trimmed copy of it, if it parses as JSON; else None."""
    try:
        json.loads(myjson)
        return myjson
    except ValueError:
        # Retry with everything from the second-to-last '}' onward
        # replaced by a single '}'.
        trimmed = myjson.rsplit('}', 2)[0] + '}'
        try:
            json.loads(trimmed)
        except ValueError as e:
            print(e)
            print(myjson)
            return None
        print(trimmed)
        return trimmed


# errors='replace' in case dat_0 contains bytes that are not valid UTF-8.
with open('/Home/dat_0.json', encoding='utf-8', errors='replace') as src:
    for line in src:
        for frag in line.split(knife):  # was s.split(knife); s was undefined
            if not frag:
                continue
            # Drop anything after the last '}' and restore the marker.
            frag = knife + frag.rsplit('}', 1)[0] + '}'
            valid = is_json(frag)
            if valid is None:
                continue
            result_json = json.loads(valid)
            if result_json['Page']['Status'] in (403, 404):
                continue
            print('next')
            # Strip 'http://' and the final character, then replace '/'
            # so the URL can be used as a file name.
            name = result_json['URL'][7:-1].replace('/', 'slash')
            path = out_dir + name + '.json'
            try:
                out = open(path, 'w', encoding='utf-8')
            except IOError as e:
                # Fall back to a numbered name if the URL-based one fails.
                print(e)
                path = out_dir + 'problem' + str(i) + '.json'
                i += 1
                out = open(path, 'w', encoding='utf-8')
            with out:
                out.write(valid)
```

The script skips pages whose status is 403 or 404, so no JSON file is generated for them. I split the file on '{"Page":{"Status":'; I wonder whether there is a better string to cut on. It is not a pretty solution, but it works.
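On the question of a better cut string: one possible alternative is not to cut on a fixed marker at all, but to let the standard library's `json.JSONDecoder.raw_decode` find where each object ends. The sketch below is only an illustration under assumptions: it presumes the records sit in the data as uninterrupted JSON text, it works on a string already in memory (a 10 GB file would need to be fed in chunks), and `iter_json_objects` is a hypothetical helper name. The `URL` and `Page`/`Status` fields are the ones used in the script above.

```python
import json


def iter_json_objects(text):
    """Yield each JSON object that can be decoded starting at some '{' in text."""
    decoder = json.JSONDecoder()
    pos = text.find('{')
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and returns
            # the value together with the index just past its end.
            obj, end = decoder.raw_decode(text, pos)
        except ValueError:
            # No valid JSON starts here; move on to the next '{'.
            pos = text.find('{', pos + 1)
            continue
        yield obj
        pos = text.find('{', end)


# Made-up sample: two records separated by non-JSON bytes, then a broken tail.
sample = (
    'junk{"URL": "http://a.onion/", "Page": {"Status": 200}}'
    '\x00\x01{"URL": "http://b.onion/", "Page": {"Status": 404}}{not json'
)

# Keep only records whose status is neither 403 nor 404, as in the script above.
kept = [
    obj for obj in iter_json_objects(sample)
    if obj.get('Page', {}).get('Status') not in (403, 404)
]
for obj in kept:
    print(obj['URL'])
```

Because the decoder decides where an object ends, this approach does not depend on `Page` being the first key of a record, and it avoids the `rsplit('}')` repair step. Whether it is fast enough on a file of this size would have to be measured.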