Scrapes a single input URL, or a file listing domains, for broken links, valid email addresses, and valid social media links.
Checks URLs from a text file, scraping each for email addresses and social media links. It also checks common paths on the input domain, such as contact and team pages, and adds promising new URLs to the scrape queue. Broken links are not stored, but they are printed to STDOUT during runtime. All valid and unique email addresses and social media links are saved to a file as they are found, so data is preserved in the event of an error.
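A minimal sketch of what the extraction step might look like, assuming simple regex-based matching (the helper name `extract_contacts` and the patterns are illustrative, not the script's actual implementation):

```python
import re

# Assumed patterns: a basic email matcher and a matcher for a few common
# social media hosts. The real script may use different rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:twitter|facebook|linkedin|instagram)\.com/[\w./-]+"
)

def extract_contacts(html: str) -> tuple[set[str], set[str]]:
    """Return (emails, social_links) found in the given HTML text."""
    return set(EMAIL_RE.findall(html)), set(SOCIAL_RE.findall(html))

page = '<a href="mailto:team@example.com">Email</a> <a href="https://twitter.com/example">@example</a>'
emails, socials = extract_contacts(page)
print(emails)   # {'team@example.com'}
print(socials)  # {'https://twitter.com/example'}
```

Sets are used here so repeat mentions of the same address on one page are collapsed before anything is written to disk.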
Scrape for emails & social media links, checking for promising new links to scrape
$ ./domain_scraper.py [INPUT FILE] --scrape-n
$ ./domain_scraper.py [INPUT FILE] --scrape
$ ./domain_scraper.py --url [URL TO SCRAPE]
$ ./domain_scraper.py [INPUT FILE] --check
$ ./domain_scraper.py [INPUT FILE] --extract
The email and social media scraper writes data to a file during runtime:
./file_storage
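The crash-safe storage described above can be sketched as follows. This is an assumed pattern, not the script's actual code: each newly found item is appended to the file the moment it is discovered, so an interrupted run loses nothing already written.

```python
from pathlib import Path

def record_unique(path: Path, item: str, seen: set[str]) -> None:
    """Append item to path immediately, unless it was already recorded."""
    if item not in seen:
        seen.add(item)
        # Append mode: earlier results survive even if a later page errors out.
        with path.open("a") as fh:
            fh.write(item + "\n")
```

A usage example: calling `record_unique(storage, "team@example.com", seen)` twice writes the address only once, and the file is valid even if the process dies mid-run.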
How to clean up a .csv file
$ cat example_file_bad_format.txt
https://google.com/^Mhttps://cecinestpasun.site/^Mhttps://google.com/^Mhttp://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png
# replace ^M character after copying from .csv file
$ tr '\r' '\n' < example_file_bad_format.txt > example_file.txt
# remove repeat links
$ awk '!seen[$0]++' example_file.txt > example_file_no_repeats.txt
$ cat example_file_no_repeats.txt
https://google.com/
https://cecinestpasun.site/
http://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png
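The same cleanup can be done in Python if you prefer to avoid shell tools. This sketch mirrors the `tr`/`awk` pipeline above: normalize carriage returns to newlines, then drop duplicate links while preserving order (`clean_links` is an illustrative helper, not part of the script):

```python
def clean_links(raw: str) -> list[str]:
    """Replace \r with \n, split into lines, and dedupe preserving order."""
    lines = raw.replace("\r", "\n").splitlines()
    # dict.fromkeys keeps first occurrence of each line, like awk '!seen[$0]++'
    return list(dict.fromkeys(line for line in lines if line))

raw = "https://google.com/\rhttps://cecinestpasun.site/\rhttps://google.com/"
print(clean_links(raw))
# ['https://google.com/', 'https://cecinestpasun.site/']
```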
MIT License