Scrapes a single input URL, or a file listing domains, for broken links, valid email addresses, and valid social media links.
Checks URLs from a text file, scraping each for email addresses and social media links. It also checks common paths on the input domain, such as contact and team pages, and adds promising new URLs to the scrape queue. Broken links are not stored, but they are printed to STDOUT during runtime. All valid and unique email addresses and social media links are saved to a file as they are found, so data is preserved in the event of an error.
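A minimal sketch of what the extraction step might look like, assuming simple regex-based matching (the helper name `extract_contacts` and the patterns are illustrative, not the script's actual implementation):

```python
import re

# Assumed patterns: a basic email matcher and a matcher for a few common
# social media hosts. The real script may use different rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:twitter|facebook|linkedin|instagram)\.com/[\w./-]+"
)

def extract_contacts(html: str) -> tuple[set[str], set[str]]:
    """Return (emails, social_links) found in the given HTML text."""
    return set(EMAIL_RE.findall(html)), set(SOCIAL_RE.findall(html))

page = '<a href="mailto:team@example.com">Email</a> <a href="https://twitter.com/example">@example</a>'
emails, socials = extract_contacts(page)
print(emails)   # {'team@example.com'}
print(socials)  # {'https://twitter.com/example'}
```

Sets are used here so repeat mentions of the same address on one page are collapsed before anything is written to disk.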
Scrape for emails & social media links, checking for promising new links to scrape
$ ./domain_scraper.py [INPUT FILE] --scrape-n
$ ./domain_scraper.py [INPUT FILE] --scrape
$ ./domain_scraper.py --url [URL TO SCRAPE]
$ ./domain_scraper.py [INPUT FILE] --check
$ ./domain_scraper.py [INPUT FILE] --extract
The email and social media scraper writes data to a file during runtime:
./file_storage
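The crash-safe storage described above can be sketched as follows. This is an assumed pattern, not the script's actual code: each newly found item is appended to the file the moment it is discovered, so an interrupted run loses nothing already written.

```python
from pathlib import Path

def record_unique(path: Path, item: str, seen: set[str]) -> None:
    """Append item to path immediately, unless it was already recorded."""
    if item not in seen:
        seen.add(item)
        # Append mode: earlier results survive even if a later page errors out.
        with path.open("a") as fh:
            fh.write(item + "\n")
```

A usage example: calling `record_unique(storage, "team@example.com", seen)` twice writes the address only once, and the file is valid even if the process dies mid-run.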
How to clean up a .csv file
$ cat example_file_bad_format.txt
https://google.com/^Mhttps://cecinestpasun.site/^Mhttps://google.com/^Mhttp://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png
# replace ^M character after copying from .csv file
$ tr '\r' '\n' < example_file_bad_format.txt > example_file.txt
# remove repeat links
$ awk '!seen[$0]++' example_file.txt > example_file_no_repeats.txt
$ cat example_file_no_repeats.txt
https://google.com/
https://cecinestpasun.site/
http://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png
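The same cleanup can be done in Python if you prefer to avoid shell tools. This sketch mirrors the `tr`/`awk` pipeline above: normalize carriage returns to newlines, then drop duplicate links while preserving order (`clean_links` is an illustrative helper, not part of the script):

```python
def clean_links(raw: str) -> list[str]:
    """Replace \r with \n, split into lines, and dedupe preserving order."""
    lines = raw.replace("\r", "\n").splitlines()
    # dict.fromkeys keeps first occurrence of each line, like awk '!seen[$0]++'
    return list(dict.fromkeys(line for line in lines if line))

raw = "https://google.com/\rhttps://cecinestpasun.site/\rhttps://google.com/"
print(clean_links(raw))
# ['https://google.com/', 'https://cecinestpasun.site/']
```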
MIT License