johncoleman83 / domain_scraper

Scrapes domains for broken links, emails & social media links (uses beautifulsoup)
MIT License
beautifulsoup4 email-marketing opensource python3 scraping-websites social-media

domain_scraper

Scrapes domains from one input URL or from a file list of domains for broken links, valid emails and valid social media links.
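
To make the email and social media extraction concrete, here is a minimal sketch of the idea. It is not the code in domain_scraper.py: the project uses beautifulsoup, while this sketch uses only the standard library so it runs without dependencies. The names `SOCIAL_HOSTS`, `LinkCollector` and `extract_contacts` are hypothetical, and the email pattern is a deliberately simple assumption.

```python
import re
from html.parser import HTMLParser

# Hypothetical list of social media hosts; the project's own list may differ.
SOCIAL_HOSTS = ("facebook.com", "twitter.com", "instagram.com", "linkedin.com")

# Simplified email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class LinkCollector(HTMLParser):
    """Collects every href value found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_contacts(html):
    """Return (emails, social_links) found in an HTML string, as sorted lists."""
    parser = LinkCollector()
    parser.feed(html)
    emails = set(EMAIL_RE.findall(html))
    social = {h for h in parser.hrefs if any(host in h for host in SOCIAL_HOSTS)}
    return sorted(emails), sorted(social)
```

In a real run, the HTML string would come from fetching each domain in the input file; that step is left out here so the sketch needs no network access.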

Usage

$ ./domain_scraper.py [INPUT FILE] --scrape-n
$ ./domain_scraper.py [INPUT FILE] --scrape
$ ./domain_scraper.py --url [URL TO SCRAPE]
$ ./domain_scraper.py [INPUT FILE] --check
$ ./domain_scraper.py [INPUT FILE] --extract

Data storage

The email and social media scraper writes its results to a file while it runs, rather than only at the end.
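
One way to write results as they are found is to append one row per finding, as in the sketch below. This is an illustration only: the function name `append_result`, the CSV layout and the column order are assumptions, and the project's actual output format may differ.

```python
import csv


def append_result(path, domain, kind, value):
    """Append one finding to a CSV file.

    Opening in append mode and closing after each row means that results
    already written survive if the scraper is interrupted midway.
    """
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([domain, kind, value])
```

Reopening the file for every row is slower than keeping one handle open, but it keeps the sketch short and leaves the file in a consistent state after each write.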

Example file & file cleanup

How to clean up a list of URLs copied from a .csv file:

$ cat example_file_bad_format.txt
https://google.com/^Mhttps://cecinestpasun.site/^Mhttps://google.com/^Mhttp://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png

# replace each ^M (carriage return) character with a newline after copying from a .csv file
$ tr '\r' '\n' < example_file_bad_format.txt > example_file.txt

# remove duplicate links, keeping the first occurrence of each
$ awk '!seen[$0]++' example_file.txt > example_file_no_repeats.txt

$ cat example_file_no_repeats.txt
https://google.com/
https://cecinestpasun.site/
http://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png
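
The same cleanup can be sketched in Python for cases where `tr` and `awk` are not available. The function name `clean_url_list` is hypothetical and is not part of domain_scraper.py. Unlike the shell commands above, this version also strips surrounding whitespace and drops blank lines.

```python
def clean_url_list(raw):
    """Split text on any line ending and return unique URLs in first-seen order."""
    seen = set()
    urls = []
    # str.splitlines() treats CR, LF and CRLF as line boundaries.
    for line in raw.splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    # newline="" keeps carriage returns intact so splitlines() can see them.
    with open("example_file_bad_format.txt", newline="", encoding="utf-8") as f:
        cleaned = clean_url_list(f.read())
    with open("example_file_no_repeats.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(cleaned) + "\n")
```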

Author

Contributors

License

MIT License