jitacm / -30DaysDevChallenge-

Welcome to the 30DayDevChallenge repository! This repository is dedicated to a month-long coding challenge designed to help developers of all levels enhance their skills through daily coding tasks and projects.

Web Crawler for Email Extraction/30_days_of_Python #54

Closed CLOUDyy003 closed 2 weeks ago

CLOUDyy003 commented 2 weeks ago

Description

This Python script is a web crawler designed to extract email addresses from a target URL. It leverages the requests library to fetch web page content and the BeautifulSoup library to parse HTML. The script uses a breadth-first search approach to navigate through links found on the pages, collecting email addresses along the way.
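The email-matching step described above can be illustrated with a small standalone snippet (the pattern shown here is a common choice; the script's exact regex may differ):

```python
import re

# A typical email-matching pattern; illustrative, not the script's exact regex.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

html = "<p>Contact support@example.com or sales@example.org for help.</p>"
print(EMAIL_RE.findall(html))  # ['support@example.com', 'sales@example.org']
```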

Key features

  • Target URL Input: Users can specify the initial URL to start the crawling process.
  • Breadth-First Search: The script processes URLs in a breadth-first manner, ensuring a wide exploration of the website.
  • Email Extraction: Utilizes regular expressions to identify and collect email addresses from the fetched web pages.
  • Link Resolution: Handles relative and absolute URLs to ensure all links are correctly processed.
  • Duplicate Handling: Keeps track of processed URLs to avoid redundancy.
  • Interrupt Handling: Gracefully handles user interrupts (Ctrl+C) to stop the script.
  • Scalability: Processes up to 100 URLs by default, with easy modification for larger crawls.
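Putting these features together, a condensed sketch of such a crawler might look like the following. Function names, the regex, and the `lxml` parser choice are illustrative assumptions, not the script's actual code:

```python
import re
from collections import deque
from urllib.parse import urljoin

# Illustrative email pattern; the script's exact regex may differ.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def extract_emails(text):
    """Collect email-like strings from raw page text."""
    return set(EMAIL_RE.findall(text))

def crawl(start_url, max_urls=100):
    # Third-party imports kept local so extract_emails stays stdlib-only.
    import requests
    from bs4 import BeautifulSoup

    urls = deque([start_url])   # breadth-first queue of URLs to visit
    scraped = set()             # URLs already processed (duplicate handling)
    emails = set()
    try:
        while urls and len(scraped) < max_urls:
            url = urls.popleft()
            if url in scraped:
                continue
            scraped.add(url)
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue        # skip unreachable pages
            emails |= extract_emails(response.text)
            soup = BeautifulSoup(response.text, "lxml")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])  # relative -> absolute
                if link not in scraped and link not in urls:
                    urls.append(link)
    except KeyboardInterrupt:
        print("[-] Interrupted; returning emails found so far.")
    return emails

if __name__ == "__main__":
    for email in sorted(crawl(input("Enter target URL: "))):
        print(email)
```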

Required Libraries

Windows and Linux Installation Commands

Windows

To install the required libraries on Windows, open Command Prompt and run:

pip install beautifulsoup4 requests lxml

Linux

To install the required libraries on Linux, open Terminal and run:

pip install beautifulsoup4 requests lxml

Alternative Method

Open Command Prompt (Windows) or Terminal (Linux) and run the following command (make sure you are in the same directory as requirements.txt before executing it):

pip install -r requirements.txt

This method is convenient for managing dependencies in a project and ensures that all required libraries are installed consistently.
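Assuming only the three libraries listed above, the requirements.txt for this script could be as simple as:

```
beautifulsoup4
requests
lxml
```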

Step-by-step breakdown of how the script works

1. Initialize Variables:

  • The target URL is read from the user, a deque of URLs to visit is seeded with it, and collections are created to track processed URLs and extracted emails.

2. Crawl Loop:

  • URLs are taken from the front of the deque and processed one at a time, up to the 100-URL default limit, giving the breadth-first traversal.

3. Fetching and Parsing:

  • Each page is fetched with requests, email addresses are pulled from the response text with a regular expression, and the HTML is parsed with BeautifulSoup to find further links.

4. Link Handling:

  • For each link found, it resolves relative URLs to absolute URLs.
  • It checks if the link has already been processed or is in the queue to be processed. If not, it adds the link to the urls deque.
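The relative-to-absolute resolution can be done with the standard library's urljoin; for example (URLs here are illustrative):

```python
from urllib.parse import urljoin

base = "https://example.com/contact/"
# A relative href is resolved against the base; an absolute href passes through.
print(urljoin(base, "team.html"))          # https://example.com/contact/team.html
print(urljoin(base, "/about"))             # https://example.com/about
print(urljoin(base, "https://other.org"))  # https://other.org
```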

5. Output:

  • The script prints each extracted email address.

6. Error Handling:

  • The script catches KeyboardInterrupt so that pressing Ctrl+C stops the crawl gracefully.
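This interrupt handling amounts to wrapping the crawl loop in a try/except; a minimal sketch (the loop body and seed URLs are placeholders):

```python
from collections import deque

urls = deque(["https://example.com/a", "https://example.com/b"])  # hypothetical seeds
processed = []
try:
    while urls:
        processed.append(urls.popleft())
        # ... fetch the page and collect emails here ...
except KeyboardInterrupt:
    # Triggered by Ctrl+C; partial results gathered so far remain usable.
    print("[-] Interrupted by user.")
print(f"Processed {len(processed)} URL(s).")
```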
jitacm commented 2 weeks ago

go ahead !