sirocco-ventures / raggenie

RAGGENIE: An open-source, low-code platform to build custom Retrieval-Augmented Generation (RAG) Copilots with your own data. Simplify AI development with ease!
https://www.raggenie.com
MIT License
87 stars 42 forks

Website Scraper Plugin Only Scanning Homepage #17

Open ashmilhussain opened 1 month ago

ashmilhussain commented 1 month ago

The website scraper plugin currently scans only the homepage. Modify the plugin to scrape either a single page or its subpages, based on a flag. This flag should control whether the crawler scans just the homepage or includes subpages as well, giving more flexibility in how pages are crawled.

Use the flag is_scan_child:
  'true'  : scrape both the main page and its subpages
  'false' : scrape only the main page
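For illustration, the intended behavior could look roughly like this standalone sketch (using requests and BeautifulSoup; the function and variable names here are illustrative, not the plugin's actual code):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape(base_url, is_scan_child=False):
    # Always fetch the main page
    pages = {base_url: requests.get(base_url, timeout=10).text}
    if is_scan_child:
        # Optionally fetch the pages linked from the main page as well
        soup = BeautifulSoup(pages[base_url], 'html.parser')
        for a in soup.find_all('a', href=True):
            child_url = urljoin(base_url, a['href'])  # resolve relative links
            if child_url not in pages:                # skip duplicates
                pages[child_url] = requests.get(child_url, timeout=10).text
    return pages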

YashBaviskar1 commented 1 month ago

Hey there! I would love to try to solve this issue. Maybe I will try to recursively scan the page whenever the flag is true. Thank you!

ashmilhussain commented 1 month ago

Hey @YashBaviskar1 ,

Assigning this issue to you, happy coding

If you need any assistance, meet our team here: https://join.slack.com/t/theailounge/shared_invite/zt-2ogkrruyf-FPOHuPr5hdqXl34bDWjHjw

agberoz commented 1 month ago

@YashBaviskar1 any updates on this?

YashBaviskar1 commented 1 month ago

Hey there! Yes, I was looking through the codebase. From what I understand, I have to change url_reader.py. I was figuring out how to make def load(self) in class UrlReader(DocsReader) call itself recursively. Is this the right approach?

Sorry for the delay

agberoz commented 1 month ago

@YashBaviskar1, Recursion might work, but it's risky for web scrapers: deeply nested or circular links can cause infinite loops or stack overflows.

I would also suggest using a data structure like a queue. Here's how it can work (a short sketch follows the list below):

  1. Start with the base URL.
  2. Add child URLs to the queue as you process each page.
  3. Ensure that each URL is only visited once (you can use a set to track visited URLs).
  4. Process each URL iteratively from the queue.
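
A rough standalone sketch of that queue-based approach (using requests, BeautifulSoup, and collections.deque; names like crawl and max_pages are illustrative, not taken from url_reader.py):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from collections import deque

def crawl(base_url, max_pages=50):
    visited = set()                 # URLs already processed (step 3)
    url_queue = deque([base_url])   # start with the base URL (step 1)
    pages = {}
    while url_queue and len(pages) < max_pages:
        url = url_queue.popleft()   # process URLs iteratively (step 4)
        if url in visited:
            continue
        visited.add(url)
        html = requests.get(url, timeout=10).text
        pages[url] = html
        soup = BeautifulSoup(html, 'html.parser')
        for a in soup.find_all('a', href=True):
            url_queue.append(urljoin(url, a['href']))  # add child URLs (step 2)
    return pages
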
YashBaviskar1 commented 1 month ago

Yup, thank you! I will try to do what you have said; it is pretty intuitive.

YashBaviskar1 commented 1 month ago

@ashmilhussain @agberoz, hello there! So I tried to amend the code based on what you had said, and this is what I have come up with:

Since I thought we are mostly dealing with subpages, I added this line (note the new import too):

from urllib.parse import urljoin

absolute_url = urljoin(url, a['href'])

I do not know if it is necessary or not.
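
For reference, urljoin resolves relative hrefs against the current page URL, while absolute URLs pass through unchanged (the example URLs below are just illustrative):

from urllib.parse import urljoin

urljoin('https://www.raggenie.com/docs/', 'plugins.html')  # 'https://www.raggenie.com/docs/plugins.html'
urljoin('https://www.raggenie.com/docs/', '/about')        # 'https://www.raggenie.com/about'
urljoin('https://www.raggenie.com/docs/', 'https://github.com/sirocco-ventures/raggenie')  # unchanged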

ashmilhussain commented 1 month ago

@agberoz check this

agberoz commented 1 month ago

@YashBaviskar1 It's looking good!

A suggestion: When adding child URLs to the queue, please ensure they start with the base URL to prevent the crawler from navigating to other domains.

YashBaviskar1 commented 1 month ago

@agberoz Sure, we can do that using .netloc to extract and compare the domains of the URLs:

from urllib.parse import urlparse

# Base domain of the first (base) URL
base_domain = urlparse(urls[0]).netloc

and then add this condition when appending the child URLs:

if is_scan_child:
    # Collect every link on the current page
    for a in soup.find_all('a', href=True):
        # Resolve relative hrefs against the current page URL
        absolute_url = urljoin(url, a['href'])
        # Only queue URLs that are new and stay on the base domain
        if absolute_url not in self.visited_url and urlparse(absolute_url).netloc == base_domain:
            url_queue.append(absolute_url)

If absolute_url is on a different domain, it will not be added to url_queue, and so the crawler will stay on the base domain.
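
For example, assuming the base URL is https://www.raggenie.com, the check would behave like this (the URLs below are just illustrative):

from urllib.parse import urlparse

base_domain = urlparse('https://www.raggenie.com').netloc   # 'www.raggenie.com'

# Same domain: the URL would be queued
urlparse('https://www.raggenie.com/docs/plugins').netloc == base_domain   # True

# Different domain: the URL is skipped
urlparse('https://github.com/sirocco-ventures/raggenie').netloc == base_domain   # False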

YashBaviskar1 commented 1 month ago

OK, I will send a PR soon if everything works! Thank you.