sirocco-ventures / raggenie

RAGGENIE: An open-source, low-code platform to build custom Retrieval-Augmented Generation (RAG) Copilots with your own data. Simplify AI development with ease!
https://www.raggenie.com
MIT License
87 stars 42 forks

Website Scraper Plugin Only Scanning Homepage #17

Open ashmilhussain opened 1 month ago

ashmilhussain commented 1 month ago

The website scraper plugin currently scans only the homepage. Modify the plugin to scrape either a single page or its subpages, based on a flag. This flag should control whether the crawler scans just the homepage or includes subpages as well, giving more flexibility in how pages are crawled.

Use the flag is_scan_child:
  'true'  : scrape both the main page and its subpages
  'false' : scrape only the main page
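For illustration, the intended behavior could look roughly like this standalone sketch (using requests and BeautifulSoup; the function and variable names here are illustrative, not the plugin's actual code):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape(base_url, is_scan_child=False):
    # Always fetch the main page
    pages = {base_url: requests.get(base_url, timeout=10).text}
    if is_scan_child:
        # Optionally fetch the pages linked from the main page as well
        soup = BeautifulSoup(pages[base_url], 'html.parser')
        for a in soup.find_all('a', href=True):
            child_url = urljoin(base_url, a['href'])  # resolve relative links
            if child_url not in pages:                # skip duplicates
                pages[child_url] = requests.get(child_url, timeout=10).text
    return pages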

YashBaviskar1 commented 1 month ago

Hey there! I would love to try to solve this issue. Maybe I will try to recursively scan the page whenever the flag is true. Thank you!

ashmilhussain commented 1 month ago

Hey @YashBaviskar1 ,

Assigning this issue to you, happy coding

If you need any assistance, meet our team here: https://join.slack.com/t/theailounge/shared_invite/zt-2ogkrruyf-FPOHuPr5hdqXl34bDWjHjw

agberoz commented 1 month ago

@YashBaviskar1 any updates on this?

YashBaviskar1 commented 1 month ago

Hey there! Yes, I was looking through the codebase. From what I understand, I have to change url_reader.py. I was figuring out how to make def load(self) in class UrlReader(DocsReader) call itself recursively. Is this the right approach?

Sorry for the delay

agberoz commented 1 month ago

@YashBaviskar1, Recursion might work, but it's risky for web scrapers: deeply nested or circular links can cause infinite loops or stack overflows.

I would also suggest using a data structure like a queue. Here's how it can work (a short sketch follows the list below):

  1. Start with the base URL.
  2. Add child URLs to the queue as you process each page.
  3. Ensure that each URL is only visited once (you can use a set to track visited URLs).
  4. Process each URL iteratively from the queue.
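
A rough standalone sketch of that queue-based approach (using requests, BeautifulSoup, and collections.deque; names like crawl and max_pages are illustrative, not taken from url_reader.py):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from collections import deque

def crawl(base_url, max_pages=50):
    visited = set()                 # URLs already processed (step 3)
    url_queue = deque([base_url])   # start with the base URL (step 1)
    pages = {}
    while url_queue and len(pages) < max_pages:
        url = url_queue.popleft()   # process URLs iteratively (step 4)
        if url in visited:
            continue
        visited.add(url)
        html = requests.get(url, timeout=10).text
        pages[url] = html
        soup = BeautifulSoup(html, 'html.parser')
        for a in soup.find_all('a', href=True):
            url_queue.append(urljoin(url, a['href']))  # add child URLs (step 2)
    return pages
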
YashBaviskar1 commented 1 month ago

Yup, thank you! I will try to do what you have said; it is pretty intuitive.

YashBaviskar1 commented 1 month ago

@ashmilhussain @agberoz, hello there! So I tried to amend the code based on what you had said, and this is what I have come up with:

Since I thought we are mostly dealing with subpages, I added this line (note the new import too):

from urllib.parse import urljoin

absolute_url = urljoin(url, a['href'])

I do not know if it is necessary or not.
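
For reference, urljoin resolves relative hrefs against the current page URL, while absolute URLs pass through unchanged (the example URLs below are just illustrative):

from urllib.parse import urljoin

urljoin('https://www.raggenie.com/docs/', 'plugins.html')  # 'https://www.raggenie.com/docs/plugins.html'
urljoin('https://www.raggenie.com/docs/', '/about')        # 'https://www.raggenie.com/about'
urljoin('https://www.raggenie.com/docs/', 'https://github.com/sirocco-ventures/raggenie')  # unchanged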

ashmilhussain commented 1 month ago

@agberoz check this

agberoz commented 1 month ago

@YashBaviskar1 It's looking good!

A suggestion: When adding child URLs to the queue, please ensure they start with the base URL to prevent the crawler from navigating to other domains.

YashBaviskar1 commented 1 month ago

@agberoz Sure, we can do that using .netloc to extract and compare the domains of the URLs:

from urllib.parse import urlparse

# Base domain of the first (base) URL
base_domain = urlparse(urls[0]).netloc

and then add this condition when appending the child URLs:

if is_scan_child:
    # Collect every link on the current page
    for a in soup.find_all('a', href=True):
        # Resolve relative hrefs against the current page URL
        absolute_url = urljoin(url, a['href'])
        # Only queue URLs that are new and stay on the base domain
        if absolute_url not in self.visited_url and urlparse(absolute_url).netloc == base_domain:
            url_queue.append(absolute_url)

If absolute_url is on a different domain, it will not be added to url_queue, and so the crawler will stay on the base domain.
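
For example, assuming the base URL is https://www.raggenie.com, the check would behave like this (the URLs below are just illustrative):

from urllib.parse import urlparse

base_domain = urlparse('https://www.raggenie.com').netloc   # 'www.raggenie.com'

# Same domain: the URL would be queued
urlparse('https://www.raggenie.com/docs/plugins').netloc == base_domain   # True

# Different domain: the URL is skipped
urlparse('https://github.com/sirocco-ventures/raggenie').netloc == base_domain   # False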

YashBaviskar1 commented 1 month ago

OK, I will send a PR soon if everything works! Thank you.