ashmilhussain opened this issue 1 month ago
Hey there! I would love to try to solve this issue. Maybe I will try recursively scanning the page whenever the flag is true. Thank you!
Hey @YashBaviskar1,
Assigning this issue to you, happy coding!
If any assistance is required, meet our team here: https://join.slack.com/t/theailounge/shared_invite/zt-2ogkrruyf-FPOHuPr5hdqXl34bDWjHjw
@YashBaviskar1 any updates on this?
Hey there! Yes, I was looking through the codebase. From what I understand, I have to change url_reader.py. I was figuring out how to make def load(self) in class UrlReader(DocsReader) recursively call itself. Is this the right approach?
Sorry for the delay
@YashBaviskar1, Recursion might work, but it's risky for web scrapers due to infinite loops or stack overflow from nested or circular links.
I would also suggest using a data structure like a queue. Here's how it can work:
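Something along these lines, as a rough sketch only (requests and BeautifulSoup as the plugin already uses; max_pages is just an illustrative safety cap, not an existing parameter):

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    """Breadth-first crawl: a queue of pending URLs plus a visited set."""
    queue = deque([start_url])
    visited = set()
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.content, "html.parser")
        pages.append((url, soup))
        # Enqueue unseen child links; the visited set prevents circular loops.
        for a in soup.find_all("a", href=True):
            child = urljoin(url, a["href"])
            if child not in visited:
                queue.append(child)
    return pages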
Yup, thank you! I will try what you have suggested; it is pretty intuitive.
@ashmilhussain @agberoz, Hello there! I tried to amend the code as you suggested, and this is what I have come up with:

In class Website I added an is_scan_child flag in __init__ and changed fetch_data accordingly:
class Website:
    """
    Website class for interacting with website data.
    """

    def __init__(self, website_url: str, is_scan_child: bool = False):
        self.connection = {}
        self.is_scan_child = is_scan_child  # default is False
        self.params = {
            'url': website_url,
        }

    def fetch_data(self):
        base_reader = UrlReader({
            "type": "url",
            "path": [self.params.get('url')],
            "is_scan_child": self.is_scan_child
        })
        data = base_reader.load()
        return data
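For reference, a hypothetical usage sketch of the flag (the URL here is only an example):

# Scan only the homepage (default) vs. include subpages as well
single_page = Website("https://example.com").fetch_data()
with_children = Website("https://example.com", is_scan_child=True).fetch_data()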
Now in class UrlReader I did as you instructed: I initialized a url_queue queue and a visited_url set, and used find_all to collect all the href links, appending them to the queue if is_scan_child is true. The loop processes the entire queue, while the set keeps track of visited links.
class UrlReader:
    def __init__(self, source):
        self.source = source
        self.visited_url = set()

    def load(self):
        out = []
        is_scan_child = self.source.get("is_scan_child", False)
        url_queue = []  # defining a queue to capture sub_urls
        if "path" in self.source:
            urls = self.source["path"]
            for url in urls:
                url_queue.append(url)
            while url_queue:
                url = url_queue.pop(0)
                if url in self.visited_url:
                    continue
                self.visited_url.add(url)
                try:
                    response = requests.get(url)
                    if response.status_code == 200:
                        soup = BeautifulSoup(response.content, 'html.parser')
                        if is_scan_child:
                            for a in soup.find_all('a', href=True):
                                absolute_url = urljoin(url, a['href'])
                                if absolute_url not in self.visited_url:
                                    url_queue.append(absolute_url)
                            # print(f"Sub URLs are: {url_queue}")
                        tag = soup.body
                        text = ''.join(list(tag.strings)[:-1])
                        metadata = {
                            "path": url
                        }
                        out.append({"content": str(text), "metadata": metadata})
                    else:
                        logger.critical(f"Failed to retrieve content, status code: {response.status_code}")
                except Exception as e:
                    logger.error(e)
        return out
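To exercise the reader directly, a hedged sketch (it assumes requests, BeautifulSoup, urljoin, and the module-level logger are already imported in url_reader.py):

reader = UrlReader({
    "type": "url",
    "path": ["https://example.com"],
    "is_scan_child": True,
})
documents = reader.load()
for doc in documents:
    print(doc["metadata"]["path"], len(doc["content"]))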
Since I thought we are mostly dealing with subpages, I added this line (note the new import too):

from urllib.parse import urljoin

absolute_url = urljoin(url, a['href'])

I do not know if it is necessary or not.
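As a sanity check, this is how urljoin behaves with relative versus absolute hrefs (example URLs only):

from urllib.parse import urljoin

urljoin("https://example.com/docs/", "getting-started")
# -> 'https://example.com/docs/getting-started' (relative hrefs are resolved against the page URL)

urljoin("https://example.com/docs/", "https://other.org/page")
# -> 'https://other.org/page' (absolute hrefs are returned unchanged)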
@agberoz check this
@YashBaviskar1 It's looking good!
A suggestion: When adding child URLs to the queue, please ensure they start with the base URL to prevent the crawler from navigating to other domains.
@agberoz
Sure, we can do that using .netloc to extract and compare the domains of the URLs:

from urllib.parse import urlparse

# Base domain of the FIRST (base) URL
base_domain = urlparse(urls[0]).netloc

and then add this condition when appending the child URL:
if is_scan_child:
    for a in soup.find_all('a', href=True):
        absolute_url = urljoin(url, a['href'])
        if absolute_url not in self.visited_url and urlparse(absolute_url).netloc == base_domain:
            url_queue.append(absolute_url)
If the absolute_url is from a different domain, it will not be added to the url_queue, so the crawler stays on the base domain.
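For illustration, this is the kind of filtering the netloc check performs (example URLs only):

from urllib.parse import urlparse

base_domain = urlparse("https://example.com/").netloc  # 'example.com'

urlparse("https://example.com/docs/page").netloc == base_domain  # True  -> enqueued
urlparse("https://sub.example.com/page").netloc == base_domain   # False -> skipped (subdomains count as different)
urlparse("https://other.org/page").netloc == base_domain         # False -> skipped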
Ok, I will send a PR soon if everything works! Thank you.
The website scraper plugin currently scans only the homepage. Modify the plugin to scrape either a single page or subpages based on a flag. This flag should control whether the crawler scans just the homepage or includes subpages as well, providing more flexibility in how pages are crawled.