soumilsuri / SENTINEL

(Summarizing and Extracting News Information System): Implies a vigilant guardian that keeps watch over news.
2 stars 1 forks source link

find a better approach for function extract_news_links(self, soup): in googleNewsExtractor.py #1

Open soumilsuri opened 2 months ago

soumilsuri commented 2 months ago
'''
Google News might change its html layout in future. So we might need to change this function(extract_news_links) in future accordingly.To do this inspect the google news page and navigate to the news link. copy the class name and paste it here.
'''
def extract_news_links(self, soup):
    links = set()
    for a_tag in soup.find_all('a', href=True):
        if 'WwrzSb' in a_tag.get('class', []):  # Check if class contains 'WwrzSb'(change it in future if the code stops working)
            href = a_tag['href']
            if href.startswith('./read'):
                full_url = "https://news.google.com" + href[1:]
                links.add(full_url)
    return list(links)
UTSAVS26 commented 2 months ago

@soumilsuri assign this issue to me.