This PR adds support for storing Google News feeds in a MongoDB database using PyMongo. The key changes are detailed below:
MongoDB Integration:
Introduced MongoDB client initialization using pymongo.MongoClient, with the connection string and database name loaded from environment variables (DB_STRING and DB_NAME).
Added error handling for MongoDB connection errors.
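For reviewers, here is a minimal sketch of what the initialization described above might look like. It is not the PR's actual code: the helper names load_mongo_settings and connect, the five-second timeout, and the choice to return None on failure are all assumptions made for illustration.

```python
import logging
import os

logger = logging.getLogger(__name__)


def load_mongo_settings(env=os.environ):
    """Read the connection string and database name from the environment."""
    db_string = env.get("DB_STRING")
    db_name = env.get("DB_NAME")
    if not db_string or not db_name:
        raise ValueError("DB_STRING and DB_NAME must both be set")
    return db_string, db_name


def connect(env=os.environ):
    """Return a pymongo Database, or None if the server cannot be reached."""
    # Imported here so the settings helper above works without pymongo installed.
    from pymongo import MongoClient
    from pymongo.errors import ConnectionFailure

    db_string, db_name = load_mongo_settings(env)
    # Hypothetical timeout; the PR does not state one.
    client = MongoClient(db_string, serverSelectionTimeoutMS=5000)
    try:
        # MongoClient connects lazily, so a ping forces a round trip to the server.
        client.admin.command("ping")
    except ConnectionFailure as exc:
        logger.error("MongoDB connection failed: %s", exc)
        return None
    return client[db_name]
```

Pinging right after construction surfaces an unreachable server at startup rather than on the first write.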
Upsert Capability:
Added a new upsert parameter to the GNews class; when set to True, fetched articles are written to a specified MongoDB collection.
Implemented the upsert_news method, which inserts new articles and skips duplicates, identified by article title.
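The insert-or-skip behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the method from this PR: the function signature is hypothetical, and InMemoryCollection is a made-up stand-in that mimics only the find_one and insert_one calls of a pymongo collection so the example runs without a database.

```python
import logging

logger = logging.getLogger(__name__)


class InMemoryCollection:
    """Hypothetical stand-in exposing the two pymongo Collection methods used below."""

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        # Return the first stored document matching every key in the query.
        for doc in self.docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self.docs.append(doc)


def upsert_news(collection, articles):
    """Insert articles whose title is not stored yet; return how many were added."""
    inserted = 0
    for article in articles:
        if collection.find_one({"title": article["title"]}) is not None:
            logger.info("Skipping duplicate article: %s", article["title"])
            continue
        collection.insert_one(article)
        inserted += 1
    return inserted


collection = InMemoryCollection()
articles = [{"title": "A"}, {"title": "B"}, {"title": "A"}]
added = upsert_news(collection, articles)  # the second "A" is skipped
```

A find followed by an insert is not atomic. If concurrent writers are a concern, a unique index on title, or update_one with upsert=True and $setOnInsert, would be worth considering.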
Date Range Support:
Enhanced the _ceid method to support date range queries with start_date and end_date.
Added properties and setters for start_date and end_date with appropriate validation and warning messages.
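As a rough illustration of the validation described above, the sketch below shows one way such properties could behave. The DateRange class, the accepted input types, the warning text, and the after:/before: query fragment are assumptions for illustration only; the PR's _ceid method may encode the range differently.

```python
import datetime
import warnings


class DateRange:
    """Hypothetical holder for start_date and end_date with validation."""

    def __init__(self):
        self._start_date = None
        self._end_date = None

    @staticmethod
    def _coerce(value):
        # Accept None, a (year, month, day) tuple, a datetime, or a date.
        if value is None:
            return None
        if isinstance(value, tuple):
            return datetime.date(*value)
        if isinstance(value, datetime.datetime):
            return value.date()
        if isinstance(value, datetime.date):
            return value
        raise TypeError("expected a (year, month, day) tuple, date, or datetime")

    @staticmethod
    def _warn_if_invalid(start, end):
        if start is not None and end is not None and end <= start:
            warnings.warn(
                "end_date should be at least one day after start_date",
                stacklevel=3,
            )

    @property
    def start_date(self):
        return self._start_date

    @start_date.setter
    def start_date(self, value):
        value = self._coerce(value)
        self._warn_if_invalid(value, self._end_date)
        self._start_date = value  # stored even when a warning was issued

    @property
    def end_date(self):
        return self._end_date

    @end_date.setter
    def end_date(self, value):
        value = self._coerce(value)
        self._warn_if_invalid(self._start_date, value)
        self._end_date = value  # stored even when a warning was issued

    def query_suffix(self):
        # Assumed encoding: search operators appended to the query string.
        parts = []
        if self._start_date is not None:
            parts.append(f"after:{self._start_date.isoformat()}")
        if self._end_date is not None:
            parts.append(f"before:{self._end_date.isoformat()}")
        return " ".join(parts)
```

Warning instead of raising keeps a misconfigured range from aborting a fetch, at the cost of possibly empty results.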
Query Methods:
Modified get_news, get_top_news, get_news_by_topic, get_news_by_location, and get_news_by_site methods to support upserting articles into MongoDB when the upsert flag is set and collection_name is provided.
Utility and Helper Methods:
Added _clean method to clean HTML content from descriptions.
Modified _process method to filter out excluded websites and prepare article data for MongoDB insertion.
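The two helpers above might look something like this sketch. It is an assumption-laden illustration: the PR lists beautifulsoup4 as a dependency, so the real _clean presumably uses it, whereas this version uses the standard library's html.parser to stay dependency-free. The entry field names (title, description, url, publisher_url) and the function names are also hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def clean_html(raw_html):
    """Strip tags and decode entities from a feed description."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def process_entry(entry, exclude_websites=()):
    """Return a MongoDB-ready dict, or None if the publisher is excluded."""
    host = urlparse(entry["publisher_url"]).hostname or ""
    # Match the excluded domain itself and any of its subdomains.
    if any(host == site or host.endswith("." + site) for site in exclude_websites):
        return None
    return {
        "title": entry["title"],
        "description": clean_html(entry["description"]),
        "url": entry["url"],
        "publisher": host,
    }
```

Returning None for excluded publishers lets the caller filter with a simple comprehension before handing the batch to the upsert step.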
Logging and Warnings:
Implemented logging for various events such as MongoDB connection errors, invalid topics or locations, and skipping of duplicate articles.
Added warnings for date range issues to guide users on how to properly set start_date and end_date.
How to Test:
Environment Setup:
Ensure DB_STRING and DB_NAME environment variables are set for MongoDB connection.
Install the required dependencies, including pymongo, feedparser, beautifulsoup4, and python-dotenv.
Initialization:
Instantiate the GNews class with upsert=True:
gnews = GNews(upsert=True)
Fetching and Upserting News:
Use methods like get_news, get_top_news, get_news_by_topic, get_news_by_location, or get_news_by_site with a collection_name parameter to fetch and upsert news articles.
Database Verification:
Check the MongoDB collection named by collection_name to confirm that news articles are inserted correctly and duplicates are skipped.
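One way to check for duplicates during verification is sketched below. The helper name duplicate_titles is hypothetical, and it assumes each stored document has a title field, as the title-based deduplication implies. It takes any iterable of documents, so it can be tried on plain dicts without a database.

```python
from collections import Counter


def duplicate_titles(documents):
    """Return the sorted titles that appear more than once."""
    counts = Counter(doc["title"] for doc in documents)
    return sorted(title for title, count in counts.items() if count > 1)


# Against a live database, the documents could come from pymongo, e.g.:
#   documents = db[collection_name].find({}, {"title": 1})
# where db is the Database object for DB_NAME (names assumed for illustration).
sample = [{"title": "A"}, {"title": "B"}, {"title": "A"}]
repeated = duplicate_titles(sample)
```

An empty result after running the same fetch twice suggests duplicates are being skipped as intended.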
Additional Notes:
Ensure that the newspaper3k library is installed if using the get_full_article method for fetching and parsing full articles from URLs.
MongoDB connection errors, invalid topics or locations, and skipped duplicates are logged to help with debugging and monitoring.
Merging this PR lets the GNews class store fetched news feeds in a MongoDB database, giving users persistent storage for news data.