The NewsLookout web scraping application gathers and classifies financial events from public news websites and market data for India. It is a scalable, modular and configurable multi-threaded Python console application. The application is readily extended by adding custom modules for web scraping via its 'plugin' architecture. Plugins can be added for a variety of tasks, including scraping additional news sources, performing custom data pre-processing, and running NLP-based news text analytics such as entity recognition, negative event classification, economy trends, and industry trends.
There are already a number of Python libraries available for web scraping, so why should you consider using this application for scraping news? The reason is that it has been built specifically for sourcing financial news events and has several useful features. A few notable ones are:
Install the dependencies using pip:
pip install -r requirements.txt
Install the application via pip:
pip install newslookout
Caution: As a security best practice, it is strongly recommended to run the application under its own separate operating-system user ID without any special privileges.
Next, create and configure separate locations for:
/var/cache/newslookout
/var/log/newslookout/newslookout.log
/var/run/newslookout.pid
Set these parameters in the configuration file.
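For example, on a Linux system the dedicated user and these locations could be set up roughly as follows; the user name and commands are only illustrative and should match the values you set in the configuration file:

# create an unprivileged service account (the user name is an assumption)
sudo useradd --system --create-home newslookout
# create the cache and log locations listed above and hand them to that user
sudo mkdir -p /var/cache/newslookout /var/log/newslookout
sudo chown -R newslookout:newslookout /var/cache/newslookout /var/log/newslookout
# the PID file location (/var/run/newslookout.pid) must also be writable by this user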
Download the spaCy model using this command:
python -m spacy download en_core_web_lg
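To confirm the model is visible to the Python environment that will run the application, you can try loading it; this is only a quick check and not part of the application itself:

import spacy

# raises OSError if en_core_web_lg has not been downloaded into this environment
nlp = spacy.load("en_core_web_lg")
print(nlp.meta["name"], nlp.meta["version"])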
For NLTK, refer to the NLTK website for instructions on downloading the data - https://www.nltk.org/data.html. Specifically, the following datasets need to be downloaded: punkt, maxent_treebank_pos_tagger, reuters and universal_treebanks_v20. To do this, you could either use the NLTK downloader by running the following commands:
import nltk
nltk.download('punkt')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('reuters')
nltk.download('universal_treebanks_v20')
Alternatively, you could manually download these from the source location - https://github.com/nltk/nltk_data/tree/gh-pages/packages
If these are not installed to one of the standard locations, you will need to set the NLTK_DATA environment variable to specify the location of this NLTK data. Refer to the instructions given at the NLTK website about downloading these model files - https://www.nltk.org/data.html.
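If you keep the NLTK data in a non-standard folder, you can also point NLTK at it from Python instead of setting the environment variable; the path below is only an example:

import nltk

# make a custom data folder visible to NLTK (illustrative path)
nltk.data.path.append("/var/cache/newslookout/nltk_data")

# raises LookupError if the punkt tokenizer data cannot be found
nltk.data.find("tokenizers/punkt")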
All the parameters for the application can be configured via the configuration file. Both the configuration file and the date for which the web scraper is to be run are passed as command-line arguments to the application.
The key parameters that need to be configured are listed below; a sample configuration sketch follows the list:
prefix
data_dir
plugins_dir
plugins_contributed_dir
plugins - list each plugin to be enabled in this section as: plugin01=mod_my_plugin
proxy_url_https
log_level
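Putting these together, a configuration file might contain entries along the following lines. This is only a sketch: the section names and values shown here are placeholders, and the actual layout should follow the configuration file used by your installation (for example, conf/newslookout.conf in the console output shown later):

# placeholder section name - follow the layout of your actual configuration file
[installation]
prefix=/opt/newslookout
data_dir=/var/cache/newslookout
plugins_dir=/opt/newslookout/plugins
plugins_contributed_dir=/opt/newslookout/plugins_contrib
# leave blank if no proxy is required
proxy_url_https=
log_level=INFO

# enable one plugin per line
[plugins]
plugin01=mod_my_plugin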
Installing the application from the wheel or via pip generates a newslookout script placed in your Python environment's scripts folder. This invokes the main method of the application and should be passed the two required arguments: the configuration file and the date for which the application is to be run.
For example:
newslookout -c myconfigfile.conf -d 2020-01-01
In addition, two scripts are provided - one for UNIX-like operating systems and one for Windows. For convenience, you may run these scripts to start the application; they automatically generate the current date and supply it as an argument to the Python application. For small setups, it is best to schedule the script or command line via the UNIX cron scheduler or the Microsoft Windows Task Scheduler. In large enterprise environments, batch-job coordination software such as Control-M, IBM Tivoli, or any other job scheduling framework may be configured to run it for reliable, automated execution.
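As an illustration, a daily crontab entry for the dedicated user might look like the line below; the installation and configuration paths are assumptions, and note that % must be escaped as \% inside a crontab entry:

# run every day at 18:30, passing today's date in YYYY-MM-DD format
30 18 * * * /usr/local/bin/newslookout -c /etc/newslookout/newslookout.conf -d $(date +\%Y-\%m-\%d)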
The application creates a PID (process identifier) file upon startup to prevent multiple instances from being launched at the same time. On startup, it checks whether this file exists; if it does, the application stops. If the application is killed or shuts down abruptly without cleaning up, the PID file will remain and will need to be deleted manually.
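This follows the usual PID-file pattern; a minimal sketch of the idea (not the application's actual code), using the PID file path from the earlier example, would be:

import os
import sys

PID_FILE = "/var/run/newslookout.pid"  # use the path set in your configuration

# refuse to start if a previous instance left its PID file behind
if os.path.exists(PID_FILE):
    sys.exit(f"PID file {PID_FILE} already exists; is another instance running?")

# record this process's identifier so later launches can detect it
with open(PID_FILE, "w") as pid_file:
    pid_file.write(str(os.getpid()))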
The application displays its progress on stdout, for example:
NewsLookout Web Scraping Application, Version 1.9.9
Python version: 3.8.5 (Linux)
Run date: 2021-06-10
Reading configuration from: conf/newslookout.conf
Logging events to file: logs/newslookout.log
Using PID file: data/newslookout.pid
URLs identified: 100%|██████████████████████████████████████████████████████████| 14/14 [1h 48:14<00:00, 0.00 Plugins/s]
Data downloaded: 100%|██████████████████████████████████████████████████████| 1474/1474 [1h 48:14<00:00, 0.23 URLs/s]
Data processed: 38%|███████████████████▌ | 384/1007 [1h 48:14<2h 55:36, 0.06 Files/s]
For a more detailed record of events, refer to the log file. It captures all events with a timestamp and the name of the module that generated each event.
2021-01-01 01:31:50:[INFO]:queue_manager:4360: 13 worker threads available to fetch content.
...
2021-01-01 02:07:51:[INFO]:worker:320: Progress Status: 1117 URLs successfully scraped out of 1334 processed; 702 URLs remain.
...
2021-01-01 03:02:10:[INFO]:queue_manager:5700: Completed fetching data on all worker threads
You can extend the web scraper to cover any additional website you need scraped by taking the template file template_for_plugin.py from the plugins_contrib folder and customising it. Give your custom plugin file the same name as the class it defines. Place it in the plugins_contrib folder (or whichever folder you have set in the configuration file). Next, add the plugin's name to the configuration file. It will be read, instantiated and run automatically by the application on the next startup. For examples of how a plugin can be written, take a look at the code of one of the plugins already implemented.
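The required base class and methods are defined in template_for_plugin.py and are not reproduced here; the fragment below only illustrates the naming convention, using a hypothetical plugin name:

# File: plugins_contrib/mod_my_custom_plugin.py (hypothetical name)
# The class name must match the file name; start from template_for_plugin.py
# for the actual base class and required methods, which are not shown here.
class mod_my_custom_plugin:
    pass

It would then be enabled in the configuration file with an entry such as plugin02=mod_my_custom_plugin (the key name is again only an example).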
The application automatically rotates the log file when it reaches the configured maximum size. The data directory should be monitored, since its size can grow quickly with the data scraped from the web. For enterprise installations, log watching may be enabled to monitor the application's operation by watching for specific event entries in the log file.
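As a simple illustration of both points, a monitoring job could periodically report the data directory size and flag problem entries in the log, assuming warning and error events use the same [LEVEL] format as the log lines shown above; the paths are only examples, and any dedicated log-watch tool can be used instead:

import pathlib

DATA_DIR = pathlib.Path("/var/cache/newslookout")                # data_dir from the configuration
LOG_FILE = pathlib.Path("/var/log/newslookout/newslookout.log")

# total size of the scraped data so far, in megabytes
size_mb = sum(f.stat().st_size for f in DATA_DIR.rglob("*") if f.is_file()) / (1024 * 1024)
print(f"Data directory size: {size_mb:.1f} MB")

# surface warning and error events recorded by the application
for line in LOG_FILE.read_text(errors="ignore").splitlines():
    if "[ERROR]" in line or "[WARNING]" in line:
        print(line)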
If a news portal changes its page structure, the web scraper code in its plugin will need to be updated to continue retrieving information reliably. This requires careful monitoring of the output to check for parsing-related problems.