
NewsLookout Web Scraping Application

The NewsLookout web scraping application gathers and classifies financial news events from public news websites and market data for India. It is a scalable, modular and configurable multi-threaded Python console application. The application is readily extended with custom web-scraping modules via its 'plugin' architecture. Plugins can be added for a variety of tasks, such as scraping additional news sources, performing custom data pre-processing, and running NLP-based news text analytics such as entity recognition, negative event classification, economy trend analysis and industry trend analysis.

Features

There are already a number of Python libraries available for web scraping, so why should you consider using this application for scraping news? The reason is that it has been built specifically for sourcing financial news events and has several useful features. A few notable ones are:

Installation

Install the dependencies using pip:

pip install -r requirements.txt

Install the application via pip:

pip install newslookout

Caution: As a security best practice, it is strongly recommended to run the application under its own separate Operating System level user ID without any special privileges.

Next, create and configure separate locations (directories) for the application's data files, plugins and log file, and set the corresponding parameters in the configuration file.

NLP Data

Download the spaCy model using this command:

python -m spacy download en_core_web_lg

For NLTK, refer to the NLTK website for instructions on downloading the data: https://www.nltk.org/data.html. Specifically, the following datasets need to be downloaded:

  1. reuters
  2. universal_treebanks_v20
  3. maxent_treebank_pos_tagger
  4. punkt

To do this, you can either use the NLTK downloader, running the following commands in a Python session:

import nltk
nltk.download('punkt')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('reuters')
nltk.download('universal_treebanks_v20')

Alternatively, you can manually download these packages from the source location: https://github.com/nltk/nltk_data/tree/gh-pages/packages

If these are not installed in one of the standard locations, you will need to set the NLTK_DATA environment variable to point to the location of the NLTK data; see the instructions on the NLTK website: https://www.nltk.org/data.html.
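
For example, on a UNIX-like system you can set the variable in the shell before launching the application (the directory below is only a placeholder; point it at wherever you placed the downloaded NLTK data):

export NLTK_DATA=/opt/newslookout/nltk_data

On Windows, the equivalent would be set NLTK_DATA=C:\newslookout\nltk_data.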

Configuration

All the parameters of the application can be configured via the configuration file. Both the configuration file and the date for which the web scraper is to be run are passed as command-line arguments to the application.

The key parameters that need to be configured are listed below; a sketch of a sample configuration file follows the list:

  1. Application root folder: prefix
  2. Data directory: data_dir
  3. Plugin directory: plugins_dir
  4. Contributed Plugins: plugins_contributed_dir
  5. Enabled plugins: Add the name of the python file (without file extension) under the plugins section as: plugin01=mod_my_plugin
  6. Network proxy (if any): proxy_url_https
  7. The level of logging: log_level
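
Putting these together, a minimal configuration sketch might look roughly like the following. Only the parameter names listed above are taken from this document; the section headers and all values are illustrative assumptions, so refer to the sample configuration file shipped with the package for the exact layout.

[installation]
prefix=/opt/newslookout
data_dir=/opt/newslookout/data
plugins_dir=/opt/newslookout/plugins
plugins_contributed_dir=/opt/newslookout/plugins_contrib

[plugins]
; enable a plugin by listing its python file name without the extension
plugin01=mod_my_plugin

[network]
proxy_url_https=http://proxy.example.com:3128

[logging]
log_level=INFO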

Usage

Installing the package via pip (or from the wheel) generates a newslookout script. This script invokes the main method of the application and must be passed the two required arguments: the configuration file and the date for which the application is to be run. For example:

newslookout -c myconfigfile.conf -d 2020-01-01

In addition, two scripts are provided, one for UNIX-like systems and one for Windows. For convenience, you may run these scripts to start the application; they automatically generate the current date and pass it as an argument to the Python application. For small setups, it is best to schedule the script or command via the UNIX cron scheduler or the Microsoft Windows Task Scheduler. In large enterprise environments, batch-job coordination software such as Control-M, IBM Tivoli, or any other job-scheduling framework may be configured to run it for reliable, automated execution.
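
As an illustration, a crontab entry along these lines would run the scraper every day at 01:00 (the configuration file path is a placeholder, and the date is generated at run time, mirroring what the bundled shell script does; the % characters are escaped because cron treats an unescaped % as a line separator):

0 1 * * * newslookout -c /etc/newslookout/newslookout.conf -d "$(date +\%Y-\%m-\%d)"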

PID File

The application creates a PID (process identifier) file upon startup to prevent multiple instances from being launched at the same time. On startup, it checks whether this file exists; if it does, the application stops. If the application is killed or shuts down abruptly without cleaning up, the PID file will remain and will need to be deleted manually.
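
For example, assuming the PID file location shown in the console output below (data/newslookout.pid; the actual path depends on your configuration), a stale file left behind by a crash can be removed with:

rm data/newslookout.pid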

Console Display

The application displays its progress on stdout, for example:

NewsLookout Web Scraping Application, Version  1.9.9
Python version:  3.8.5 (Linux)
Run date: 2021-06-10
Reading configuration from: conf/newslookout.conf
Logging events to file: logs/newslookout.log
Using PID file: data/newslookout.pid

URLs identified: 100%|██████████████████████████████████████████████████████████| 14/14 [1h 48:14<00:00, 0.00 Plugins/s]
Data downloaded: 100%|██████████████████████████████████████████████████████| 1474/1474 [1h 48:14<00:00, 0.23    URLs/s]
 Data processed:  38%|███████████████████▌                               |  384/1007 [1h 48:14<2h 55:36, 0.06   Files/s]

Event Log

For a more detailed log of events, refer to the log file. It captures all events with a timestamp and the name of the module that generated each event.

2021-01-01 01:31:50:[INFO]:queue_manager:4360: 13 worker threads available to fetch content.
...
2021-01-01 02:07:51:[INFO]:worker:320: Progress Status: 1117 URLs successfully scraped out of 1334 processed; 702 URLs remain.
...
2021-01-01 03:02:10:[INFO]:queue_manager:5700: Completed fetching data on all worker threads

Customizing and Writing your own Plugins

You can extend the web scraper to cover any additional website you need scraped by taking the template file template_for_plugin.py from the plugins_contrib folder and customising it. Name your custom plugin file with the same name as its class. Place it in the plugins_contrib folder (or whichever folder you have set in the configuration file). Next, add the plugin's name to the configuration file. It will be read, instantiated and run automatically by the application on the next startup.
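
As a rough sketch of the naming convention described above, a module mod_my_plugin.py would define a class of the same name and be enabled in the configuration with plugin01=mod_my_plugin. The class body here is purely hypothetical; the actual base class, attributes and method signatures a plugin must provide are defined in template_for_plugin.py and the bundled plugins.

# File: mod_my_plugin.py -- placed in the plugins_contrib folder.
# Hypothetical skeleton only; copy template_for_plugin.py for the real
# attributes and methods expected by the application.


class mod_my_plugin:
    """Scrapes news articles from a hypothetical example news site."""

    def __init__(self):
        # A plugin would typically identify the site it covers, e.g.:
        self.mainURL = 'https://news.example.com/'

    def extractArticleBody(self, htmlContent: str) -> str:
        # Parse the article text out of this site's raw HTML here.
        # (Method name and signature are illustrative, not the real API.)
        return htmlContent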

Take a look at the code of one of the plugins already implemented for examples of how a plugin can be written.

Maintenance and Monitoring

Data Size

The application automatically rotates the log file when it reaches the configured maximum size. The data directory will need to be monitored, since its size can grow quickly due to the data scraped from the web.

Event Monitoring

For enterprise installations, a log watcher may be set up to monitor the operation of the application by watching for specific event entries in the log file.
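
For instance, a simple check could count error events in the log file (the log file path and the [ERROR] level tag are assumptions based on the log excerpt above; adjust them to your configuration):

grep -c ':\[ERROR\]:' logs/newslookout.log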

The data folder should be monitored for growth in its size.

HTML Parsing Code Updates

If news portals change their page structure, the web scraping code in the respective plugin will need to be updated to continue retrieving information reliably. This requires careful monitoring of the output to check for parsing-related problems.