rickdane / WebGatherer---Scraper-and-Analyzer

Multi-threaded lightweight wrapper framework for scraping web pages and analyzing them using custom workflows
http://www.tumblr.com/blog/webgatherer-open-source-java

-- Note: The core of this project is still under heavy development. I tend to get into periods where I am playing around with the code / hacking, and then later, once it's to my liking, I go back and refactor to make the code cleaner (making sure dependency injection is fully implemented for that code, programming to interfaces where needed, using logical names, etc.). Because of this, I sometimes also put off writing more documentation, so if you're interested in this project please email me directly; for one, that forces me to write more documentation and to summarize the current state of the project, and I'm always interested in networking with others who may share an interest in a project like this.

rick_developer@gmx.com

WebGatherer is a lightweight application coded in Java. This project is meant as a module that a developer can either run directly from within an IDE / from the command line or integrate into a larger project. Most likely this project will never have a UI, but one may be created as a separate project that is loosely coupled to this project.

What it's for:

A tool meant to provide developers with a simple way of creating web scraping / crawling applications that then analyze the web pages to extract specific data. The application has been designed so that the crawler is intelligent (based on custom workflows), meaning it can be directed to guide itself to a specific page, for example, rather than blindly crawling every page of a website. The goal is to minimize the number of pages that must be visited to extract certain information. Long-term, a goal is to incorporate Natural Language Processing into custom workflows to effectively evaluate content and extract just what meets criteria programmed by the developer.

How it works:

WebGatherer is workflow-powered, meaning that it requires writing custom workflows (which are currently implemented directly with reflection, without using a 3rd-party workflow engine). Default workflows exist in the project and more will continue to be added, so it's possible to use the application with minimal custom coding.
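To illustrate the reflection-driven idea, here is a minimal sketch (the workflow class and method names are made up for this example and are not WebGatherer's actual API): a custom workflow is just a plain class whose step method is looked up and invoked by name at runtime, so no third-party workflow engine is needed.

```java
import java.lang.reflect.Method;

public class ReflectionWorkflowExample {

    // Stand-in for a custom workflow supplied by the developer.
    public static class MyGatherWorkflow {
        public void fetchPage(String url) {
            System.out.println("Fetching " + url);
        }
    }

    public static void main(String[] args) throws Exception {
        // The workflow class and its step method are resolved by name
        // at runtime via reflection.
        Object workflow = Class.forName(
                ReflectionWorkflowExample.class.getName() + "$MyGatherWorkflow")
                .getDeclaredConstructor().newInstance();
        Method step = workflow.getClass().getMethod("fetchPage", String.class);
        step.invoke(workflow, "http://example.com");
    }
}
```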

For each "instance" of the application that is run, the developer must program in 3 workflows. The first handles the launching of the other threads and won't be discussed any further here as the default should suffice for most all cases for now. The next one is for a thread that runs the web gathering aspect of the application, which is essentially a web crawler that saves the web pages into memory. The last workflow handles the analyzes / extraction portion of the web page processing.

The application has been built, from its core, to use multiple threads; it relies heavily on queues, which decouple the web page gathering process from the data processing. Nearly all parts of the application are loosely coupled, using interfaces and Inversion of Control (provided through the Guice IoC framework). This enables a developer to easily plug in alternate implementations of certain parts of the application. Currently the application uses Selenium to interact with web pages and save their content. This is a fairly "heavyweight", slow way of grabbing pages, but it provides the advantage of interacting more effectively with web pages that rely heavily on JavaScript, which is why I elected to use it. If the specific task involves primarily non-dynamic web pages, then this could easily be swapped out for a more lightweight method of grabbing pages.
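A minimal sketch of that swap-in idea, assuming a hypothetical PageFetcher interface bound through Guice to a Selenium-backed implementation (the interface and class names are illustrative, not WebGatherer's actual ones):

```java
import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Injector;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class FetcherBindingExample {

    interface PageFetcher {
        String fetch(String url);
    }

    // Heavyweight implementation: drives a real browser (requires a local
    // Firefox / geckodriver), so JavaScript-heavy pages render before their
    // source is captured.
    static class SeleniumPageFetcher implements PageFetcher {
        @Override
        public String fetch(String url) {
            WebDriver driver = new FirefoxDriver();
            try {
                driver.get(url);
                return driver.getPageSource();
            } finally {
                driver.quit();
            }
        }
    }

    static class FetcherModule extends AbstractModule {
        @Override
        protected void configure() {
            // Swapping fetchers only requires changing this one binding.
            bind(PageFetcher.class).to(SeleniumPageFetcher.class);
        }
    }

    public static void main(String[] args) {
        Injector injector = Guice.createInjector(new FetcherModule());
        PageFetcher fetcher = injector.getInstance(PageFetcher.class);
        System.out.println(fetcher.fetch("http://example.com").length());
    }
}
```

Replacing Selenium with a lighter HTTP-based fetcher for static pages would then only mean changing that one module binding.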

Underlying Principles:

At its core, WebGatherer is meant to be lightweight. Traditional web crawlers and scrapers tend to use a brute-force approach, blindly visiting large numbers of pages and possibly saving all of them to a database. Although a workflow could save every page it comes across, this is not the intention (as the default workflows will show), since it is often unnecessary and adds to data storage costs. Rather, the goal is to use the two "primary" workflows to guide the web crawler to data that meets criteria, and then extract, and save, only the relevant data.
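For example, an analysis workflow might keep only a single field from a page and discard the rest, rather than persisting the whole document; the pattern and field below are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SelectiveExtractionExample {
    public static void main(String[] args) {
        String page = "<html><body>Contact: sales@example.com</body></html>";

        // Keep only the e-mail address; the rest of the page is discarded
        // rather than written to storage.
        Matcher m = Pattern.compile("[\\w.+-]+@[\\w.-]+\\.[a-z]{2,}").matcher(page);
        if (m.find()) {
            System.out.println("Saving extracted value: " + m.group());
        }
    }
}
```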

Rather than continuing with a wall of text to get into further detail of how the application works, I will soon be putting up a visual diagram that better shows the collaboration between the threads and the overall premise of how WebGatherer is intended to be used. It is meant to essentially be a framework, so it's really up to the developer's creativity with the workflows to make it do something great.