nirbarazida / Data-mining-project

Data-mining-project

ITC - Data mining project - StackExchange analysis.
Main focus: Stack Overflow

Authors

Nir Barazida and Inbar Shirizly

Goals

This project scrapes websites and analyses the data retrieved.

The analysed websites belong to the group of main StackExchange websites:

  1. https://stackoverflow.com/
  2. https://math.stackexchange.com/
  3. https://askubuntu.com/
  4. https://superuser.com/

The analysis focuses on data retrieved from the top individual users of these websites, according to each website's all-time rank (from the website's establishment until the scraper was run).
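As a rough illustration of where the top-user pages might come from, the sketch below builds the URL of one page of a site's all-time user ranking. The function names, the `SITES` mapping, and the query parameters (`page`, `tab=reputation`, `filter=all`) are assumptions made for this example, not taken from the project's code.

```python
from urllib.parse import urlencode

# The four websites listed above; the short keys are hypothetical.
SITES = {
    "stackoverflow": "https://stackoverflow.com",
    "math": "https://math.stackexchange.com",
    "askubuntu": "https://askubuntu.com",
    "superuser": "https://superuser.com",
}


def top_users_url(site: str, page: int = 1) -> str:
    """Return the URL of one page of a site's all-time user ranking.

    The query parameters are an assumption about the sites' URL scheme.
    """
    query = urlencode({"page": page, "tab": "reputation", "filter": "all"})
    return f"{SITES[site]}/users?{query}"


def top_users_urls(site: str, first_page: int, last_page: int) -> list:
    """Return the ranking URLs for an inclusive range of pages."""
    return [top_users_url(site, page) for page in range(first_page, last_page + 1)]
```

Calling `top_users_urls("askubuntu", 1, 3)` would give three URLs that a crawler could then request one by one.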

Main insights the program attempts to present:

Program work flow

image

The project is currently at milestone 3: the program crawls the input websites and commits the data to the user's MySQL database.
In the future, the program will store the data in a remote database hosted on a server and display the insights in a dashboard.
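The commit step could look roughly like the sketch below. To keep it self-contained it uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table layout, the column names, and the `save_users` helper are hypothetical and do not reflect the project's actual schema.

```python
import sqlite3


def save_users(conn, users):
    """Insert scraped user records and commit them in a single transaction.

    Returns the total number of rows in the (hypothetical) users table.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        " name TEXT NOT NULL,"
        " website TEXT NOT NULL,"
        " reputation INTEGER,"
        " UNIQUE (name, website)"
        ")"
    )
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO users (name, website, reputation) VALUES (?, ?, ?)",
            [(u["name"], u["website"], u["reputation"]) for u in users],
        )
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


# Placeholder records, not real scraped data.
conn = sqlite3.connect(":memory:")
records = [
    {"name": "example_user_1", "website": "stackoverflow", "reputation": 100},
    {"name": "example_user_2", "website": "askubuntu", "reputation": 50},
]
row_count = save_users(conn, records)
```

The `UNIQUE` constraint together with `INSERT OR IGNORE` means that running the scraper twice on the same users does not create duplicate rows. A MySQL version would need a different driver and slightly different SQL.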

Project implementation

The project is implemented with OOP because of its flexibility and the development time it saves. The same project can scrape different websites with only minor changes to account for each website's HTML page, which gives the project a significant advantage.
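A minimal sketch of this idea, assuming a shared base class that holds the parsing logic and small per-website subclasses that override only the site-specific details. The class names and the HTML class attributes (`user-link`, `profile-link`) are hypothetical and are not the project's actual classes or the websites' real markup.

```python
from html.parser import HTMLParser


class UserLinkParser(HTMLParser):
    """Collect the text of <a> tags that carry a given class attribute."""

    def __init__(self, link_class):
        super().__init__()
        self.link_class = link_class
        self.names = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.link_class in classes:
            self._inside = True

    def handle_data(self, data):
        if self._inside and data.strip():
            self.names.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False


class SiteScraper:
    """Shared scraping logic; subclasses override only what differs."""

    base_url = ""
    user_link_class = "user-link"  # hypothetical HTML class

    def parse_user_names(self, html):
        parser = UserLinkParser(self.user_link_class)
        parser.feed(html)
        return parser.names


class StackOverflowScraper(SiteScraper):
    base_url = "https://stackoverflow.com"


class AskUbuntuScraper(SiteScraper):
    base_url = "https://askubuntu.com"
    user_link_class = "profile-link"  # hypothetical per-site difference
```

With this layout, supporting another website means adding one short subclass rather than rewriting the parser.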

To address the diversity problem, we decided to create 3 different classes:

image

Database - ERD

image

Tables description:

Features

Using command line arguments, the user can access the following features:
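A command line interface of this kind could be sketched with `argparse` as shown below. The flag names (`--website`, `--first-user`, `--num-users`) and their defaults are hypothetical and chosen only for illustration; they are not the project's actual arguments.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical options for the scraper."""
    parser = argparse.ArgumentParser(
        description="Scrape top users from a StackExchange website."
    )
    parser.add_argument(
        "--website",
        choices=["stackoverflow", "math", "askubuntu", "superuser"],
        default="stackoverflow",
        help="website to scrape",
    )
    parser.add_argument(
        "--first-user",
        type=int,
        default=1,
        help="rank of the first user to scrape",
    )
    parser.add_argument(
        "--num-users",
        type=int,
        default=10,
        help="number of users to scrape",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--website", "askubuntu", "--num-users", "5"])
```

In a real script, `parse_args()` would be called without arguments so that the values come from the command line.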

Files

Sources

Corey Schafer - Python tutorials:
Web scraping:

Web scraping with Python from A to Z, ITC

Defining schema using ORM:
API: